# Self-Supervised GANs with Label Augmentation

Liang Hou¹,³, Huawei Shen¹,³, Qi Cao¹, Xueqi Cheng²,³
¹Data Intelligence System Research Center, Institute of Computing Technology, Chinese Academy of Sciences
²CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences
³University of Chinese Academy of Sciences
{houliang17z,shenhuawei,caoqi,cxq}@ict.ac.cn

Abstract

Recently, transformation-based self-supervised learning has been applied to generative adversarial networks (GANs) to mitigate catastrophic forgetting in the discriminator by introducing a stationary learning environment. However, the separate self-supervised tasks in existing self-supervised GANs cause a goal inconsistent with generative modeling, because their self-supervised classifiers are agnostic to the generator distribution. To address this problem, we propose a novel self-supervised GAN that unifies the GAN task with the self-supervised task by augmenting the GAN labels (real or fake) via self-supervision of data transformation. Specifically, the original discriminator and self-supervised classifier are unified into a label-augmented discriminator that predicts the augmented labels, and is therefore aware of both the generator distribution and the data distribution under every transformation; it can then provide the discrepancy between them to optimize the generator. Theoretically, we prove that the optimal generator converges to replicate the real data distribution. Empirically, we show that the proposed method significantly outperforms previous self-supervised and data-augmentation GANs on both generative modeling and representation learning across benchmark datasets.

1 Introduction

Generative adversarial networks (GANs) [14] have developed rapidly in synthesizing realistic images in recent years [4, 23, 24].
Conventional GANs consist of a generator and a discriminator, which are trained adversarially in a minimax game. The discriminator attempts to distinguish the real data from the fake data synthesized by the generator, while the generator aims to confuse the discriminator in order to reproduce the real data distribution. However, the continuous evolution of the generator results in a non-stationary learning environment for the discriminator. In such a non-stationary environment, the single discriminative signal (real or fake) provides unstable and limited information to the discriminator [37, 7, 70, 57]. Therefore, the discriminator may encounter catastrophic forgetting [25, 6, 55], leading to training instability of GANs and thereby mode collapse [52, 61, 33, 36] of the generator when learning high-dimensional, complex data distributions.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

To alleviate the forgetting problem of the discriminator, transformation-based self-supervised tasks such as rotation recognition [13] have been applied to GANs [6]. Self-supervised tasks are designed to predict constructed pseudo-labels of data, establishing a stationary learning environment, as the distribution of training data for this task does not change during training. The discriminator, enhanced by the stationary self-supervision, is able to learn stable representations that resist the forgetting problem, and can thus provide continuously informative gradients to obtain a well-performing generator. After its empirical success, the self-supervised GAN was theoretically analyzed and partially improved by a new self-supervised task [57]. However, these self-supervised tasks cause a goal inconsistent with generative modeling, which aims to learn the real data distribution. In addition, the hyper-parameters that trade off between the original GAN task and the additional self-supervised task bring extra workload for tuning on different datasets.
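The rotation-recognition pretext task mentioned above can be illustrated with a minimal NumPy sketch (an illustrative toy, not the implementation used in the cited works): each image is rotated by 0, 90, 180, and 270 degrees, and the pseudo-label is the rotation index the auxiliary classifier must predict.

```python
import numpy as np

def make_rotation_batch(images):
    """Given a batch of HxW images, return all four rotated copies
    together with pseudo-labels k in {0,1,2,3} (meaning k * 90 degrees)."""
    rotated, labels = [], []
    for k in range(4):
        for img in images:
            rotated.append(np.rot90(img, k))
            labels.append(k)
    return np.stack(rotated), np.array(labels)

# Toy usage: 2 random 8x8 "images" yield 8 labeled training examples.
imgs = np.random.rand(2, 8, 8)
x_aug, y = make_rotation_batch(imgs)
assert x_aug.shape == (8, 8, 8)
assert sorted(set(int(v) for v in y)) == [0, 1, 2, 3]
# Rotating four times (360 degrees) recovers the original image.
assert np.allclose(np.rot90(imgs[0], 4), imgs[0])
```

Because the pseudo-labels come from the fixed transformation set rather than from the evolving generator, the classifier's training distribution is stationary throughout GAN training.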
To sum up, we still lack a proper approach to overcome catastrophic forgetting and improve the training stability of GANs via self-supervised learning.

In this paper, we first point out that the reason for the undesired goal in existing self-supervised GANs is that the self-supervised classifiers that train the generator are unfortunately agnostic to the generator distribution. In other words, their classifiers cannot provide the discrepancy between the real and generated data distributions to teach the generator to approximate the real data distribution. This issue originates from the fact that these classifiers are trained to recognize the pseudo-labels of only the transformed real data but not the transformed generated data. Inspired by this observation, we realize that the self-supervised classifier should be trained to discriminatively recognize the corresponding pseudo-labels of both the transformed generated data and the transformed real data. In this case, the classifier subsumes the task of the original discriminator, so the original discriminator can even be omitted. Specifically, we augment the GAN labels (real or fake) via self-supervision of data transformation to construct augmented labels for a novel discriminator, which attempts to recognize the performed transformation of transformed real and fake data while simultaneously distinguishing between real and fake. By construction, this formulates a single multi-class classification task for the label-augmented discriminator, which differs from the single binary classification task in standard GANs and the two separate classification tasks in existing self-supervised GANs. As a result, our method does not require any hyper-parameter to trade off between the original GAN objective and an additional self-supervised objective, unlike existing self-supervised GANs.
Moreover, we prove that the optimal label-augmented discriminator is aware of both the generator distribution and the data distribution under different transformations, and thus can provide the discrepancy between them to optimize the generator. Theoretically, the generator can replicate the real data distribution at the Nash equilibrium under mild assumptions that are easily satisfied, which resolves the convergence problem of previous self-supervised GANs. The proposed method seamlessly combines two unsupervised learning paradigms, generative adversarial networks and self-supervised learning, so that stable generative modeling of the generator and superior representation learning of the discriminator reinforce each other. For generative modeling, the self-supervised subtask resists catastrophic forgetting in the discriminator, so that the discriminator can provide continuously informative feedback to the generator for learning the real data distribution. For representation learning, the augmented-label prediction task encourages the discriminator to capture the semantic information carried by both the real/fake discriminative signals and the self-supervised signals, yielding comprehensive data representations. Experimental results on various datasets demonstrate the superiority of the proposed method over competitive methods on both generative modeling and representation learning.

2 Preliminaries

2.1 Generative Adversarial Networks

Learning the underlying data distribution $P_d$, which is hard to specify but easy to sample from, lies at the heart of generative modeling. In the family of deep latent-variable generative models, the generator $G: \mathcal{Z} \to \mathcal{X}$ maps a latent code $z \in \mathcal{Z}$, endowed with a tractable prior $P_z$ (e.g., $\mathcal{N}(0, I)$), to a data point $x \in \mathcal{X}$, and thereby induces a generator distribution $P_g = G_{\#} P_z$.
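The pushforward construction $P_g = G_{\#} P_z$ can be made concrete with a toy generator. The sketch below (our own illustration, not from the paper) uses a hypothetical linear map $G(z) = Az + b$, for which the induced generator distribution is exactly the Gaussian $\mathcal{N}(b, AA^\top)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tractable prior P_z = N(0, I) in a 2-D latent space.
z = rng.standard_normal((100000, 2))

# A toy deterministic generator G(z) = A z + b. Pushing P_z through G
# induces the generator distribution P_g = G_# P_z = N(b, A A^T).
A = np.array([[2.0, 0.0], [1.0, 0.5]])
b = np.array([1.0, -1.0])
x = z @ A.T + b

# Empirical moments of samples from P_g match the induced Gaussian.
assert np.allclose(x.mean(axis=0), b, atol=0.05)
assert np.allclose(np.cov(x.T), A @ A.T, atol=0.1)
```

With a neural-network generator the induced density is no longer available in closed form, which is precisely why GANs estimate the divergence to $P_d$ through a discriminator instead.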
To learn the data distribution through the generator (i.e., $P_g = P_d$), one can minimize the Jensen-Shannon (JS) divergence between the data distribution and the generator distribution (i.e., $\min_G D_{\mathrm{JS}}(P_d \| P_g)$). To estimate this divergence, generative adversarial networks (GANs) [14] leverage a discriminator $D: \mathcal{X} \to [0, 1]$ that distinguishes the real data from the generated ones. Formally, the objective function for the discriminator and generator of the standard minimax GAN [14] is defined as follows:

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim P_d}[\log D(1|x)] + \mathbb{E}_{x \sim P_g}[\log D(0|x)], \quad (1)$$

where $D(1|x) = 1 - D(0|x) \in [0, 1]$ indicates the probability that a data point $x$ is classified as real by the discriminator. While theoretical guarantees hold in standard GANs, the non-stationary learning environment caused by the evolution of the generator contributes to catastrophic forgetting in the discriminator [6]: the discriminator forgets previously learned knowledge by over-fitting the current data, which may in turn hurt the generation performance of the generator.

2.2 Self-Supervised GAN

To mitigate catastrophic forgetting of the discriminator, the self-supervised GAN (SSGAN) [6] introduced an auxiliary self-supervised rotation-recognition task for a classifier $C: \tilde{\mathcal{X}} = T(\mathcal{X}) \to \{1, 2, \ldots, K\}$ that shares parameters with the discriminator, and forced the generator to collaborate with the classifier on the rotation-recognition task. Formally, the objective functions of SSGAN are given by:

$$\max_{D, C} V(G, D) + \lambda_d \, \mathbb{E}_{x \sim P_d, T_k \sim \mathcal{T}}[\log C(k|T_k(x))], \quad (2)$$

$$\min_G V(G, D) - \lambda_g \, \mathbb{E}_{x \sim P_g, T_k \sim \mathcal{T}}[\log C(k|T_k(x))], \quad (3)$$

where $\lambda_d$ and $\lambda_g$ are two hyper-parameters that balance the original GAN task and the additional self-supervised task, and $\mathcal{T} = \{T_k\}_{k=1}^K$ is the set of deterministic data transformations, e.g., rotations ($\mathcal{T} = \{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$). Throughout this paper, data transformations are sampled uniformly by default (i.e., $p(T_k) = \frac{1}{K}, \forall T_k \in \mathcal{T}$), unless otherwise specified.

Theorem 1 ([57]).
Given the optimal classifier $C^*(k|\tilde{x}) = \frac{p_d^{T_k}(\tilde{x})}{\sum_{j=1}^K p_d^{T_j}(\tilde{x})}$ of SSGAN, at the equilibrium point, maximizing the self-supervised task for the generator is equivalent to:

$$\max_G \mathbb{E}_{T_k \sim \mathcal{T}} \mathbb{E}_{\tilde{x} \sim P_g^{T_k}}\left[\log \frac{p_d^{T_k}(\tilde{x})}{\sum_{j=1}^K p_d^{T_j}(\tilde{x})}\right], \quad (4)$$

where $P_g^{T_k}$ and $P_d^{T_k}$ denote the distributions of transformed generated or real data $\tilde{x} \in \tilde{\mathcal{X}}$ under the transformation $T_k$, with densities $p_g^{T_k}(\tilde{x}) = \int \delta(\tilde{x} - T_k(x)) p_g(x) \, \mathrm{d}x$ and $p_d^{T_k}(\tilde{x}) = \int \delta(\tilde{x} - T_k(x)) p_d(x) \, \mathrm{d}x$, respectively. All proofs of theorems and propositions in this paper, including this one (for completeness), are deferred to Appendix A.

Theorem 1 reveals that the self-supervised task of SSGAN enforces the generator to produce only rotation-detectable images rather than the whole real image distribution [57], causing a goal inconsistent with generative modeling, which faithfully learns the real data distribution. Consequently, SSGAN still requires the original GAN task, which forms a multi-task learning framework that needs hyper-parameters to trade off the two separate tasks.

2.3 SSGAN with Multi-Class Minimax Game

To improve the self-supervised task in SSGAN, SSGAN-MS [57] proposed a multi-class minimax self-supervised task by adding a fake class $0$ to a label-extended classifier $C^+: \tilde{\mathcal{X}} \to \{0, 1, 2, \ldots, K\}$, where the generator and the classifier compete on the self-supervised task. Formally, the objective functions of SSGAN-MS are defined as follows:

$$\max_{D, C^+} V(G, D) + \lambda_d \left(\mathbb{E}_{x \sim P_d, T_k \sim \mathcal{T}}[\log C^+(k|T_k(x))] + \mathbb{E}_{x \sim P_g, T_k \sim \mathcal{T}}[\log C^+(0|T_k(x))]\right), \quad (5)$$

$$\min_G V(G, D) - \lambda_g \left(\mathbb{E}_{x \sim P_g, T_k \sim \mathcal{T}}[\log C^+(k|T_k(x))] - \mathbb{E}_{x \sim P_g, T_k \sim \mathcal{T}}[\log C^+(0|T_k(x))]\right). \quad (6)$$

Theorem 2.
Given the optimal classifier $C^{+*}(k|\tilde{x}) = \frac{p(T_k) p_d^{T_k}(\tilde{x})}{p_g^{\mathcal{T}}(\tilde{x}) + \sum_{j=1}^K p(T_j) p_d^{T_j}(\tilde{x})}$ and $C^{+*}(0|\tilde{x}) = \frac{p_g^{\mathcal{T}}(\tilde{x})}{p_g^{\mathcal{T}}(\tilde{x}) + \sum_{j=1}^K p(T_j) p_d^{T_j}(\tilde{x})}$ of SSGAN-MS, at the equilibrium point, maximizing the self-supervised task for the generator is equivalent to¹:

$$\max_G -D_{\mathrm{KL}}(P_g^{\mathcal{T}} \| P_d^{\mathcal{T}}) + \mathbb{E}_{T_k \sim \mathcal{T}} \mathbb{E}_{\tilde{x} \sim P_g^{T_k}}\left[\log \frac{p(T_k) p_d^{T_k}(\tilde{x})}{p_d^{\mathcal{T}}(\tilde{x})}\right], \quad (7)$$

where $P_g^{\mathcal{T}}$ and $P_d^{\mathcal{T}}$ represent the mixture distributions of transformed generated or real data $\tilde{x} \in \tilde{\mathcal{X}}$, with densities $p_g^{\mathcal{T}}(\tilde{x}) = \sum_{k=1}^K p(T_k) p_g^{T_k}(\tilde{x})$ and $p_d^{\mathcal{T}}(\tilde{x}) = \sum_{k=1}^K p(T_k) p_d^{T_k}(\tilde{x})$, respectively.

¹Note that our Theorem 2 corrects the wrong version in the SSGAN-MS paper [57], where the authors mistakenly regard $p_g^{\mathcal{T}}(\tilde{x}) = \sum_{k=1}^K p(T_k) p_g^{T_k}(\tilde{x})$ as $p_g^{T_k}(\tilde{x})$ in their proof. Please see Appendix A.2 for details.

Figure 1: Schematics of the discriminators and classifiers of the competitive methods. We systematically divide these methods into four categories: Standard (GAN), Data Augmentation (DAGAN and DAGAN-MD), Multi-task Learning (SSGAN and SSGAN-MS), and Joint Learning (SSGAN-LA).

Our Theorem 2 states that the improved self-supervised task for the generator of SSGAN-MS, under the optimal classifier, retains the inconsistent goal of SSGAN that enforces the generator to produce typical rotation-detectable images rather than the entire real image distribution. In addition, minimizing $D_{\mathrm{KL}}(P_g^{\mathcal{T}} \| P_d^{\mathcal{T}})$ cannot guarantee that the generator recovers the real data distribution under the data transformation setting used here (see Theorem 4). Due to these issues, SSGAN-MS also requires the original GAN task to approximate the real data distribution. We regard the learning paradigms of SSGAN-MS and SSGAN as multi-task learning frameworks (see Figure 1).

3 Self-Supervised GANs with Label Augmentation

We note that the self-supervised classifiers of existing self-supervised GANs are agnostic to the density of the generator distribution $p_g(x)$ (or $p_g^{T_1}(\tilde{x})$) (see $C^*(k|\tilde{x})$ in Theorem 1 and $C^{+*}(k|\tilde{x})$ in Theorem 2).
In other words, their classifiers do not know how far the current generator distribution $P_g$ is from the data distribution $P_d$, making them unable to provide informative guidance for matching the generator distribution to the data distribution. Consequently, existing self-supervised GANs suffer from an undesired goal for the generator when learning from generator-distribution-agnostic self-supervised classifiers. One naive solution is to simply discard the self-supervised task for the generator. However, the generator then cannot take full advantage of the direct benefits of self-supervised learning (see Appendix D). In this work, our goal is to make full use of self-supervision to overcome catastrophic forgetting of the discriminator and enhance the generation performance of the generator without introducing any undesired goal.

We argue that the generator-agnostic issue of previous self-supervised classifiers originates from the fact that those classifiers are trained to recognize the corresponding angles of only the rotated real images but not the rotated fake images. Motivated by this understanding, we propose to let the self-supervised classifier discriminatively recognize the corresponding angles of both the rotated real and fake images, so that it can be aware of the generator distribution and the data distribution, like the discriminator. In particular, the new classifier subsumes the role of the original discriminator, so the discriminator can be omitted, although the convergence speed might be slowed down (see Appendix C for performance with the original discriminator). Specifically, we augment the original GAN labels (real or fake) via the self-supervision of data transformation to construct augmented labels, and develop a novel label-augmented discriminator $D_{\mathrm{LA}}: \tilde{\mathcal{X}} \to \{1, 2, \ldots, K\} \times \{0, 1\}$ ($2K$ classes) to predict the augmented labels.
By construction, this formulation forms a single multi-class classification task for the label-augmented discriminator, which is different from the single binary classification task in standard GANs and the two separate classification tasks in existing self-supervised GANs (see Figure 1). Formally, the objective function for the label-augmented discriminator of the proposed self-supervised GAN with label augmentation, dubbed SSGAN-LA, is defined as follows:

$$\max_{D_{\mathrm{LA}}} \mathbb{E}_{x \sim P_d, T_k \sim \mathcal{T}}[\log D_{\mathrm{LA}}(k, 1|T_k(x))] + \mathbb{E}_{x \sim P_g, T_k \sim \mathcal{T}}[\log D_{\mathrm{LA}}(k, 0|T_k(x))], \quad (8)$$

where $D_{\mathrm{LA}}(k, 1|T_k(x))$ (resp. $D_{\mathrm{LA}}(k, 0|T_k(x))$) outputs the probability that a transformed data point $T_k(x)$ is jointly classified as the augmented label of real (resp. fake) and the $k$-th transformation.

Proposition 1. For any fixed generator, given a data point $\tilde{x} \in \tilde{\mathcal{X}}$ drawn from the mixture distribution of transformed data, the optimal label-augmented discriminator of SSGAN-LA has the form:

$$D_{\mathrm{LA}}^*(k, 1|\tilde{x}) = \frac{p_d^{T_k}(\tilde{x})}{\sum_{j=1}^K (p_d^{T_j}(\tilde{x}) + p_g^{T_j}(\tilde{x}))}, \qquad D_{\mathrm{LA}}^*(k, 0|\tilde{x}) = \frac{p_g^{T_k}(\tilde{x})}{\sum_{j=1}^K (p_d^{T_j}(\tilde{x}) + p_g^{T_j}(\tilde{x}))}. \quad (9)$$

Proposition 1 shows that the optimal label-augmented discriminator of SSGAN-LA is aware of the densities of both the real and generated data distributions under every transformation (i.e., $p_d^{T_k}(\tilde{x})$ and $p_g^{T_k}(\tilde{x})$), and thus can provide the discrepancy between the data distribution and the generator distribution under every transformation (i.e., $p_d^{T_k}(\tilde{x}) / p_g^{T_k}(\tilde{x}) = D_{\mathrm{LA}}^*(k, 1|\tilde{x}) / D_{\mathrm{LA}}^*(k, 0|\tilde{x})$) to optimize the generator. Consequently, we propose to optimize the generator of SSGAN-LA through this discrepancy, using the objective function formulated as follows:

$$\max_G \mathbb{E}_{x \sim P_g, T_k \sim \mathcal{T}}[\log D_{\mathrm{LA}}(k, 1|T_k(x))] - \mathbb{E}_{x \sim P_g, T_k \sim \mathcal{T}}[\log D_{\mathrm{LA}}(k, 0|T_k(x))]. \quad (10)$$

Theorem 3. The objective function for the generator of SSGAN-LA, given the optimal label-augmented discriminator, boils down to:

$$\min_G \mathbb{E}_{T_k \sim \mathcal{T}}\left[D_{\mathrm{KL}}(P_g^{T_k} \| P_d^{T_k})\right]. \quad (11)$$

The global minimum is achieved if and only if $P_g = P_d$ when there exists an invertible transformation $T_k \in \mathcal{T}$.
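The label-augmentation bookkeeping behind objectives (8) and (10) can be sketched in a few lines of NumPy. This is our own illustration, not the authors' code: the helper names, the toy logits, and the explicit 2K-class index mapping are assumptions for exposition, with the discriminator stood in for by raw classifier logits.

```python
import numpy as np

K = 4  # number of transformations (e.g., rotations by k*90 degrees)

def aug_label(k, is_real):
    """Map (transformation index k in 0..K-1, real/fake bit) to one of 2K classes."""
    return 2 * k + int(is_real)

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def d_loss(logits_real, ks_real, logits_fake, ks_fake):
    """Cross-entropy on the augmented labels, i.e., the negative of Eq. (8)."""
    lp_r = log_softmax(logits_real)
    lp_f = log_softmax(logits_fake)
    nll_r = -lp_r[np.arange(len(ks_real)), [aug_label(k, True) for k in ks_real]]
    nll_f = -lp_f[np.arange(len(ks_fake)), [aug_label(k, False) for k in ks_fake]]
    return nll_r.mean() + nll_f.mean()

def g_loss(logits_fake, ks_fake):
    """Negative of Eq. (10): the generator maximizes log D(k,1|.) - log D(k,0|.)."""
    lp = log_softmax(logits_fake)
    idx = np.arange(len(ks_fake))
    real_cls = [aug_label(k, True) for k in ks_fake]
    fake_cls = [aug_label(k, False) for k in ks_fake]
    return -(lp[idx, real_cls] - lp[idx, fake_cls]).mean()

# Toy check: a batch of 3 transformed samples with random 2K-way logits.
rng = np.random.default_rng(0)
logits = rng.standard_normal((3, 2 * K))
ks = [0, 2, 3]
assert np.isfinite(d_loss(logits, ks, logits, ks))
# Logits that put all mass on the "real" classes give a lower generator loss.
confident = np.full((3, 2 * K), -10.0)
confident[np.arange(3), [aug_label(k, True) for k in ks]] = 10.0
assert g_loss(confident, ks) < g_loss(logits, ks)
```

Note that a single 2K-way softmax head suffices; no separate binary discriminator or trade-off hyper-parameter appears in either loss.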
Theorem 3 shows that the generator of SSGAN-LA is encouraged to approximate the real data distribution under different transformations, as measured by the reverse KL divergence, without any goal inconsistent with generative modeling, unlike previous self-supervised GANs. In particular, the generator is guaranteed to replicate the real data distribution when one of the transformations is invertible, such as the identity or the 90-degree rotation transformation. In summary, SSGAN-LA seamlessly incorporates self-supervised learning into the original GAN objective to improve training stability on the premise of faithfully learning the real data distribution.

Compared with standard GANs, the transformation-prediction task, especially on the real samples, establishes stationary self-supervised learning environments in SSGAN-LA that prevent the discriminator from catastrophic forgetting. In addition, the multiple discriminative signals under different transformations are capable of providing diverse feedback to obtain a well-performing generator. Compared with previous self-supervised GANs, besides the unbiased learning objective for the generator, the discriminator of SSGAN-LA is able to learn better representations. In particular, the discriminator utilizes not only the real data but also the generated data as augmented data to obtain robust transformation-recognition ability, and distinguishes between real and generated data from multiple perspectives via different transformations to obtain enhanced discrimination ability. These two abilities enable the discriminator to learn more meaningful representations than existing self-supervised GANs, which could in turn promote the generation performance of the generator.

4 The Importance of Self-Supervision

The proposed SSGAN-LA might benefit from two potential aspects, i.e., the augmented data and the corresponding self-supervision.
To verify the importance of the latter, we introduce as competitors two ablation models that do not explicitly receive feedback from self-supervision.

4.1 Data Augmentation GAN

We first consider a naive data augmentation GAN (DAGAN) that completely ignores self-supervision. Formally, the objective function of DAGAN is formulated as follows:

$$\min_G \max_D \mathbb{E}_{x \sim P_d, T_k \sim \mathcal{T}}[\log D(1|T_k(x))] + \mathbb{E}_{x \sim P_g, T_k \sim \mathcal{T}}[\log D(0|T_k(x))]. \quad (12)$$

This formulation is mathematically equivalent to DiffAugment [66] and IDA [59]. We regard this approach as an ablation model of SSGAN-LA, as it trains with only the GAN labels and ignores the self-supervised labels. Next, we show the indispensability of self-supervision.

Theorem 4. At the equilibrium point of DAGAN, the optimal generator implies $P_g^{\mathcal{T}} = P_d^{\mathcal{T}}$. However, if $(\mathcal{T}, \circ)$ forms a group and $T_k \in \mathcal{T}$ is uniformly sampled, then the probability that the optimal generator replicates the real data distribution is $P(P_g = P_d \mid P_g^{\mathcal{T}} = P_d^{\mathcal{T}}) = 0$.

Theorem 4 suggests that the formulation of DAGAN is insufficient for the generator to recover the real data distribution, as $(\{0^\circ, 90^\circ, 180^\circ, 270^\circ\}, \circ)$ forms a cyclic group and the transformation is uniformly sampled by default. Intuitively, if transformations are not identified by their corresponding self-supervision and can be represented by others, then the transformed data will be indistinguishable from the original data, so transformations may leak into the generator. In more detail, the generator may converge to an arbitrary mixture distribution of transformed real data (see Appendix A.5). Note that all theoretical results in this paper are not limited to rotation but hold for many other types of data transformation, e.g., RGB channel permutation and patch shuffling. Nonetheless, we follow the practice of SSGAN and use rotation as the data transformation for fairness. In summary, data augmentation GANs may even suffer from a convergence problem when self-supervision is not utilized.
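The group condition in Theorem 4 can be checked concretely for the rotation set: composing 90-degree rotations is addition modulo 4, a cyclic group. The toy check below (our own illustration, not from the paper) also shows the resulting augmentation leaking: a generator that outputs rotated copies of the real data produces the same mixture of rotations as the true data.

```python
import numpy as np
from itertools import product

# Rotations by k*90 degrees compose as addition modulo 4: a cyclic group.
rot = list(range(4))
assert all((a + b) % 4 in rot for a, b in product(rot, rot))  # closure
assert 0 in rot                                               # identity
assert all(any((a + b) % 4 == 0 for b in rot) for a in rot)   # inverses

# Augmentation leaking: if the generator outputs 90-degree rotated "real"
# images, the uniform mixture of its further-rotated copies is the SAME set
# of rotations as for the true data, so DAGAN's equilibrium condition
# P_g^T = P_d^T cannot tell the two generators apart.
img = np.arange(9).reshape(3, 3)
true_mix = {np.rot90(img, k).tobytes() for k in range(4)}
leaky_mix = {np.rot90(np.rot90(img, 1), k).tobytes() for k in range(4)}
assert true_mix == leaky_mix
```

Attaching a distinct pseudo-label to each rotation, as SSGAN-LA does, breaks exactly this indistinguishability.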
4.2 DAGAN with Multiple Discriminators

To verify the importance of self-supervision while eliminating the convergence problem of DAGAN, we introduce another data augmentation GAN that has the same convergence point as SSGAN-LA. DAG [59] proposed a data augmentation GAN with multiple discriminators $\{D_k\}_{k=1}^K$ (we refer to it as DAGAN-MD throughout this paper), which are indexed by the self-supervision and share all layers but the head with each other, to solve the convergence issue of DAGAN. Formally, the objective function of DAGAN-MD is formulated as:

$$\min_G \max_{\{D_k\}_{k=1}^K} \mathbb{E}_{x \sim P_d, T_k \sim \mathcal{T}}[\log D_k(1|T_k(x))] + \mathbb{E}_{x \sim P_g, T_k \sim \mathcal{T}}[\log D_k(0|T_k(x))]. \quad (13)$$

We regard DAGAN-MD as an ablation model of SSGAN-LA, since DAGAN-MD takes the self-supervised signal as input while SSGAN-LA treats it as a target (see Figure 1). It has been proved that the generator of DAGAN-MD can approximate the real data distribution at the optimum [59]. However, the lack of self-supervised signals as supervision disadvantages DAGAN-MD in two ways. On the one hand, the discriminators cannot learn stable representations that resist catastrophic forgetting without learning from self-supervised tasks. On the other hand, the discriminators cannot learn the high-level representations contained in the specific semantics of different transformations. Specifically, these discriminators presumably enforce invariance in the shared layers [28], which could hurt representation learning, as transformations modify the semantics of data. As we emphasized above, the more useful information the discriminator learns, the better the guidance provided to the generator, and the better the optimization result reached by the generator. Accordingly, the performance of generative modeling would be affected.

5 Experiments

Our code is available at https://github.com/houliangict/ssgan-la.
5.1 SSGAN-LA Faithfully Learns the Real Data Distribution

We experiment on a synthetic dataset to intuitively verify whether SSGAN, SSGAN-MS, and SSGAN-LA can accurately match the real data distribution. The real data distribution is a one-dimensional normal distribution $P_d = \mathcal{N}(0, 1)$. The set of transformations is $\mathcal{T} = \{T_k\}_{k=1}^{K=4}$, with the $k$-th transformation $T_k(x) := x + 2(k - 1)$ ($T_1$ is the identity transformation). We assume that the distributions of real data under different transformations have non-negligible overlaps. The generator and discriminator networks are implemented as multi-layer perceptrons with a hidden size of 10 and Tanh non-linearity. Figure 2 shows the density estimates of the target and generated data obtained by kernel density estimation [44] from 10k samples. SSGAN learns a biased distribution, as its self-supervised task forces the generator to produce transformation-classifiable data. SSGAN-MS slightly mitigates this issue, but still cannot accurately approximate the real data distribution. The proposed SSGAN-LA successfully replicates the data distribution and achieves the best maximum mean discrepancy (MMD) [15] results, as reported in brackets. In general, the experimental results on synthetic data accord well with the theoretical results for the self-supervised methods.

Figure 2: Learned distributions on one-dimensional synthetic data: (a) Real Data, (b) SSGAN (0.4367), (c) SSGAN-MS (0.2551), (d) SSGAN-LA (0.0035). The blue line at the left-most position is the untransformed real or generated data distribution. The numbers in brackets are the MMD [15] scores.

Figure 3: Generated samples on CelebA: (a) DAGAN, (b) SSGAN-LA. DAGAN suffers from the augmentation-leaking issue.

5.2 SSGAN-LA Resolves the Augmentation-Leaking Issue

Figure 3 shows images randomly generated by DAGAN and SSGAN-LA on the CelebA dataset [34].
Apparently, the generator of DAGAN converges to generating rotated images, validating Theorem 4: DAGAN cannot guarantee recovery of the original data distribution under the rotation-based data augmentation setting used here. The proposed SSGAN-LA resolves this augmentation-leaking problem, which we attribute to explicitly identifying the differently rotated images by predicting the self-supervised pseudo signals (i.e., rotation angles).

5.3 Comparison of Sample Quality

We conduct experiments on three real-world datasets: CIFAR-10 [26], STL-10 [8], and Tiny-ImageNet [27]. We implement all methods based on unconditional BigGAN [4] without accessing human-annotated labels. We defer the detailed experimental settings to Appendix B due to limited space. To evaluate the sample quality of the methods, we adopt two widely used evaluation metrics: Fréchet Inception Distance (FID) [17] and Inception Score (IS) [49]. We follow the practice of BigGAN and randomly sample 50k images to calculate the IS and FID scores. As shown in Table 1, SSGAN and SSGAN-MS surpass GAN due to the auxiliary self-supervised tasks that encourage the discriminators to learn stable representations that resist the forgetting issue. As DAGAN suffers from the augmentation-leaking problem [22] in the default setting, we up-weight the probability of the identity transformation for DAGAN (named DAGAN+) to avoid this undesirable problem. DAGAN+ and DAGAN-MD make considerable improvements over GAN on CIFAR-10 and STL-10 but show a substantial degradation on Tiny-ImageNet, indicating

Table 1: FID (↓) and IS (↑) comparison of methods on CIFAR-10, STL-10, and Tiny-ImageNet.
| Dataset | Metric | GAN | SSGAN | SSGAN-MS | DAGAN+ | DAGAN-MD | SSGAN-LA |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | FID (↓) | 10.83 | 7.52 | 7.08 | 10.01 | 7.80 | 5.87 |
| CIFAR-10 | IS (↑) | 8.42 | 8.29 | 8.45 | 8.10 | 8.31 | 8.57 |
| STL-10 | FID (↓) | 20.15 | 16.84 | 16.46 | 16.42 | 16.16 | 14.58 |
| STL-10 | IS (↑) | 10.25 | 10.36 | 10.40 | 10.27 | 10.31 | 10.61 |
| Tiny-ImageNet | FID (↓) | 31.01 | 30.09 | 25.76 | 52.54 | 39.57 | 23.69 |
| Tiny-ImageNet | IS (↑) | 9.80 | 10.21 | 10.57 | 7.55 | 8.44 | 11.14 |

Table 2: Accuracy (↑) comparison of methods on CIFAR-10, CIFAR-100, and Tiny-ImageNet based on learned representations extracted from each residual block of the discriminator.

| Dataset | Block | GAN | SSGAN | SSGAN-MS | DAGAN+ | DAGAN-MD | SSGAN-LA |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | Block1 | 0.641 | 0.636 | 0.640 | 0.639 | 0.633 | 0.637 |
| CIFAR-10 | Block2 | 0.729 | 0.722 | 0.728 | 0.726 | 0.720 | 0.741 |
| CIFAR-10 | Block3 | 0.765 | 0.769 | 0.770 | 0.764 | 0.746 | 0.782 |
| CIFAR-10 | Block4 | 0.769 | 0.784 | 0.787 | 0.772 | 0.768 | 0.803 |
| CIFAR-100 | Block1 | 0.323 | 0.320 | 0.322 | 0.323 | 0.324 | 0.318 |
| CIFAR-100 | Block2 | 0.456 | 0.452 | 0.461 | 0.460 | 0.443 | 0.468 |
| CIFAR-100 | Block3 | 0.494 | 0.493 | 0.502 | 0.493 | 0.478 | 0.511 |
| CIFAR-100 | Block4 | 0.504 | 0.513 | 0.521 | 0.519 | 0.492 | 0.543 |
| Tiny-ImageNet | Block1 | 0.142 | 0.138 | 0.132 | 0.155 | 0.135 | 0.129 |
| Tiny-ImageNet | Block2 | 0.221 | 0.224 | 0.214 | 0.184 | 0.184 | 0.209 |
| Tiny-ImageNet | Block3 | 0.251 | 0.277 | 0.280 | 0.211 | 0.231 | 0.283 |
| Tiny-ImageNet | Block4 | 0.325 | 0.346 | 0.344 | 0.263 | 0.279 | 0.349 |
| Tiny-ImageNet | Block5 | 0.141 | 0.313 | 0.341 | 0.110 | 0.224 | 0.351 |

that only data augmentation without self-supervised signals cannot bring consistent improvements. The proposed SSGAN-LA significantly outperforms existing SSGANs and DAGANs by inheriting their advantages while overcoming their shortcomings. Compared with existing SSGANs, SSGAN-LA inherits the benefits of self-supervised tasks while avoiding the undesired learning objective. Compared with DAGANs, SSGAN-LA inherits the benefits of augmented data and utilizes self-supervised signals to enhance the representation-learning ability of the discriminator, which in turn can provide more valuable guidance to the generator.
5.4 Comparison of Representation Quality

To check whether the discriminator learns meaningful representations, we train a 10-way logistic regression classifier on CIFAR-10, a 100-way classifier on CIFAR-100, and a 200-way classifier on Tiny-ImageNet, using the learned representations of real data extracted from the residual blocks of the discriminator. Specifically, we train the classifier with a batch size of 128 for 50 epochs. The optimizer is Adam with a learning rate of 0.05, decayed by a factor of 10 at epochs 30 and 40, following the practice of [6]. The linear model is trained on the training set and tested on the validation set of the corresponding dataset. Classification accuracy is the evaluation metric. As shown in Table 2, SSGAN and SSGAN-MS obtain improvements over GAN, confirming that the self-supervised tasks facilitate learning more meaningful representations. DAGAN-MD performs substantially worse than the self-supervised counterparts, and even worse than GAN and DAGAN+ in some cases, validating that the multiple discriminators of DAGAN-MD tend to weaken the representation-learning ability of the shared blocks. In other words, the shared blocks are limited to invariant features, which may lose useful information about the data. Compared with all competitive baselines, the proposed SSGAN-LA achieves the best accuracy on most blocks, especially the deep blocks. These results verify that the discriminator of SSGAN-LA learns meaningful high-level representations. We argue that the powerful representation ability of the discriminator could in turn promote the generative modeling performance of the generator.

Figure 4: Accuracy curves during GAN training on CIFAR-10.

5.5 SSGAN-LA Overcomes Catastrophic Forgetting

Figure 4 plots the accuracy curves (accuracy calculated following Section 5.4) of different methods during GAN training on CIFAR-10.
As GAN training iterations increase, the accuracy of GAN and DAGAN-MD first increases and then decreases, indicating that their discriminators suffer from catastrophic forgetting to some extent. The proposed SSGAN-LA achieves consecutive increases in accuracy and consistently surpasses the other competitive methods, verifying that utilizing self-supervision can effectively overcome the catastrophic forgetting issue.

Table 3: FID and IS comparison under full and limited data regimes on CIFAR-10 and CIFAR-100.

| Metric | Method | CIFAR-10 100% | CIFAR-10 20% | CIFAR-10 10% | CIFAR-100 100% | CIFAR-100 20% | CIFAR-100 10% |
|---|---|---|---|---|---|---|---|
| FID | BigGAN [4] + DiffAugment [66] | 8.70 | 14.04 | 22.40 | 12.00 | 22.14 | 33.70 |
| FID | BigGAN [4] + SSGAN-LA | 8.05 | 11.16 | 15.08 | 9.77 | 15.91 | 23.17 |
| IS | BigGAN [4] + DiffAugment [66] | 9.16 | 8.65 | 8.09 | 10.66 | 9.47 | 8.38 |
| IS | BigGAN [4] + SSGAN-LA | 9.30 | 8.84 | 8.30 | 11.13 | 10.52 | 9.81 |

5.6 SSGAN-LA Improves the Data Efficiency of GANs

In this section, we experiment under full and limited data regimes on CIFAR-10 and CIFAR-100, comparing with the state-of-the-art data augmentation method for training GANs (DiffAugment [66]). We adopt the codebase of DiffAugment and follow its practice of randomly sampling 10k images to calculate the FID and IS scores for a fair comparison. As shown in Table 3, SSGAN-LA significantly improves the data efficiency of GANs and, to our knowledge, achieves new state-of-the-art FID and IS results under the full and limited data regimes on CIFAR-10 and CIFAR-100 with the BigGAN [4] backbone. We argue that the reason is that the proposed label-augmented discriminator is less prone to over-fitting than a normal discriminator, since it solves a more challenging task.
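The linear evaluation protocol of Section 5.4 (a logistic-regression probe on frozen discriminator features) can be sketched as follows. The features here are synthetic stand-ins for discriminator activations, and the plain gradient-descent binary probe is a deliberate simplification of the Adam-trained multi-way classifier described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen discriminator features of two classes (in the paper,
# these would come from a residual block of the trained discriminator).
n, d = 200, 16
feats = np.vstack([rng.normal(-1.0, 1.0, (n, d)), rng.normal(1.0, 1.0, (n, d))])
labels = np.array([0] * n + [1] * n)

# Linear (logistic-regression) probe trained by plain gradient descent;
# the backbone stays frozen, only (w, b) are learned.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))   # sigmoid predictions
    w -= 0.5 * feats.T @ (p - labels) / len(labels)
    b -= 0.5 * (p - labels).mean()

acc = (((feats @ w + b) > 0).astype(int) == labels).mean()
assert acc > 0.95  # well-separated features give a high linear-probe accuracy
```

The probe's accuracy is a proxy for how linearly separable (and hence how semantically meaningful) the frozen features are, which is exactly the quantity compared across blocks in Table 2.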
6 Related Work

Since the appearance of generative adversarial networks (GANs) [14], numerous variants have focused on improving generation performance from the perspectives of network architecture [48, 63, 5, 4, 23, 24], divergence minimization [43, 2, 16, 30, 53, 21], learning procedure [37, 51, 18], weight normalization [38, 50, 67], model regularization [46, 62, 54], and multiple discriminators [41, 12, 40, 10, 1]. Recently, self-supervision has been applied to GANs to improve training stability. Self-supervised learning is a family of unsupervised representation learning methods that solve pretext tasks. Various self-supervised pretext tasks for image representation learning have been proposed, e.g., predicting context [11], solving jigsaw puzzles [42], and recognizing rotation [13]. [6] introduced an auxiliary rotation-prediction task [13] for the discriminator to alleviate catastrophic forgetting. [57, 58] analyzed the drawback of the self-supervised task in [6] and proposed an improved self-supervised task. [35] combined self- and semi-supervised learning to surpass fully supervised GANs. [3] introduced a deshuffling task for the discriminator that solves jigsaw puzzles. [19] required the discriminator to examine the structural consistency of real images. [45] extended self-supervision into the latent space by detecting latent transformations. [29] explored contrastive learning and mutual information maximization to simultaneously tackle the catastrophic forgetting and mode collapse issues. [64] presented a transformation-based consistency regularization, which was improved in [68, 69]. A similar label augmentation technique [28] for discriminators was also proposed in DADA [65] and [60]. However, the former (DADA) aimed at data augmentation for few-shot classification, and the latter focused on self-labeling in semi-supervised learning.
Most importantly, both differ from ours in the loss function for the generator, which is key to Theorem 3 guaranteeing that the generator faithfully learns the real data distribution.

7 Conclusion

In this paper, we present a novel formulation of transformation-based self-supervised GANs that is able to faithfully learn the real data distribution. Specifically, we unify the original GAN task and the self-supervised task into a single multi-class classification task for the discriminator (forming a label-augmented discriminator) by augmenting the original GAN labels (real or fake) via self-supervision of data transformation. Consequently, the generator is encouraged to minimize the reversed Kullback-Leibler divergence between the real data distribution and the generated data distribution under different transformations, as provided by the optimal label-augmented discriminator. We provide both theoretical analysis and empirical evidence to support the superiority of the proposed method over previous self-supervised GANs and data augmentation GANs. In the future, we will extend the proposed approach to semi- and fully supervised GANs.

Acknowledgments and Disclosure of Funding

This work is funded by the National Natural Science Foundation of China under Grant Nos. 62102402 and 91746301, and the National Key R&D Program of China (2020AAA0105200). Huawei Shen is also supported by the Beijing Academy of Artificial Intelligence (BAAI) under grant number BAAI2019QN0304 and the K.C. Wong Education Foundation.

References

[1] I. Albuquerque, J. Monteiro, T. Doan, B. Considine, T. Falk, and I. Mitliagkas. Multi-objective training of generative adversarial networks with multiple discriminators. In Proceedings of the 36th International Conference on Machine Learning, 2019.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.
[3] G. Baykal and G. Unal. DeshuffleGAN: A self-supervised GAN to improve structure learning. In 2020 IEEE International Conference on Image Processing (ICIP), pages 708-712. IEEE, 2020.
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
[5] T. Chen, M. Lucic, N. Houlsby, and S. Gelly. On self modulation for generative adversarial networks. In International Conference on Learning Representations, 2019.
[6] T. Chen, X. Zhai, M. Ritter, M. Lucic, and N. Houlsby. Self-supervised GANs via auxiliary rotation loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[7] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2016.
[8] A. Coates, A. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215-223. JMLR Workshop and Conference Proceedings, 2011.
[9] H. de Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. C. Courville. Modulating early visual processing by language. In Advances in Neural Information Processing Systems 30, 2017.
[10] T. Doan, J. Monteiro, I. Albuquerque, B. Mazoure, A. Durand, J. Pineau, and R. D. Hjelm. On-line adaptative curriculum learning for GANs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3470-3477, 2019.
[11] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[12] I. Durugkar, I. Gemp, and S. Mahadevan. Generative multi-adversarial networks. arXiv preprint arXiv:1611.01673, 2016.
[13] S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations, 2018.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, 2014.
[15] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 2012.
[16] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30, 2017.
[17] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems 30, 2017.
[18] L. Hou, Z. Yuan, L. Huang, H. Shen, X. Cheng, and C. Wang. Slimmable generative adversarial networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7746-7753, 2021.
[19] R. Huang, W. Xu, T.-Y. Lee, A. Cherian, Y. Wang, and T. Marks. FX-GAN: Self-supervised GAN learning via feature exchange. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), March 2020.
[20] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448-456. PMLR, 2015.
[21] A. Jolicoeur-Martineau. On relativistic f-divergences. In Proceedings of the 37th International Conference on Machine Learning, 2020.
[22] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila. Training generative adversarial networks with limited data. In Advances in Neural Information Processing Systems, volume 33, pages 12104-12114. Curran Associates, Inc., 2020.
[23] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[24] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and improving the image quality of StyleGAN. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[25] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 2017.
[26] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[27] Y. Le and X. Yang. Tiny ImageNet visual recognition challenge. CS 231N, 7(7):3, 2015.
[28] H. Lee, S. J. Hwang, and J. Shin. Self-supervised label augmentation via input transformations. In Proceedings of the 37th International Conference on Machine Learning, 2020.
[29] K. S. Lee, N.-T. Tran, and N.-M. Cheung. InfoMax-GAN: Improved adversarial image generation via information maximization and contrastive learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3942-3952, January 2021.
[30] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Poczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, 2017.
[31] F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394-4412, 2006.
[32] J. H. Lim and J. C. Ye. Geometric GAN. arXiv preprint arXiv:1705.02894, 2017.
[33] Z. Lin, A. Khetan, G. Fanti, and S. Oh. PacGAN: The power of two samples in generative adversarial networks. In Advances in Neural Information Processing Systems, 2018.
[34] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
[35] M. Lučić, M. Tschannen, M. Ritter, X. Zhai, O. Bachem, and S. Gelly. High-fidelity image generation with fewer labels. In Proceedings of the 36th International Conference on Machine Learning, 2019.
[36] Q. Mao, H.-Y. Lee, H.-Y. Tseng, S. Ma, and M.-H. Yang. Mode seeking generative adversarial networks for diverse image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[37] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
[38] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
[39] T. Miyato and M. Koyama. cGANs with projection discriminator. In International Conference on Learning Representations, 2018.
[40] B. Neyshabur, S. Bhojanapalli, and A. Chakrabarti. Stabilizing GAN training with multiple random projections. arXiv preprint arXiv:1705.07831, 2017.
[41] T. Nguyen, T. Le, H. Vu, and D. Phung. Dual discriminator generative adversarial nets. In Advances in Neural Information Processing Systems, 2017.
[42] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, 2016.
[43] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems 29, 2016.
[44] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 1962.
[45] P. Patel, N. Kumari, M. Singh, and B. Krishnamurthy. LT-GAN: Self-supervised GAN with latent transformation detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3189-3198, January 2021.
[46] H. Petzka, A. Fischer, and D. Lukovnikov. On the regularization of Wasserstein GANs. In International Conference on Learning Representations, 2018.
[47] Y. Qiao and N. Minematsu. A study on invariance of f-divergence and its application to speech recognition. IEEE Transactions on Signal Processing, 58(7):3884-3890, 2010.
[48] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[49] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29, 2016.
[50] A. Sanyal, P. H. Torr, and P. K. Dokania. Stable rank normalization for improved generalization in neural networks and GANs. In International Conference on Learning Representations, 2020.
[51] F. Schaefer, H. Zheng, and A. Anandkumar. Implicit competitive regularization in GANs. In Proceedings of the 37th International Conference on Machine Learning, 2020.
[52] A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. In Advances in Neural Information Processing Systems, 2017.
[53] C. Tao, L. Chen, R. Henao, J. Feng, and L. C. Duke. Chi-square generative adversarial network. In Proceedings of the 35th International Conference on Machine Learning, 2018.
[54] D. Terjék. Adversarial Lipschitz regularization. In International Conference on Learning Representations, 2020.
[55] H. Thanh-Tung and T. Tran. Catastrophic forgetting and mode collapse in GANs. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1-10. IEEE, 2020.
[56] D. Tran, R. Ranganath, and D. M. Blei. Deep and hierarchical implicit models. arXiv preprint arXiv:1702.08896, 2017.
[57] N.-T. Tran, V.-H. Tran, B.-N. Nguyen, L. Yang, and N.-M. M. Cheung. Self-supervised GAN: Analysis and improvement with multi-class minimax game. In Advances in Neural Information Processing Systems, 2019.
[58] N.-T. Tran, V.-H. Tran, N.-B. Nguyen, and N.-M. Cheung. An improved self-supervised GAN via adversarial training. arXiv preprint arXiv:1905.05469, 2019.
[59] N.-T. Tran, V.-H. Tran, N.-B. Nguyen, T.-K. Nguyen, and N.-M. Cheung. On data augmentation for GAN training. IEEE Transactions on Image Processing, 30:1882-1897, 2021.
[60] T. Watanabe and P. Favaro. A unified generative adversarial network training via self-labeling and self-attention. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11024-11034. PMLR, 2021.
[61] C. Xiao, P. Zhong, and C. Zheng. BourGAN: Generative networks with metric embeddings. In Advances in Neural Information Processing Systems, 2018.
[62] S. Yamaguchi and M. Koyama. Distributional concavity regularization for GANs. In International Conference on Learning Representations, 2019.
[63] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. In Proceedings of the 36th International Conference on Machine Learning, 2019.
[64] H. Zhang, Z. Zhang, A. Odena, and H. Lee. Consistency regularization for generative adversarial networks. In International Conference on Learning Representations, 2020.
[65] X. Zhang, Z. Wang, D. Liu, and Q. Ling. DADA: Deep adversarial data augmentation for extremely low data regime classification. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2807-2811. IEEE, 2019.
[66] S. Zhao, Z. Liu, J. Lin, J.-Y. Zhu, and S. Han. Differentiable augmentation for data-efficient GAN training. In Advances in Neural Information Processing Systems, volume 33, pages 7559-7570. Curran Associates, Inc., 2020.
[67] Y. Zhao, C. Li, P. Yu, J. Gao, and C. Chen. Feature quantization improves GAN training. In Proceedings of the 37th International Conference on Machine Learning, 2020.
[68] Z. Zhao, S. Singh, H. Lee, Z. Zhang, A. Odena, and H. Zhang. Improved consistency regularization for GANs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11033-11041, 2021.
[69] Z. Zhao, Z. Zhang, T. Chen, S. Singh, and H. Zhang. Image augmentations for GAN training. arXiv preprint arXiv:2006.02595, 2020.
[70] Z. Zhou, J. Liang, Y. Song, L. Yu, H. Wang, W. Zhang, Y. Yu, and Z. Zhang. Lipschitz generative adversarial nets. In Proceedings of the 36th International Conference on Machine Learning, 2019.