# Augmentation-Aware Self-Supervision for Data-Efficient GAN Training

Liang Hou1,3,4, Qi Cao1, Yige Yuan1,3, Songtao Zhao4, Chongyang Ma4, Siyuan Pan4, Pengfei Wan4, Zhongyuan Wang4, Huawei Shen1,3, Xueqi Cheng2,3

1 CAS Key Laboratory of AI Safety and Security, Institute of Computing Technology, Chinese Academy of Sciences
2 CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences
3 University of Chinese Academy of Sciences
4 Kuaishou Technology
lianghou96@gmail.com

Training generative adversarial networks (GANs) with limited data is challenging because the discriminator is prone to overfitting. Previously proposed differentiable augmentation demonstrates improved data efficiency of training GANs. However, the augmentation implicitly introduces undesired invariance to augmentation for the discriminator since it ignores the change of semantics in the label space caused by data transformation, which may limit the representation learning ability of the discriminator and ultimately affect the generative modeling performance of the generator. To mitigate the negative impact of invariance while inheriting the benefits of data augmentation, we propose a novel augmentation-aware self-supervised discriminator that predicts the augmentation parameter of the augmented data. Particularly, the prediction targets of real data and generated data are required to be distinguished since they are different during training. We further encourage the generator to adversarially learn from the self-supervised discriminator by generating augmentation-predictable real and not fake data. This formulation connects the learning objective of the generator and the arithmetic harmonic mean divergence under certain assumptions.
We compare our method with state-of-the-art (SOTA) methods using the class-conditional BigGAN and unconditional StyleGAN2 architectures on data-limited CIFAR-10, CIFAR-100, FFHQ, LSUN-Cat, and five low-shot datasets. Experimental results demonstrate significant improvements of our method over SOTA methods in training data-efficient GANs. Our code is available at https://github.com/liang-hou/augself-gan.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

## 1 Introduction

Generative adversarial networks (GANs) [10] have achieved great progress in synthesizing diverse and high-quality images in recent years [2, 17, 19, 20, 13]. However, the generation quality of GANs depends heavily on the amount of training data [47, 18]. In general, a decrease in training samples usually yields a sharp decline in both the fidelity and the diversity of the generated images [39, 48]. This issue hinders the wide application of GANs because data are often insufficient in real-world applications. For instance, it is valuable to imitate the style of an artist whose paintings are limited. GANs typically consist of a generator that is designed to generate new data and a discriminator that guides the generator to recover the real data distribution. The major challenge of training GANs under limited data is that the discriminator is prone to overfitting [47, 18], and therefore lacks the generalization needed to teach the generator the underlying real data distribution.

In order to alleviate the overfitting issue, recent research has suggested a variety of approaches, mainly from the perspectives of training data [18], loss functions [40], and network architectures [28]. Among them, data augmentation-based methods have gained widespread attention due to their simplicity and extensibility.
Specifically, DiffAugment [47] introduced differentiable augmentation techniques for GANs, in which both real and generated data are augmented to supplement the training set of the discriminator. However, this straightforward augmentation method overlooks augmentation-related semantic information, as it solely augments the domain of the discriminator while neglecting the range. Such a practice might introduce an inductive bias that potentially forces the discriminator to remain invariant to different augmentations [24], which could limit the representation learning of the discriminator and subsequently affect the generation performance of the generator [12].

In this paper, we propose a novel augmentation-aware self-supervised discriminator that predicts the augmentation parameter of augmented data with the original data as reference to address the above problem. Meanwhile, the self-supervised discriminator is required to distinguish between the real data and the generated data since their distributions are different during training, especially in the early stage. The proposed discriminator can benefit the generator in two ways, implicitly and explicitly. On the one hand, the self-supervised discriminator can transfer the learned augmentation-aware knowledge to the original discriminator through parameter sharing. On the other hand, we allow the generator to learn adversarially from the self-supervised discriminator by generating augmentation-predictable real and not fake data (Equation (6)). We also theoretically analyze the connection between this objective function and the minimization of a robust f-divergence (the arithmetic harmonic mean divergence [37]).
In experiments, we show that the proposed method compares favorably to the data augmentation counterparts and other state-of-the-art (SOTA) methods on common data-limited benchmarks (CIFAR-10 [21], CIFAR-100 [21], FFHQ [17], LSUN-Cat [45], and five low-shot image generation datasets [36]) based on the class-conditional BigGAN [2] and unconditional StyleGAN2 [19] architectures. In addition, we carry out extensive experiments to demonstrate the effectiveness of the objective function design, the adaptability to stronger data augmentations, and the robustness of hyper-parameter selection in our method.

## 2 Related Work

In this section, we provide an overview of existing work related to training GANs in data-limited scenarios. We also discuss methodologies incorporating self-supervised learning techniques.

### 2.1 GANs under Limited Training Data

Recently, researchers have become interested in freeing GAN training from the need to collect large amounts of data, for adaptability in real-world scenarios. Previous studies typically fall into two main categories. The first one involves adapting a pre-trained GAN model to the target domain by fine-tuning partial parameters [41, 31, 42, 30]. However, it requires external training data, and the adaptation performance depends heavily on the correlation between the source and target domains. The other one focuses on training GANs from scratch with elaborate data-efficient training strategies. DiffAugment [47] utilized differentiable augmentation to supplement the training set and prevent the discriminator from overfitting in limited data regimes. Concurrently, ADA [18] introduced adaptive data augmentation with a richer set of augmentation categories. APA [16] adaptively augmented the real data with the most plausible generated data. LeCam-GAN [40] proposed adaptive regularization for the discriminator and showed a connection to the Le Cam divergence [23].
Chen et al. [3] discovered that sparse sub-networks (lottery tickets) [5] and feature-level adversarial augmentation could offer orthogonal gains to data augmentation methods. InsGen [43] improved the data efficiency of training GANs by incorporating instance discrimination tasks into the discriminator. MaskedGAN [14] employed masking in the spatial and spectral domains to alleviate the discriminator overfitting issue. GenCo [7] discriminated samples from multiple views with weight-discrepancy and data-discrepancy mechanisms. FreGAN [44] focused on discriminating between real and fake samples in the high-frequency domain. DigGAN [8] constrained the discriminator gradient gap between real and generated data. FastGAN [28] designed a lightweight generator architecture and observed that a self-supervised discriminator could enhance low-shot generation performance. Our method falls into the second category. It supplements data augmentation-based GANs and can also be applied to other methods.

Figure 1: Examples of images with different kinds of differentiable augmentation (including the original unaugmented one) and their re-scaled corresponding augmentation parameters $\omega \in [0, 1]^d$. Panels: original data; color with $\omega_{\text{color}} = (0.2, 0.3, 0.5)$; translation with $\omega_{\text{translation}} = (0.1, 0.8)$; cutout with $\omega_{\text{cutout}} = (0.4, 0.6)$.

### 2.2 GANs with Self-Supervised Learning

Self-supervised learning techniques excel at learning meaningful representations without human-annotated labels by solving pretext tasks. Transformation-based self-supervised learning methods such as rotation recognition [9] have been incorporated into GANs to address catastrophic forgetting in discriminators [4, 38, 12]. Various other self-supervised tasks have also been explored, including jigsaw puzzle solving [1], latent transformation detection [33], and mutual information maximization [25].
Moreover, ContraD [15] decouples the representation learning and discrimination of the discriminator, utilizing contrastive learning for representation learning and a discriminator head for distinguishing real from fake upon the contrastive representations. In contrast to ours, CR-GAN [46] and ICR-GAN proposed consistency regularization for the discriminator, which corresponds to an explicit augmentation invariance of the discriminator. Although both our proposed method and SSGAN-LA [12] belong to adversarial self-supervised learning, they differ in the type of self-supervised signals and model inputs. SSGAN-LA is limited to categorical self-supervision [9], which is incompatible with popular augmentation-based GANs like DiffAugment [47]. Our method is applicable to continuous self-supervision and integrates seamlessly with DiffAugment. Furthermore, continuous self-supervision signals have a magnitude relationship and can thus provide more refined gradient feedback for the model to overcome overfitting in data-limited scenarios. Additionally, unlike SSGAN-LA, our method does not constrain the invertibility of data transformations (Theorem 1) because it additionally takes the original sample as input for the self-supervised discriminator (Equation (5)).

## 3 Preliminaries

In this section, we introduce the necessary concepts and preliminaries for completeness of the paper.

### 3.1 Generative Adversarial Networks

Generative adversarial networks (GANs) [10] typically contain a generator $G: \mathcal{Z} \to \mathcal{X}$ that maps a low-dimensional latent code $z \in \mathcal{Z}$ endowed with a tractable prior $p(z)$, e.g., the multivariate normal distribution $\mathcal{N}(0, I)$, to a high-dimensional data point $x \in \mathcal{X}$, which induces a generated data distribution (density) $p_G(x) = \int_{\mathcal{Z}} p(z)\,\delta(x - G(z))\,dz$ with the Dirac delta distribution $\delta(\cdot)$, and a discriminator $D: \mathcal{X} \to \mathbb{R}$ that is required to distinguish between the real data sampled from the underlying data distribution (density) $p_{\text{data}}(x)$ and the generated ones.
The generator attempts to fool the discriminator to eventually recover the real data distribution, i.e., $p_G(x) = p_{\text{data}}(x)$. Formally, the loss functions for the discriminator and the generator can be formulated as follows:

$$\mathcal{L}_D = \mathbb{E}_{x \sim p_{\text{data}}(x)}[f(D(x))] + \mathbb{E}_{z \sim p(z)}[h(D(G(z)))], \tag{1}$$

$$\mathcal{L}_G = \mathbb{E}_{z \sim p(z)}[g(D(G(z)))]. \tag{2}$$

Different real-valued functions $f$, $h$, and $g$ correspond to different variants of GANs [32]. For example, the minimax GAN [10] can be constructed by setting $f(x) = -\log(\sigma(x))$ and $h(x) = -g(x) = -\log(1 - \sigma(x))$ with the sigmoid function $\sigma(x) = 1/(1 + \exp(-x))$. In this study, we follow the practices of DiffAugment [47] to adopt the hinge loss [26], i.e., $f(x) = h(-x) = \max(0, 1 - x)$ and $g(x) = -x$, for experiments based on BigGAN [2] and the log loss [10], i.e., $f(x) = g(x) = -\log(\sigma(x))$ and $h(x) = -\log(1 - \sigma(x))$, for experiments based on StyleGAN2 [19].

Figure 2: Comparison of representation learning ability of discriminator between BigGAN + DiffAugment and our AugSelf-BigGAN on CIFAR-10 and CIFAR-100 using linear logistic regression. Each panel groups results by 100%, 20%, and 10% data; legend: BigGAN+DiffAug, AugSelf-BigGAN.

### 3.2 Differentiable Augmentation for GANs

DiffAugment [47] introduces differentiable augmentation $T: \mathcal{X} \times \Omega \to \hat{\mathcal{X}}$ parameterized by a randomly sampled parameter $\omega \in \Omega$ with a prior $p(\omega)$ for data-efficient GAN training. The parameter $\omega$ determines exactly how to transform a sample $x$ into an augmented one $\hat{x} \in \hat{\mathcal{X}}$ for the discriminator. After manual re-scaling (so that $\omega \in [0, 1]^d$), the parameters of all three kinds of differentiable augmentation used in DiffAugment for 2D images can be expressed as follows:

- color: $\omega_{\text{color}} = (\lambda_{\text{brightness}}, \lambda_{\text{saturation}}, \lambda_{\text{contrast}}) \in [0, 1]^3$;
- translation: $\omega_{\text{translation}} = (x_{\text{translation}}, y_{\text{translation}}) \in [0, 1]^2$;
- cutout: $\omega_{\text{cutout}} = (x_{\text{offset}}, y_{\text{offset}}) \in [0, 1]^2$.

Figure 1 illustrates the augmentation operations and their parameters.
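The GAN variants named for Equations (1) and (2) differ only in the triple of scalar functions (f, h, g). The sketch below writes the three triples out in plain Python, assuming the usual convention that both losses are minimized; the names `LOSSES`, `discriminator_loss`, and `generator_loss` are illustrative and are not taken from the authors' code.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Each variant is a triple (f, h, g) plugged into
#   L_D = E[f(D(x))] + E[h(D(G(z)))]   and   L_G = E[g(D(G(z)))].
# Sign convention (an assumption of this sketch): both losses are minimized.
LOSSES = {
    "minimax": (
        lambda d: -math.log(sigmoid(d)),        # f
        lambda d: -math.log(1.0 - sigmoid(d)),  # h
        lambda d: math.log(1.0 - sigmoid(d)),   # g = -h (saturating)
    ),
    "hinge": (
        lambda d: max(0.0, 1.0 - d),            # f
        lambda d: max(0.0, 1.0 + d),            # h, i.e. h(-x) = max(0, 1 - x)
        lambda d: -d,                           # g
    ),
    "log": (
        lambda d: -math.log(sigmoid(d)),        # f
        lambda d: -math.log(1.0 - sigmoid(d)),  # h
        lambda d: -math.log(sigmoid(d)),        # g = f (non-saturating)
    ),
}


def mean(values):
    return sum(values) / len(values)


def discriminator_loss(variant, real_logits, fake_logits):
    """Monte Carlo estimate of L_D from discriminator outputs on a batch."""
    f, h, _ = LOSSES[variant]
    return mean([f(d) for d in real_logits]) + mean([h(d) for d in fake_logits])


def generator_loss(variant, fake_logits):
    """Monte Carlo estimate of L_G from discriminator outputs on generated data."""
    _, _, g = LOSSES[variant]
    return mean([g(d) for d in fake_logits])
```

For instance, `discriminator_loss("hinge", real_logits, fake_logits)` only penalizes real logits below 1 and fake logits above -1, which is the margin behaviour of the hinge variant.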
Formally, the loss functions for the discriminator and the generator of GANs with DiffAugment are defined as follows:

$$\mathcal{L}^{\text{da}}_D = \mathbb{E}_{x \sim p_{\text{data}}(x),\, \omega \sim p(\omega)}[f(D(T(x; \omega)))] + \mathbb{E}_{z \sim p(z),\, \omega \sim p(\omega)}[h(D(T(G(z); \omega)))], \tag{3}$$

$$\mathcal{L}^{\text{da}}_G = \mathbb{E}_{z \sim p(z),\, \omega \sim p(\omega)}[g(D(T(G(z); \omega)))], \tag{4}$$

where $\omega$ can represent any combination of these parameters. We choose all augmentations by default, which means that color, translation, and cutout are applied to each image sequentially.

Data augmentation for GANs allows the discriminator to distinguish a single sample from multiple perspectives by transforming it into various augmented samples according to different augmentation parameters. However, it overlooks the differences in augmentation intensity, such as color contrast and translation magnitude, leading the discriminator to implicitly maintain invariance to these varying intensities. The invariance may limit the representation learning ability of the discriminator because it loses augmentation-related information (e.g., color and position) [24]. Figure 2 confirms the impact of this point on the discriminator representation learning task [4]. We argue that a discriminator that captures comprehensive representations contributes to better convergence of the generator [35, 22]. Moreover, data augmentation may lead to augmentation leaking in generated data when using specific data augmentations such as random 90-degree rotations [18, 12].

Therefore, our goal is to eliminate the unnecessary potential inductive bias (invariance to augmentations) for the discriminator while preserving the benefits of data augmentation for training data-efficient GANs. To achieve this goal, we propose a novel augmentation-aware self-supervised discriminator $\hat{D}: \hat{\mathcal{X}} \times \mathcal{X} \to \Omega^+ \cup \Omega^-$ that predicts the augmentation parameter and authenticity of the augmented data given the original data as reference.
Real and generated data are given different self-supervision targets because their distributions differ during training, especially in the early stage. Specifically, the predictive targets of real data and generated data are represented as $\omega^+ \in \Omega^+$ and $\omega^- \in \Omega^-$, respectively. They are constructed from the augmentation parameter $\omega$ with different transformations, i.e., $\omega^+ = -\omega^- = \omega$. Since the augmentation parameter is a continuous vector, we use mean squared error loss to regress it. The proposed method combines continuous self-supervised signals with real-vs-fake discrimination signals, and can thus be considered as soft-label augmentation [12]. For a comparison with self-supervision that does not distinguish between real and fake, see Table 6 in Appendix C. Notice that the predictive targets (augmentations) can be a subset of the performed augmentations (see Table 7 in Appendix C for a comparison). Mathematically, the loss function for the augmentation-aware self-supervised discriminator is formulated as follows:

$$\mathcal{L}^{\text{ss}}_{\hat{D}} = \mathbb{E}_{x, \omega}\left[\left\|\hat{D}(T(x; \omega), x) - \omega^+\right\|_2^2\right] + \mathbb{E}_{z, \omega}\left[\left\|\hat{D}(T(G(z); \omega), G(z)) - \omega^-\right\|_2^2\right]. \tag{5}$$

Figure 3: Diagram of AugSelf-GAN. The original augmentation-based discriminator is $D(T(\cdot)) = \psi(\phi(T(\cdot)))$. The augmentation-aware self-supervised discriminator is $\hat{D}(T(\cdot), \cdot) = \varphi(\phi(T(\cdot)) - \phi(\cdot))$, where $\varphi$ is our newly introduced linear layer with negligible additional parameters.

In our implementations, the proposed self-supervised discriminator $\hat{D} = \varphi \circ \phi$ shares the backbone $\phi: \mathcal{X} \to \mathbb{R}^d$ with the original discriminator $D = \psi \circ \phi$ except for the output linear layer $\varphi: \mathbb{R}^d \to \Omega^+ \cup \Omega^-$. This parameter-sharing design not only improves the representation learning ability of the original discriminator but also keeps the parameter overhead over the base model small, e.g., 0.04% more parameters in BigGAN and 0.01% more in StyleGAN2.
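The self-supervised discriminator loss of Equation (5) can be sketched in a few lines of plain Python. The sketch assumes the targets are ω+ = ω for real data and ω− = −ω for generated data, and it uses a linear head on the difference of backbone features as described in the Figure 3 caption. The `augment` and `backbone` functions are toy stand-ins invented for this illustration; they are not DiffAugment operations or the authors' network.

```python
def augment(x, omega):
    """Toy stand-in for T(x; omega): a brightness-like shift, omega in [0, 1]^1."""
    return [v + (omega[0] - 0.5) for v in x]


def backbone(x):
    """Toy stand-in for the shared backbone phi: two hand-made features."""
    return [sum(x) / len(x), max(x) - min(x)]


def head(features, weights, bias):
    """Linear head (varphi) mapping a feature vector to a predicted parameter."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def d_hat(x_aug, x, weights, bias):
    """Self-supervised discriminator: head applied to phi(x_aug) - phi(x)."""
    diff = [a - b for a, b in zip(backbone(x_aug), backbone(x))]
    return head(diff, weights, bias)


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def ss_discriminator_loss(reals, omegas_real, fakes, omegas_fake, weights, bias):
    """Monte Carlo estimate of Equation (5).

    Real samples regress towards +omega, generated samples towards -omega
    (the target construction is an assumption stated in the lead-in).
    """
    real_term = sum(
        sq_dist(d_hat(augment(x, w), x, weights, bias), w)
        for x, w in zip(reals, omegas_real)
    ) / len(reals)
    fake_term = sum(
        sq_dist(d_hat(augment(x, w), x, weights, bias), [-c for c in w])
        for x, w in zip(fakes, omegas_fake)
    ) / len(fakes)
    return real_term + fake_term
```

Because the head only sees the feature difference, it can read off how much the augmentation changed the sample, which is the design motivation given for the subtraction.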
More specifically, the self-supervised discriminator predicts the target based on the difference between learned representations of the augmented data and the original data, i.e., $\hat{D}(T(x; \omega), x) = \varphi(\phi(T(x; \omega)) - \phi(x))$ (see Table 8 in Appendix C for a comparison with other architectures). The philosophy behind our design is that the backbone $\phi$ should capture rich (which necessitates the design of a simple head $\varphi$) and linear (inspiring us to perform subtraction on the features) representations.

In order for the generator to directly benefit from the self-supervision of data augmentation, we establish a novel adversarial game between the augmentation-aware self-supervised discriminator and the generator, with the objective function for the generator defined as follows:

$$\mathcal{L}^{\text{ss}}_G = \mathbb{E}_{z, \omega}\left[\left\|\hat{D}(T(G(z); \omega), G(z)) - \omega^+\right\|_2^2\right] - \mathbb{E}_{z, \omega}\left[\left\|\hat{D}(T(G(z); \omega), G(z)) - \omega^-\right\|_2^2\right]. \tag{6}$$

The objective function is actually the combination of the non-saturating loss (regarding the generated data as real, $\min_G \mathbb{E}_{z, \omega}[\|\hat{D}(T(G(z); \omega), G(z)) - \omega^+\|_2^2]$) and the saturating loss (reversely optimizing the objective function of the discriminator, $\max_G \mathbb{E}_{z, \omega}[\|\hat{D}(T(G(z); \omega), G(z)) - \omega^-\|_2^2]$) (see Table 9 in Appendix C for an ablation). Intuitively, the non-saturating loss encourages the generator to produce augmentation-predictable data, facilitating fidelity but reducing diversity. Conversely, the saturating loss strives for the generator to avoid generating augmentation-predictable data, promoting diversity at the cost of fidelity. We will elucidate in Section 5 how this formalization assists the generator in matching the fidelity and diversity of real data, ultimately leading to an accurate approximation of the target data distribution.
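As a minimal sketch of the generator objective in Equation (6), the function below takes the self-supervised discriminator's predictions on augmented generated samples and combines the non-saturating and saturating terms. It assumes the targets ω+ = ω and ω− = −ω; the function names are illustrative. Under that assumption the two squared distances collapse to −4⟨prediction, ω⟩, which the second function computes directly.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def ss_generator_loss(preds, omegas):
    """Monte Carlo estimate of Equation (6).

    preds[i]  : prediction of the self-supervised discriminator for the i-th
                augmented generated sample (a list of floats)
    omegas[i] : augmentation parameter used for that sample
    Targets are assumed to be +omega (real) and -omega (generated).
    """
    n = len(preds)
    non_saturating = sum(sq_dist(p, w) for p, w in zip(preds, omegas)) / n
    saturating = sum(sq_dist(p, [-c for c in w]) for p, w in zip(preds, omegas)) / n
    return non_saturating - saturating


def ss_generator_loss_closed_form(preds, omegas):
    """Same quantity via the identity ||p - w||^2 - ||p + w||^2 = -4 <p, w>."""
    n = len(preds)
    return sum(-4.0 * sum(a * b for a, b in zip(p, w))
               for p, w in zip(preds, omegas)) / n
```

Read this way, minimizing the loss pushes the prediction on generated data to align with the true augmentation parameter, i.e. towards the behaviour the discriminator is trained to show on real data.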
The total objective functions for the original discriminator, the augmentation-aware self-supervised discriminator, and the generator of our proposed method, named AugSelf-GAN, are given by:

$$\min_{D, \hat{D}} \; \mathcal{L}^{\text{da}}_D + \lambda_d \cdot \mathcal{L}^{\text{ss}}_{\hat{D}}, \tag{7}$$

$$\min_{G} \; \mathcal{L}^{\text{da}}_G + \lambda_g \cdot \mathcal{L}^{\text{ss}}_G, \tag{8}$$

where the hyper-parameters are set as $\lambda_d = \lambda_g = 1$ in experiments by default unless otherwise specified (see Figure 6 for empirical studies). See Appendix B for details of the objective functions.

Figure 4: Comparison of the function $f$ in different f-divergences (curves: KL, rKL, JS, LC, AHM). The f-divergence between two probability distributions $p(x)$ and $q(x)$ is defined as $D_f(p(x) \,\|\, q(x)) = \int_{\mathcal{X}} q(x) f(p(x)/q(x))\,dx$ with a convex function $f: \mathbb{R}_{\geq 0} \to \mathbb{R}$ satisfying $f(1) = 0$. The x- and y-axes denote the input $p(x)/q(x)$ and the value of the function $f$ in the f-divergence. The function $f$ of the AHM divergence yields the most robust value for large inputs.

## 5 Theoretical Analysis

In this section, we analyze the connection between the theoretical learning objective of AugSelf-GAN and the arithmetic harmonic mean (AHM) divergence [37] under certain assumptions.

Proposition 1. For any generator $G$ and given unlimited capacity in the function space, the optimal augmentation-aware self-supervised discriminator $\hat{D}^*$ has the form:

$$\hat{D}^*(\hat{x}, x) = \frac{\int p_{\text{data}}(x, \omega, \hat{x})\,\omega^+\,d\omega + \int p_G(x, \omega, \hat{x})\,\omega^-\,d\omega}{p_{\text{data}}(x, \hat{x}) + p_G(x, \hat{x})}. \tag{9}$$

The proofs of all theoretical results (including the following ones) are deferred to Appendix A.

Theorem 1. Assume that $\omega^+ = -\omega^- = \mathbf{c}$ is constant. Under the optimal self-supervised discriminator $\hat{D}^*$, optimizing the self-supervised task for the generator $G$ is equivalent to:

$$\min_G \; 4c \cdot M_{\text{AH}}(p_{\text{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})), \tag{10}$$

where $c = \|\mathbf{c}\|_2^2$ is constant and $M_{\text{AH}}$ is the arithmetic harmonic mean divergence [37], of which the minimum is achieved if and only if $p_G(x, \hat{x}) = p_{\text{data}}(x, \hat{x}) \Rightarrow p_G(x) = p_{\text{data}}(x)$.
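Theorem 1 can be checked numerically on small discrete distributions. The sketch below is an illustration under stated assumptions rather than the proof in Appendix A: it uses a scalar constant target c (so the constant in Equation (10) is c squared), takes the AHM divergence in the form Σ (p − q)² / (2(p + q)), and plugs the constant-target special case of Proposition 1 into the generator's self-supervised loss. All function names are hypothetical.

```python
def ahm_divergence(p, q):
    """Arithmetic harmonic mean divergence, assumed form sum (p - q)^2 / (2 (p + q))."""
    return sum((a - b) ** 2 / (2.0 * (a + b)) for a, b in zip(p, q) if a + b > 0)


def harmonic_mean_divergence(p, q):
    """Harmonic mean W = sum 2 p q / (p + q)."""
    return sum(2.0 * a * b / (a + b) for a, b in zip(p, q) if a + b > 0)


def triangular_discrimination(p, q):
    """Le Cam (triangular) form sum (p - q)^2 / (p + q)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def optimal_d_hat(p_data, p_g, c):
    """Proposition 1 with constant targets +c (real) and -c (generated)."""
    return [c * (a - b) / (a + b) for a, b in zip(p_data, p_g)]


def generator_ss_loss(p_data, p_g, c):
    """E_{p_G}[(D* - c)^2 - (D* + c)^2] evaluated exactly on a discrete support."""
    d_star = optimal_d_hat(p_data, p_g, c)
    return sum(b * ((d - c) ** 2 - (d + c) ** 2) for b, d in zip(p_g, d_star))


p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.3, 0.5]
c = 2.0

lhs = generator_ss_loss(p_data, p_g, c)
rhs = 4.0 * c ** 2 * ahm_divergence(p_data, p_g)
print(lhs, rhs)  # the two values agree up to floating-point error
```

On the same example, `ahm_divergence` also equals `1 - harmonic_mean_divergence` and half of `triangular_discrimination`, and the loss is zero when the two distributions coincide.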
Theorem 1 reveals that the generator of AugSelf-GAN theoretically still satisfies generative modeling, i.e., accurately learning the target data distribution, under certain assumptions. Although AugSelf-GAN does not obey the strict assumption, we note that this is not rare in the literature (see Footnote 2). Under this assumption, AugSelf-GAN can be regarded as a multi-dimensional extension of LS-GAN [29] in terms of the loss function, while excluding that of the generator. Additionally, our analysis offers an alternative theoretically grounded generator loss function for the LS-GAN family (see Footnote 3).

Corollary 1. The following equality and inequality hold for the AHM divergence:

$$M_{\text{AH}}(p_{\text{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})) + M_{\text{AH}}(p_G(x, \hat{x}) \,\|\, p_{\text{data}}(x, \hat{x})) = \Delta(p_{\text{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})),$$

$$M_{\text{AH}}(p_{\text{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})) = 1 - W(p_{\text{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})) \leq 1,$$

where $\Delta$ is the Le Cam (LC) divergence [23], and $W$ is the harmonic mean divergence [37].

Corollary 1 reveals an inequality $M_{\text{AH}}(p_{\text{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})) \leq \Delta(p_{\text{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x}))$ because of the non-negativity of the AHM divergence, $M_{\text{AH}}(p_G(x, \hat{x}) \,\|\, p_{\text{data}}(x, \hat{x})) \geq 0$. Figure 4 plots the function $f$ in the AHM divergence and other common f-divergences in the GAN literature. The AHM divergence shows better robustness of the function $f$ than others for extremely large inputs $p(x)/q(x) = D^*(x)/(1 - D^*(x))$, which is likely for the optimal discriminator $D^*$ in data-limited scenarios.

Footnote 2: Goodfellow et al. [10] analyzed that the saturated GAN optimizes the Jensen-Shannon (JS) divergence, but in fact it uses the non-saturated loss. LeCam-GAN [40] showed a connection between the Le Cam (LC) divergence [23] and its objective function based on fixed regularization, but in practice it uses an exponential moving average.

Footnote 3: The theoretical analysis of LS-GAN is actually inconsistent with its generator loss function.

Table 1: IS and FID comparisons of AugSelf-BigGAN with state-of-the-art methods on CIFAR-10 and CIFAR-100 with full and limited data. The best result is bold and the second best is underlined.

| Dataset | Method | IS (↑), 100% data | FID (↓), 100% data | IS (↑), 20% data | FID (↓), 20% data | IS (↑), 10% data | FID (↓), 10% data |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | BigGAN [2] | 9.07 | 9.59 | 8.52 | 21.58 | 7.09 | 39.78 |
| CIFAR-10 | DiffAugment [47] | 9.16 | 8.70 | 8.65 | 14.04 | 8.09 | 22.40 |
| CIFAR-10 | CR-GAN [46] | 9.17 | 8.49 | 8.61 | 12.84 | 8.49 | 18.70 |
| CIFAR-10 | LeCam-GAN [40] | 9.43 | 8.28 | 8.83 | 12.56 | 8.57 | 17.68 |
| CIFAR-10 | DigGAN [8] | 9.28 | 8.49 | 8.89 | 13.01 | 8.32 | 17.87 |
| CIFAR-10 | Tickets [3] | - | 8.19 | - | 12.83 | - | 16.74 |
| CIFAR-10 | MaskedGAN [14] | - | 8.41 ± .03 | - | 12.51 ± .09 | - | 15.89 ± .12 |
| CIFAR-10 | GenCo [7] | - | 7.98 ± .02 | - | 12.61 ± .05 | - | 18.10 ± .06 |
| CIFAR-10 | AugSelf-BigGAN | 9.43 ± .14 | 7.68 ± .06 | 8.98 ± .09 | 10.97 ± .09 | 8.76 ± .05 | 15.68 ± .26 |
| CIFAR-10 | AugSelf-BigGAN+ | 9.27 ± .05 | 7.54 ± .04 | 9.08 ± .04 | 9.95 ± .17 | 8.79 ± .04 | 12.76 ± .14 |
| CIFAR-100 | BigGAN [2] | 10.71 | 12.87 | 8.58 | 33.11 | 6.74 | 66.71 |
| CIFAR-100 | DiffAugment [47] | 10.66 | 12.00 | 9.47 | 22.14 | 8.38 | 33.70 |
| CIFAR-100 | CR-GAN [46] | 10.81 | 11.25 | 9.12 | 20.28 | 8.70 | 26.90 |
| CIFAR-100 | LeCam-GAN [40] | 11.05 | 11.20 | 9.81 | 18.03 | 9.27 | 27.63 |
| CIFAR-100 | DigGAN [8] | 11.45 | 11.63 | 9.54 | 19.79 | 8.98 | 24.59 |
| CIFAR-100 | Tickets [3] | - | 10.73 | - | 17.43 | - | 23.80 |
| CIFAR-100 | MaskedGAN [14] | - | 11.65 ± .03 | - | 18.33 ± .09 | - | 24.02 ± .12 |
| CIFAR-100 | GenCo [7] | - | 10.92 ± .02 | - | 18.44 ± .04 | - | 25.22 ± .06 |
| CIFAR-100 | AugSelf-BigGAN | 11.19 ± .09 | 9.88 ± .07 | 10.25 ± .06 | 16.11 ± .25 | 9.78 ± .08 | 21.30 ± .15 |
| CIFAR-100 | AugSelf-BigGAN+ | 11.12 ± .10 | 10.09 ± .05 | 10.14 ± .11 | 15.33 ± .20 | 9.93 ± .06 | 18.64 ± .09 |

Table 2: FID comparison of AugSelf-StyleGAN2 with competing methods on FFHQ and LSUN-Cat with limited training samples. The best result is highlighted in bold and the second best is underlined.

| Method | FFHQ 30K | FFHQ 10K | FFHQ 5K | FFHQ 1K | LSUN-Cat 30K | LSUN-Cat 10K | LSUN-Cat 5K | LSUN-Cat 1K |
|---|---|---|---|---|---|---|---|---|
| StyleGAN2 [19] | 6.16 | 14.75 | 26.60 | 62.16 | 10.12 | 17.93 | 34.69 | 182.85 |
| + ADA [18] | 5.46 | 8.13 | 10.96 | 21.29 | 10.50 | 13.13 | 16.95 | 43.25 |
| + DiffAugment [47] | 5.05 | 7.86 | 10.45 | 25.66 | 9.68 | 12.07 | 16.11 | 42.26 |
| AugSelf-StyleGAN2 | 4.95 | 6.98 | 9.69 | 23.38 | 9.22 | 11.98 | 14.86 | 36.76 |
| AugSelf-StyleGAN2+ | 5.82 | 6.65 | 9.15 | 20.39 | 9.43 | 12.00 | 14.12 | 26.52 |

## 6 Experiments

We implement AugSelf-GAN based on DiffAugment [47], keeping the backbones and settings unchanged for fair comparisons with prior work under two evaluation metrics, IS [34] and FID [11].
The mean and standard deviation (if reported) are obtained with five evaluation runs at the best FID checkpoint. Each experiment in this work was conducted on a 32GB NVIDIA V100 GPU.

### 6.1 Comparison with State-of-the-Art Methods

CIFAR-10 and CIFAR-100. Table 1 reports the results on CIFAR-10 and CIFAR-100 [21]. These experiments are based on the BigGAN architecture [2]. Our method significantly outperforms the direct baseline DiffAugment [47] and yields the best generation performance in terms of FID and IS compared with SOTA methods. Notably, our method achieves further improvement when using stronger augmentation (see Table 5), i.e., AugSelf-BigGAN+ (stronger translation and cutout).

FFHQ and LSUN-Cat. Table 2 reports the FID results on FFHQ [17] and LSUN-Cat [45]. The hyper-parameter is $\lambda_g = 0.2$. AugSelf-GAN performs substantially better than baselines with the same network backbone. Also, stronger augmentation, i.e., AugSelf-StyleGAN2+ (stronger translation and cutout), further improves the performance when training data is very limited.

Figure 5: Qualitative comparison between DiffAug (DA) and AugSelf (AS) on low-shot generation. Panels: Obama, Grumpy cat, Panda, AnimalFace Cat, AnimalFace Dog.

Table 3: FID comparison of AugSelf-StyleGAN2 with state-of-the-art methods with and without pre-training on five low-shot datasets. The best result is in bold and the second best is underlined.

| Method | Pre-training? | 100-shot Obama | 100-shot Grumpy cat | 100-shot Panda | AnimalFace Cat | AnimalFace Dog |
|---|---|---|---|---|---|---|
| Scale/shift [31] | Yes | 50.72 | 34.20 | 21.38 | 54.83 | 83.04 |
| MineGAN [42] | Yes | 50.63 | 34.54 | 14.84 | 54.45 | 93.03 |
| TransferGAN [41] | Yes | 48.73 | 34.06 | 23.20 | 52.61 | 82.38 |
| FreezeD [30] | Yes | 41.87 | 31.22 | 17.95 | 47.70 | 70.46 |
| StyleGAN2 [19] | No | 80.20 | 48.90 | 34.27 | 71.71 | 130.19 |
| + AdvAug [3] | No | 52.86 | 31.02 | 14.75 | 47.40 | 68.28 |
| + ADA [18] | No | 45.69 | 26.62 | 12.90 | 40.77 | 56.83 |
| + APA [16] | No | 42.97 | 28.10 | 19.21 | 42.60 | 81.16 |
| + DiffAugment [47] | No | 46.87 | 27.08 | 12.06 | 42.44 | 58.85 |
| AugSelf-StyleGAN2 | No | 26.00 | 19.81 | 8.36 | 30.53 | 48.19 |

Low-shot image generation.
Table 3 shows the FID scores on five common low-shot image generation benchmarks [36] (Obama, Grumpy cat, Panda, AnimalFace Cat, and AnimalFace Dog). The baselines are divided into two categories according to whether they were pre-trained. Due to its training stability, we can train AugSelf-StyleGAN2 for 5k generator update steps to ensure convergence. The hyper-parameters are $\lambda_d = \lambda_g = 0.1$ on Grumpy cat and AnimalFace Cat, and the self-supervision is color on all datasets. Impressively, AugSelf-StyleGAN2 surpasses competing methods by a large margin on all low-shot datasets, approaching SOTA FID scores to our knowledge. Figure 5 visually compares the generated images of DiffAugment and AugSelf-GAN, revealing the latter as superior in generating images with more diversity and fewer artifacts.

### 6.2 Analysis of AugSelf-GAN

Fixed supervision. Table 4 reports the results of AugSelf-GAN on CIFAR-10 and CIFAR-100 compared to the setup using fixed self-supervision, i.e., $\omega^+ = -\omega^- = \mathbf{1}$, which corresponds to the assumption in Theorem 1. AugSelf-GAN outperforms the fixed self-supervision in terms of both IS and FID in all training data regimes. The reason is somewhat intuitive: fixed self-supervision does not constitute self-supervised learning for the model, so it cannot enable the discriminator to learn semantic information related to data augmentation to optimize the generator. Interestingly, the fixed one beats the baseline, which may be attributed to the multi-input [27] and regression loss [29].

Table 4: FID comparison with fixed self-supervision. The best results are highlighted in bold.

| Dataset | Method | IS (↑), 100% data | FID (↓), 100% data | IS (↑), 20% data | FID (↓), 20% data | IS (↑), 10% data | FID (↓), 10% data |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | Fixed (c = 1) | 9.25 ± .17 | 8.01 ± .05 | 8.70 ± .12 | 12.58 ± .16 | 8.53 ± .05 | 17.66 ± .55 |
| CIFAR-10 | AugSelf-GAN | 9.43 ± .14 | 7.68 ± .06 | 8.98 ± .09 | 10.97 ± .09 | 8.76 ± .05 | 15.68 ± .26 |
| CIFAR-100 | Fixed (c = 1) | 10.67 ± .06 | 12.02 ± .07 | 9.94 ± .06 | 17.70 ± .17 | 9.50 ± .13 | 22.84 ± .28 |
| CIFAR-100 | AugSelf-GAN | 11.19 ± .09 | 9.88 ± .07 | 10.25 ± .06 | 16.11 ± .25 | 9.78 ± .08 | 21.30 ± .15 |

Table 5: Study on stronger augmentation. The best is in bold and the second best is underlined.

| Dataset | Method | IS (↑), 100% data | FID (↓), 100% data | IS (↑), 20% data | FID (↓), 20% data | IS (↑), 10% data | FID (↓), 10% data |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | DiffAugment | 9.29 ± .02 | 8.48 ± .13 | 8.84 ± .12 | 15.14 ± .47 | 8.80 ± .01 | 20.60 ± .13 |
| CIFAR-10 | + trans. cut. | 9.28 ± .06 | 8.42 ± .18 | 8.78 ± .06 | 14.28 ± .27 | 8.69 ± .07 | 20.93 ± .21 |
| CIFAR-10 | AugSelf-GAN | 9.43 ± .14 | 7.68 ± .06 | 8.98 ± .09 | 10.97 ± .09 | 8.76 ± .05 | 15.68 ± .26 |
| CIFAR-10 | + trans. cut. | 9.27 ± .05 | 7.54 ± .04 | 9.08 ± .04 | 9.95 ± .17 | 8.79 ± .04 | 12.76 ± .14 |
| CIFAR-100 | DiffAugment | 11.02 ± .07 | 11.49 ± .21 | 9.45 ± .05 | 24.98 ± .48 | 8.50 ± .09 | 34.92 ± .63 |
| CIFAR-100 | + trans. cut. | 11.10 ± .08 | 11.28 ± .20 | 9.58 ± .05 | 24.10 ± .66 | 8.59 ± .04 | 35.32 ± .46 |
| CIFAR-100 | AugSelf-GAN | 11.19 ± .09 | 9.88 ± .07 | 10.25 ± .06 | 16.11 ± .25 | 9.78 ± .08 | 21.30 ± .15 |
| CIFAR-100 | + trans. cut. | 11.12 ± .10 | 10.09 ± .05 | 10.14 ± .11 | 15.33 ± .20 | 9.93 ± .06 | 18.64 ± .09 |

Figure 6: FID curves with varying hyper-parameters $\lambda = \lambda_d = \lambda_g \in [0, 10]$ on CIFAR-10 and CIFAR-100 (x-axis ticks: 0, 0.1, 0.2, 0.5, 1, 2, 5, 10; curves: 10% data, 20% data, 100% data). The hyper-parameter $\lambda = 0$ corresponds to the baseline BigGAN + DiffAugment.

Stronger augmentation. Translation and cutout actually erase parts of the image information, which helps prevent the discriminator from overfitting but could cause underfitting if excessive. Our self-supervised task enables the discriminator to be aware of different levels of translation and cutout, which helps alleviate underfitting and allows us to explore stronger translation and cutout. Table 5 compares AugSelf-GAN with DiffAugment in this setting.
Overall, when data is limited, AugSelf-GAN can further benefit from stronger translation and cutout and achieve new SOTA FID results, while DiffAugment cannot. This implicitly indicates that our method enables the model to learn meaningful features to overcome underfitting, even under strong data augmentation.

Hyper-parameters. Figure 6 plots the FID results of AugSelf-GAN with different hyper-parameters $\lambda = \lambda_d = \lambda_g$ ranging over $[0, 10]$ on CIFAR-10 and CIFAR-100. Notice that $\lambda = 0$ corresponds to the baseline BigGAN + DiffAugment. AugSelf-BigGAN performs the best when $\lambda$ is near 1. It is worth noting that AugSelf-BigGAN outperforms the baseline even for $\lambda = 10$ with 10% and 20% training data, demonstrating superior robustness with respect to the hyper-parameter $\lambda$.

## 7 Conclusion

This paper proposes a data-efficient GAN training method that utilizes augmentation parameters as self-supervision. Specifically, a novel self-supervised discriminator is proposed for predicting the augmentation parameters and data authenticity of augmented (real and generated) data simultaneously, given the original data. Meanwhile, the generator is encouraged to generate real rather than fake data whose augmentation parameters can be recognized by the self-supervised discriminator after augmentation. Theoretical analysis reveals a connection between the optimization objective of the generator and the arithmetic harmonic mean divergence under certain assumptions. Experiments on data-limited benchmarks demonstrate superior qualitative and quantitative performance of the proposed method compared to previous methods.

Limitations. In our experiments, we observed less significant improvement of AugSelf-GAN under sufficient training data. Furthermore, its effectiveness depends on the specific data augmentation used. In some cases, inappropriate data augmentation may limit the performance gain.

Broader impacts. This work aims at improving GANs under limited training data.
While this may result in negative societal impacts, such as lowering the threshold of generating fake content or exacerbating bias and discrimination due to data issues, we believe that these risks can be mitigated. By establishing ethical guidelines for users and exploring fake content detection techniques, one can prevent these undesirable outcomes. Furthermore, this work contributes to the overall development of GANs and even generative models, ultimately promoting their potential benefits for society.

## Acknowledgements

We thank the anonymous reviewers for their valuable and constructive feedback. This work is funded by the National Natural Science Foundation of China under Grant Nos. 62272125, 62102402, U21B2046. Huawei Shen is also supported by Beijing Academy of Artificial Intelligence (BAAI).

## References

[1] Gulcin Baykal, Furkan Ozcelik, and Gozde Unal. Exploring DeshuffleGANs in self-supervised generative adversarial networks. Pattern Recognition, 2022. ISSN 0031-3203.

[2] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.

[3] Tianlong Chen, Yu Cheng, Zhe Gan, Jingjing Liu, and Zhangyang Wang. Data-efficient GAN training beyond (just) augmentations: A lottery ticket perspective. In Advances in Neural Information Processing Systems, 2021.

[4] Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, and Neil Houlsby. Self-supervised GANs via auxiliary rotation loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[5] Xuxi Chen, Zhenyu Zhang, Yongduo Sui, and Tianlong Chen. GANs can play lottery tickets too. In International Conference on Learning Representations, 2021.

[6] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[7] Kaiwen Cui, Jiaxing Huang, Zhipeng Luo, Gongjie Zhang, Fangneng Zhan, and Shijian Lu. Genco: Generative co-training for generative adversarial networks with limited data. Proceedings of the AAAI Conference on Artificial Intelligence, 2022. doi: 10.1609/aaai.v36i1.19928.
[8] Tiantian Fang, Ruoyu Sun, and Alex Schwing. Diggan: Discriminator gradient gap regularization for gan training with limited data. In Advances in Neural Information Processing Systems, 2022.
[9] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations, 2018.
[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
[11] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, 2017.
[12] Liang Hou, Huawei Shen, Qi Cao, and Xueqi Cheng. Self-supervised gans with label augmentation. In Advances in Neural Information Processing Systems, 2021.
[13] Liang Hou, Qi Cao, Huawei Shen, Siyuan Pan, Xiaoshuang Li, and Xueqi Cheng. Conditional GANs with auxiliary discriminative classifier. In Proceedings of the 39th International Conference on Machine Learning, Proceedings of Machine Learning Research, 2022.
[14] Jiaxing Huang, Kaiwen Cui, Dayan Guan, Aoran Xiao, Fangneng Zhan, Shijian Lu, Shengcai Liao, and Eric Xing. Masked generative adversarial networks are data-efficient generation learners. In Advances in Neural Information Processing Systems, 2022.
[15] Jongheon Jeong and Jinwoo Shin. Training GANs with stronger augmentations via contrastive discriminator. In International Conference on Learning Representations, 2021.
[16] Liming Jiang, Bo Dai, Wayne Wu, and Chen Change Loy. Deceive D: Adaptive pseudo augmentation for gan training with limited data. In Advances in Neural Information Processing Systems, 2021.
[17] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[18] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Advances in Neural Information Processing Systems, 2020.
[19] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[20] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In Advances in Neural Information Processing Systems, 2021.
[21] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.
[22] Nupur Kumari, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Ensembling off-the-shelf models for gan training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[23] Lucien Le Cam. Asymptotic methods in statistical decision theory. 2012.
[24] Hankook Lee, Kibok Lee, Kimin Lee, Honglak Lee, and Jinwoo Shin. Improving transferability of representations via augmentation-aware self-supervision. In Advances in Neural Information Processing Systems, 2021.
[25] Kwot Sin Lee, Ngoc-Trung Tran, and Ngai-Man Cheung. Infomax-gan: Improved adversarial image generation via information maximization and contrastive learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021.
[26] Jae Hyun Lim and Jong Chul Ye. Geometric gan. arXiv preprint arXiv:1705.02894, 2017.
[27] Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh. Pacgan: The power of two samples in generative adversarial networks. In Advances in Neural Information Processing Systems, 2018.
[28] Bingchen Liu, Yizhe Zhu, Kunpeng Song, and Ahmed Elgammal. Towards faster and stabilized GAN training for high-fidelity few-shot image synthesis. In International Conference on Learning Representations, 2021.
[29] Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[30] Sangwoo Mo, Minsu Cho, and Jinwoo Shin. Freeze the discriminator: a simple baseline for fine-tuning gans. arXiv preprint arXiv:2002.10964, 2020.
[31] Atsuhiro Noguchi and Tatsuya Harada. Image generation from small datasets via batch statistics adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[32] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, 2016.
[33] Parth Patel, Nupur Kumari, Mayank Singh, and Balaji Krishnamurthy. Lt-gan: Self-supervised gan with latent transformation detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021.
[34] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, 2016.
[35] Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected gans converge faster. In Advances in Neural Information Processing Systems, 2021.
[36] Zhangzhang Si and Song-Chun Zhu. Learning hybrid image templates (hit) by information projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[37] Inder Jeet Taneja. On mean divergence measures. Advances in Inequalities from probability theory and statistics. Nova, USA, 2008.
[38] Ngoc-Trung Tran, Viet-Hung Tran, Bao-Ngoc Nguyen, Linxiao Yang, and Ngai-Man (Man) Cheung. Self-supervised gan: Analysis and improvement with multi-class minimax game. In Advances in Neural Information Processing Systems, 2019.
[39] Ngoc-Trung Tran, Viet-Hung Tran, Ngoc-Bao Nguyen, Trung-Kien Nguyen, and Ngai-Man Cheung. On data augmentation for gan training. IEEE Transactions on Image Processing, 2021. doi: 10.1109/TIP.2021.3049346.
[40] Hung-Yu Tseng, Lu Jiang, Ce Liu, Ming-Hsuan Yang, and Weilong Yang. Regularizing generative adversarial networks under limited data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[41] Yaxing Wang, Chenshen Wu, Luis Herranz, Joost van de Weijer, Abel Gonzalez-Garcia, and Bogdan Raducanu. Transferring gans: generating images from limited data. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[42] Yaxing Wang, Abel Gonzalez-Garcia, David Berga, Luis Herranz, Fahad Shahbaz Khan, and Joost van de Weijer. Minegan: Effective knowledge transfer from gans to target domains with few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[43] Ceyuan Yang, Yujun Shen, Yinghao Xu, and Bolei Zhou. Data-efficient instance generation from instance discrimination. In Advances in Neural Information Processing Systems, 2021.
[44] Mengping Yang, Zhe Wang, Ziqiu Chi, and Yanbing Zhang. Fregan: Exploiting frequency components for training gans under limited data. In Advances in Neural Information Processing Systems, 2022.
[45] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao.
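Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[46] Han Zhang, Zizhao Zhang, Augustus Odena, and Honglak Lee. Consistency regularization for generative adversarial networks. In International Conference on Learning Representations, 2020.
[47] Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, and Song Han. Differentiable augmentation for data-efficient gan training. In Advances in Neural Information Processing Systems, 2020.
[48] Zhengli Zhao, Zizhao Zhang, Ting Chen, Sameer Singh, and Han Zhang. Image augmentations for gan training. arXiv preprint arXiv:2006.02595, 2020.

Proposition 1. For any generator $G$ and given unlimited capacity in the function space, the optimal augmentation-aware self-supervised discriminator $\hat{D}^*$ has the form:

$$\hat{D}^*(\hat{x}, x) = \frac{\int p_{\mathrm{data}}(x, \omega, \hat{x})\,\omega^+\,d\omega + \int p_G(x, \omega, \hat{x})\,\omega^-\,d\omega}{p_{\mathrm{data}}(x, \hat{x}) + p_G(x, \hat{x})} \tag{9}$$

Proof. The objective function of the self-supervised discriminator can be written as follows:

$$
\begin{aligned}
\mathcal{L}^{ss}_{\hat{D}} &= \mathbb{E}_{x,\omega}\left[\left\|\hat{D}(T(x;\omega), x) - \omega^+\right\|_2^2\right] + \mathbb{E}_{z,\omega}\left[\left\|\hat{D}(T(G(z);\omega), G(z)) - \omega^-\right\|_2^2\right] \\
&= \iiint \left[ p_{\mathrm{data}}(x, \omega, \hat{x}) \left\|\hat{D}(\hat{x}, x) - \omega^+\right\|_2^2 + p_G(x, \omega, \hat{x}) \left\|\hat{D}(\hat{x}, x) - \omega^-\right\|_2^2 \right] dx\,d\omega\,d\hat{x}.
\end{aligned}
$$

Minimizing the integral objective is equivalent to minimizing the objective at each data point:

$$\mathcal{L}^{ss}_{\hat{D}}(x, \hat{x}) = \int \left[ p_{\mathrm{data}}(x, \omega, \hat{x}) \left\|\hat{D}(\hat{x}, x) - \omega^+\right\|_2^2 + p_G(x, \omega, \hat{x}) \left\|\hat{D}(\hat{x}, x) - \omega^-\right\|_2^2 \right] d\omega.$$

Taking its derivative with respect to the self-supervised discriminator and setting it to 0 gives the optimal self-supervised discriminator:

$$
\begin{aligned}
\frac{\partial \mathcal{L}^{ss}_{\hat{D}}(x, \hat{x})}{\partial \hat{D}(\hat{x}, x)} &= \int \left[ p_{\mathrm{data}}(x, \omega, \hat{x}) \cdot 2\left(\hat{D}(\hat{x}, x) - \omega^+\right) + p_G(x, \omega, \hat{x}) \cdot 2\left(\hat{D}(\hat{x}, x) - \omega^-\right) \right] d\omega = 0 \\
\Rightarrow\ \hat{D}^*(\hat{x}, x) &= \frac{\int p_{\mathrm{data}}(x, \omega, \hat{x})\,\omega^+\,d\omega + \int p_G(x, \omega, \hat{x})\,\omega^-\,d\omega}{p_{\mathrm{data}}(x, \hat{x}) + p_G(x, \hat{x})}.
\end{aligned}
$$

Theorem 1.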
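Assume that $\omega^+ = -\omega^- = \mathbf{c}$ is constant. Under the optimal self-supervised discriminator $\hat{D}^*$, optimizing the self-supervised task for the generator $G$ is equivalent to:

$$\min_G\ 4c \cdot M_{AH}\left(p_{\mathrm{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})\right), \tag{10}$$

where $c = \|\mathbf{c}\|_2^2$ is constant and $M_{AH}$ is the arithmetic harmonic mean divergence [37], of which the minimum is achieved if and only if $p_G(x, \hat{x}) = p_{\mathrm{data}}(x, \hat{x}) \Rightarrow p_G(x) = p_{\mathrm{data}}(x)$.

Proof. The objective function of the self-supervised task for the generator can be written as follows:

$$
\begin{aligned}
& \mathbb{E}_{z,\omega}\left[\left\|\hat{D}(T(G(z);\omega), G(z)) - \omega^+\right\|_2^2\right] - \mathbb{E}_{z,\omega}\left[\left\|\hat{D}(T(G(z);\omega), G(z)) - \omega^-\right\|_2^2\right] \\
={}& \iiint p_G(x, \omega, \hat{x}) \left[ \left\|\hat{D}(\hat{x}, x) - \omega^+\right\|_2^2 - \left\|\hat{D}(\hat{x}, x) - \omega^-\right\|_2^2 \right] dx\,d\omega\,d\hat{x} \\
={}& \iint p_G(x, \hat{x}) \left[ \left\| \frac{\mathbf{c}\left(p_{\mathrm{data}}(x, \hat{x}) - p_G(x, \hat{x})\right)}{p_{\mathrm{data}}(x, \hat{x}) + p_G(x, \hat{x})} - \mathbf{c} \right\|_2^2 - \left\| \frac{\mathbf{c}\left(p_{\mathrm{data}}(x, \hat{x}) - p_G(x, \hat{x})\right)}{p_{\mathrm{data}}(x, \hat{x}) + p_G(x, \hat{x})} + \mathbf{c} \right\|_2^2 \right] dx\,d\hat{x} \\
={}& c \iint p_G(x, \hat{x}) \left[ \left\| \frac{p_{\mathrm{data}}(x, \hat{x}) - p_G(x, \hat{x})}{p_{\mathrm{data}}(x, \hat{x}) + p_G(x, \hat{x})} - 1 \right\|_2^2 - \left\| \frac{p_{\mathrm{data}}(x, \hat{x}) - p_G(x, \hat{x})}{p_{\mathrm{data}}(x, \hat{x}) + p_G(x, \hat{x})} + 1 \right\|_2^2 \right] dx\,d\hat{x} \\
={}& c \iint p_G(x, \hat{x}) \left[ \left\| \frac{-2\,p_G(x, \hat{x})}{p_{\mathrm{data}}(x, \hat{x}) + p_G(x, \hat{x})} \right\|_2^2 - \left\| \frac{2\,p_{\mathrm{data}}(x, \hat{x})}{p_{\mathrm{data}}(x, \hat{x}) + p_G(x, \hat{x})} \right\|_2^2 \right] dx\,d\hat{x} \\
={}& 4c \iint p_G(x, \hat{x})\, \frac{p_G(x, \hat{x})^2 - p_{\mathrm{data}}(x, \hat{x})^2}{\left(p_{\mathrm{data}}(x, \hat{x}) + p_G(x, \hat{x})\right)^2}\, dx\,d\hat{x} \\
={}& 4c \iint p_G(x, \hat{x})\, \frac{p_G(x, \hat{x}) - p_{\mathrm{data}}(x, \hat{x})}{p_{\mathrm{data}}(x, \hat{x}) + p_G(x, \hat{x})}\, dx\,d\hat{x} \\
={}& 4c \cdot M_{AH}\left(p_{\mathrm{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})\right),
\end{aligned}
$$

where $c = \|\mathbf{c}\|_2^2$ is a constant scalar for the constant vector $\mathbf{c} = \omega^+ = -\omega^-$. The optimal generator that minimizes the AHM divergence satisfies $p_G(x, \hat{x}) = p_{\mathrm{data}}(x, \hat{x}) \Rightarrow p_G(x) = p_{\mathrm{data}}(x)$.

Table 6: IS and FID of AugSelf-BigGAN with different self-supervised tasks (SS: Equations (19) and (21); SS+: Equations (20) and (21); ASS: Equations (5) and (6)) on CIFAR-10 and CIFAR-100 with full and limited training data. The best is bold and the second best is underlined.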
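| Dataset | Method | 100% data: IS (↑) | 100% data: FID (↓) | 20% data: IS (↑) | 20% data: FID (↓) | 10% data: IS (↑) | 10% data: FID (↓) |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | Baseline | 9.29 ± 0.02 | 8.48 ± 0.13 | 8.84 ± 0.12 | 15.14 ± 0.47 | 8.80 ± 0.01 | 20.60 ± 0.13 |
| CIFAR-10 | SS (λg = 0) | 9.28 ± 0.06 | 8.30 ± 0.12 | 8.86 ± 0.09 | 13.96 ± 0.19 | 8.62 ± 0.09 | 20.94 ± 0.42 |
| CIFAR-10 | SS | 9.29 ± 0.07 | 8.24 ± 0.10 | 8.98 ± 0.04 | 13.42 ± 0.36 | 8.69 ± 0.05 | 19.40 ± 0.28 |
| CIFAR-10 | SS+ (λg = 0) | 9.25 ± 0.06 | 8.11 ± 0.18 | 8.81 ± 0.05 | 13.86 ± 0.31 | 8.76 ± 0.06 | 19.52 ± 0.24 |
| CIFAR-10 | SS+ | 9.26 ± 0.04 | 8.26 ± 0.28 | 8.85 ± 0.05 | 13.81 ± 0.27 | 8.63 ± 0.05 | 19.85 ± 0.18 |
| CIFAR-10 | ASS (λg = 0) | 9.12 ± 0.05 | 8.90 ± 0.07 | 8.82 ± 0.03 | 13.65 ± 0.70 | 8.52 ± 0.04 | 18.14 ± 0.33 |
| CIFAR-10 | ASS | 9.43 ± 0.14 | 7.68 ± 0.06 | 8.98 ± 0.09 | 10.97 ± 0.09 | 8.76 ± 0.05 | 15.68 ± 0.26 |
| CIFAR-100 | Baseline | 11.02 ± 0.07 | 11.49 ± 0.21 | 9.45 ± 0.05 | 24.98 ± 0.48 | 8.50 ± 0.09 | 34.92 ± 0.63 |
| CIFAR-100 | SS (λg = 0) | 11.04 ± 0.09 | 11.48 ± 0.43 | 9.81 ± 0.07 | 21.17 ± 0.19 | 9.11 ± 0.09 | 30.48 ± 0.57 |
| CIFAR-100 | SS | 10.96 ± 0.11 | 11.04 ± 0.15 | 9.99 ± 0.09 | 20.72 ± 0.33 | 9.23 ± 0.07 | 31.54 ± 1.12 |
| CIFAR-100 | SS+ (λg = 0) | 10.84 ± 0.09 | 11.48 ± 0.13 | 9.95 ± 0.07 | 21.22 ± 0.57 | 9.14 ± 0.05 | 31.02 ± 1.00 |
| CIFAR-100 | SS+ | 10.94 ± 0.10 | 10.94 ± 0.12 | 9.88 ± 0.06 | 22.72 ± 0.37 | 9.23 ± 0.14 | 31.40 ± 0.22 |
| CIFAR-100 | ASS (λg = 0) | 10.82 ± 0.10 | 11.29 ± 0.12 | 9.96 ± 0.11 | 18.90 ± 0.41 | 9.45 ± 0.05 | 25.77 ± 0.95 |
| CIFAR-100 | ASS | 11.19 ± 0.09 | 9.88 ± 0.07 | 10.25 ± 0.06 | 16.11 ± 0.25 | 9.78 ± 0.08 | 21.30 ± 0.15 |

Corollary 1. The following equality and inequality hold for the AHM divergence:

$$
\begin{aligned}
M_{AH}\left(p_{\mathrm{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})\right) + M_{AH}\left(p_G(x, \hat{x}) \,\|\, p_{\mathrm{data}}(x, \hat{x})\right) &= \Delta\left(p_{\mathrm{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})\right), \\
M_{AH}\left(p_{\mathrm{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})\right) &= 1 - W\left(p_{\mathrm{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})\right) \le 1,
\end{aligned}
$$

where $\Delta$ is the Le Cam (LC) divergence [23], and $W$ is the harmonic mean divergence [37].

Proof. We first prove the first corollary:

$$
\begin{aligned}
& M_{AH}\left(p_{\mathrm{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})\right) \\
\le{}& M_{AH}\left(p_{\mathrm{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})\right) + M_{AH}\left(p_G(x, \hat{x}) \,\|\, p_{\mathrm{data}}(x, \hat{x})\right) \\
={}& \iint p_G(x, \hat{x})\, \frac{p_G(x, \hat{x}) - p_{\mathrm{data}}(x, \hat{x})}{p_{\mathrm{data}}(x, \hat{x}) + p_G(x, \hat{x})}\, dx\,d\hat{x} + \iint p_{\mathrm{data}}(x, \hat{x})\, \frac{p_{\mathrm{data}}(x, \hat{x}) - p_G(x, \hat{x})}{p_G(x, \hat{x}) + p_{\mathrm{data}}(x, \hat{x})}\, dx\,d\hat{x} \\
={}& \iint p_G(x, \hat{x})\, \frac{p_G(x, \hat{x}) - p_{\mathrm{data}}(x, \hat{x})}{p_{\mathrm{data}}(x, \hat{x}) + p_G(x, \hat{x})}\, dx\,d\hat{x} - \iint p_{\mathrm{data}}(x, \hat{x})\, \frac{p_G(x, \hat{x}) - p_{\mathrm{data}}(x, \hat{x})}{p_G(x, \hat{x}) + p_{\mathrm{data}}(x, \hat{x})}\, dx\,d\hat{x} \\
={}& \iint \frac{\left(p_G(x, \hat{x}) - p_{\mathrm{data}}(x, \hat{x})\right)^2}{p_{\mathrm{data}}(x, \hat{x}) + p_G(x, \hat{x})}\, dx\,d\hat{x} \\
={}& \Delta\left(p_{\mathrm{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})\right),
\end{aligned}
$$

where $0 \le \Delta\left(p_{\mathrm{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})\right) \le 2$ is the Le Cam (LC) divergence [23].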
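The following proves the second corollary:

$$
\begin{aligned}
0 \le M_{AH}\left(p_{\mathrm{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})\right) &= \iint p_G(x, \hat{x})\, \frac{p_G(x, \hat{x}) - p_{\mathrm{data}}(x, \hat{x})}{p_{\mathrm{data}}(x, \hat{x}) + p_G(x, \hat{x})}\, dx\,d\hat{x} \\
&= \iint p_G(x, \hat{x}) \left( 1 - \frac{2\,p_{\mathrm{data}}(x, \hat{x})}{p_{\mathrm{data}}(x, \hat{x}) + p_G(x, \hat{x})} \right) dx\,d\hat{x} \\
&= 1 - \iint \frac{2\,p_{\mathrm{data}}(x, \hat{x})\,p_G(x, \hat{x})}{p_{\mathrm{data}}(x, \hat{x}) + p_G(x, \hat{x})}\, dx\,d\hat{x} \\
&= 1 - W\left(p_{\mathrm{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})\right) \le 1,
\end{aligned}
$$

where $0 \le W\left(p_{\mathrm{data}}(x, \hat{x}) \,\|\, p_G(x, \hat{x})\right) \le 1$ is the well-known harmonic mean divergence [37].

B Loss Functions

By default, AugSelf-GAN adopts all augmentation types of DiffAugment as self-supervised signals, so the self-supervised loss functions of the discriminator and the generator (Equations (5) and (6)) each comprise three sub-losses. Specifically, for AugSelf-GAN that uses color, translation, and cutout as self-supervision, the objective functions are:

$$\min_{D,\, \hat{D}_{\mathrm{color}},\, \hat{D}_{\mathrm{translation}},\, \hat{D}_{\mathrm{cutout}}}\ \mathcal{L}^{da}_{D} + \lambda_d \cdot \left( \mathcal{L}^{\mathrm{color}}_{\hat{D}_{\mathrm{color}}} + \mathcal{L}^{\mathrm{translation}}_{\hat{D}_{\mathrm{translation}}} + \mathcal{L}^{\mathrm{cutout}}_{\hat{D}_{\mathrm{cutout}}} \right)$$

$$\min_{G}\ \mathcal{L}^{da}_{G} + \lambda_g \cdot \left( \mathcal{L}^{\mathrm{color}}_{G} + \mathcal{L}^{\mathrm{translation}}_{G} + \mathcal{L}^{\mathrm{cutout}}_{G} \right), \tag{12}$$

where $\mathcal{L}^{\mathrm{color}}_{\hat{D}_{\mathrm{color}}}$, $\mathcal{L}^{\mathrm{translation}}_{\hat{D}_{\mathrm{translation}}}$, $\mathcal{L}^{\mathrm{cutout}}_{\hat{D}_{\mathrm{cutout}}}$, $\mathcal{L}^{\mathrm{color}}_{G}$, $\mathcal{L}^{\mathrm{translation}}_{G}$, and $\mathcal{L}^{\mathrm{cutout}}_{G}$ are defined as:

$$\mathcal{L}^{\mathrm{color}}_{\hat{D}_{\mathrm{color}}} = \mathbb{E}_{x,\omega}\left[\left\|\hat{D}_{\mathrm{color}}(T(x;\omega), x) - \omega^+_{\mathrm{color}}\right\|_2^2\right] + \mathbb{E}_{z,\omega}\left[\left\|\hat{D}_{\mathrm{color}}(T(G(z);\omega), G(z)) - \omega^-_{\mathrm{color}}\right\|_2^2\right], \tag{13}$$

$$\mathcal{L}^{\mathrm{translation}}_{\hat{D}_{\mathrm{translation}}} = \mathbb{E}_{x,\omega}\left[\left\|\hat{D}_{\mathrm{translation}}(T(x;\omega), x) - \omega^+_{\mathrm{translation}}\right\|_2^2\right] + \mathbb{E}_{z,\omega}\left[\left\|\hat{D}_{\mathrm{translation}}(T(G(z);\omega), G(z)) - \omega^-_{\mathrm{translation}}\right\|_2^2\right], \tag{14}$$

$$\mathcal{L}^{\mathrm{cutout}}_{\hat{D}_{\mathrm{cutout}}} = \mathbb{E}_{x,\omega}\left[\left\|\hat{D}_{\mathrm{cutout}}(T(x;\omega), x) - \omega^+_{\mathrm{cutout}}\right\|_2^2\right] + \mathbb{E}_{z,\omega}\left[\left\|\hat{D}_{\mathrm{cutout}}(T(G(z);\omega), G(z)) - \omega^-_{\mathrm{cutout}}\right\|_2^2\right], \tag{15}$$

$$\mathcal{L}^{\mathrm{color}}_{G} = \mathbb{E}_{z,\omega}\left[\left\|\hat{D}_{\mathrm{color}}(T(G(z);\omega), G(z)) - \omega^+_{\mathrm{color}}\right\|_2^2\right] - \mathbb{E}_{z,\omega}\left[\left\|\hat{D}_{\mathrm{color}}(T(G(z);\omega), G(z)) - \omega^-_{\mathrm{color}}\right\|_2^2\right], \tag{16}$$

$$\mathcal{L}^{\mathrm{translation}}_{G} = \mathbb{E}_{z,\omega}\left[\left\|\hat{D}_{\mathrm{translation}}(T(G(z);\omega), G(z)) - \omega^+_{\mathrm{translation}}\right\|_2^2\right] - \mathbb{E}_{z,\omega}\left[\left\|\hat{D}_{\mathrm{translation}}(T(G(z);\omega), G(z)) - \omega^-_{\mathrm{translation}}\right\|_2^2\right], \tag{17}$$

$$\mathcal{L}^{\mathrm{cutout}}_{G} = \mathbb{E}_{z,\omega}\left[\left\|\hat{D}_{\mathrm{cutout}}(T(G(z);\omega), G(z)) - \omega^+_{\mathrm{cutout}}\right\|_2^2\right] - \mathbb{E}_{z,\omega}\left[\left\|\hat{D}_{\mathrm{cutout}}(T(G(z);\omega), G(z)) - \omega^-_{\mathrm{cutout}}\right\|_2^2\right], \tag{18}$$

where $\hat{D}_{\mathrm{color}} = \varphi_{\mathrm{color}} \circ \phi$, $\hat{D}_{\mathrm{translation}} = \varphi_{\mathrm{translation}} \circ \phi$, and $\hat{D}_{\mathrm{cutout}} = \varphi_{\mathrm{cutout}} \circ \phi$ share the backbone $\phi$ but differ in the heads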
$\varphi_{\mathrm{color}}$, $\varphi_{\mathrm{translation}}$, or $\varphi_{\mathrm{cutout}}$, respectively. For notational simplicity, we write Equations (5) and (6) in the main text as the objective function.

C Ablation Studies

Self-supervised tasks. We introduce two non-adversarial self-supervised tasks for comparison. In the first version, the discriminator only learns self-supervision on real data:

$$\mathcal{L}^{ss}_{\hat{D}} = \mathbb{E}_{x \sim p_{\mathrm{data}}(x),\, \omega \sim p(\omega)}\left[\left\|\hat{D}(T(x;\omega), x) - \omega\right\|_2^2\right]. \tag{19}$$

In the second version, the discriminator learns self-supervised tasks on both real and generated data simultaneously:

$$\mathcal{L}^{ss}_{\hat{D}} = \mathbb{E}_{x,\omega}\left[\left\|\hat{D}(T(x;\omega), x) - \omega\right\|_2^2\right] + \mathbb{E}_{z,\omega}\left[\left\|\hat{D}(T(G(z);\omega), G(z)) - \omega\right\|_2^2\right]. \tag{20}$$

For both versions, the generator is encouraged to produce augmentation-recognizable data, as follows:

$$\mathcal{L}^{ss}_{G} = \mathbb{E}_{z \sim p(z),\, \omega \sim p(\omega)}\left[\left\|\hat{D}(T(G(z);\omega), G(z)) - \omega\right\|_2^2\right]. \tag{21}$$

According to the self-supervised task, we denote the different approaches as SS (Equations (19) and (21)), SS+ (Equations (20) and (21)), and ASS (Equations (5) and (6), short for adversarial self-supervised learning, i.e., the proposed AugSelf-GAN).

Table 7: IS and FID of AugSelf-BigGAN with different self-supervised signals on CIFAR-10 and CIFAR-100 with full and limited training data. The self-supervised signal to be predicted is marked by the symbol ✓ (a "?" denotes a mark that is missing from this copy). Notice that all methods adopt color, translation, and cutout as data augmentation.

| Dataset | color | trans. | cutout | 100% data: IS (↑) | 100% data: FID (↓) | 20% data: IS (↑) | 20% data: FID (↓) | 10% data: IS (↑) | 10% data: FID (↓) |
|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | | | | 9.29 ± .02 | 8.48 ± .13 | 8.84 ± .12 | 15.14 ± .47 | 8.80 ± .01 | 20.60 ± .13 |
| CIFAR-10 | ? | ? | ? | 9.30 ± .05 | 7.48 ± .07 | 8.94 ± .07 | 11.73 ± .27 | 8.73 ± .12 | 16.66 ± .39 |
| CIFAR-10 | ? | ? | ? | 9.39 ± .07 | 7.51 ± .12 | 8.95 ± .05 | 11.82 ± .21 | 8.80 ± .03 | 16.27 ± .35 |
| CIFAR-10 | ? | ? | ? | 9.38 ± .05 | 7.41 ± .11 | 8.87 ± .04 | 11.19 ± .08 | 8.63 ± .08 | 16.30 ± .57 |
| CIFAR-10 | ? | ? | ? | 9.50 ± .07 | 7.57 ± .07 | 8.99 ± .06 | 11.38 ± .14 | 8.72 ± .09 | 16.50 ± .14 |
| CIFAR-10 | ? | ? | ? | 9.41 ± .07 | 7.51 ± .05 | 8.92 ± .12 | 11.20 ± .17 | 8.83 ± .07 | 15.55 ± .21 |
| CIFAR-10 | ? | ? | ? | 9.42 ± .06 | 7.43 ± .06 | 8.91 ± .09 | 11.19 ± .20 | 8.58 ± .02 | 15.17 ± .15 |
| CIFAR-10 | ✓ | ✓ | ✓ | 9.43 ± .14 | 7.68 ± .06 | 8.98 ± .09 | 10.97 ± .09 | 8.76 ± .05 | 15.68 ± .26 |
| CIFAR-100 | | | | 11.02 ± .07 | 11.49 ± .21 | 9.45 ± .05 | 24.98 ± .48 | 8.50 ± .09 | 34.92 ± .63 |
| CIFAR-100 | ? | ? | ? | 11.16 ± .13 | 10.03 ± .14 | 10.18 ± .09 | 19.45 ± .62 | 9.46 ± .12 | 25.99 ± .62 |
| CIFAR-100 | ? | ? | ? | 11.35 ± .10 | 9.88 ± .08 | 10.01 ± .07 | 18.39 ± .07 | 9.29 ± .10 | 26.50 ± .36 |
| CIFAR-100 | ? | ? | ? | 11.26 ± .07 | 9.75 ± .05 | 10.35 ± .20 | 16.88 ± .46 | 9.92 ± .08 | 23.09 ± .72 |
| CIFAR-100 | ? | ? | ? | 11.20 ± .08 | 10.01 ± .13 | 9.90 ± .11 | 18.16 ± .62 | 9.39 ± .04 | 21.48 ± .14 |
| CIFAR-100 | ? | ? | ? | 11.17 ± .12 | 10.12 ± .20 | 10.14 ± .09 | 17.11 ± .62 | 9.94 ± .12 | 24.52 ± .76 |
| CIFAR-100 | ? | ? | ? | 11.26 ± .16 | 9.65 ± .04 | 10.21 ± .12 | 17.32 ± .51 | 9.92 ± .06 | 22.94 ± .17 |
| CIFAR-100 | ✓ | ✓ | ✓ | 11.19 ± .09 | 9.88 ± .07 | 10.25 ± .06 | 16.11 ± .25 | 9.78 ± .08 | 21.30 ± .15 |

Table 8: IS and FID of AugSelf-BigGAN with different architectures and fusions of $\varphi$ on CIFAR-10 and CIFAR-100 with 10% training data. The best result is bold and the second best is underlined. Input can be concatenation $[\phi(x), \phi(\hat{x})] \in \mathbb{R}^{2d}$, subtraction $\phi(\hat{x}) - \phi(x)$, and augmentation $\phi(\hat{x})$.

| Architecture | Input | CIFAR-10 10% data: IS (↑) | CIFAR-10 10% data: FID (↓) | CIFAR-100 10% data: IS (↑) | CIFAR-100 10% data: FID (↓) |
|---|---|---|---|---|---|
| two-layer MLP | concatenation | 8.54 ± 0.07 | 18.47 ± 0.37 | 8.65 ± 0.11 | 30.15 ± 0.47 |
| two-layer MLP | subtraction | 8.62 ± 0.07 | 16.94 ± 0.49 | 8.88 ± 0.06 | 28.63 ± 0.36 |
| two-layer MLP | augmentation | 8.78 ± 0.05 | 19.06 ± 0.81 | 8.73 ± 0.10 | 29.40 ± 0.91 |
| linear layer | concatenation | 8.85 ± 0.11 | 17.52 ± 0.20 | 9.37 ± 0.09 | 25.22 ± 0.25 |
| linear layer | subtraction | 8.76 ± 0.05 | 15.68 ± 0.26 | 9.78 ± 0.08 | 21.30 ± 0.15 |
| linear layer | augmentation | 8.66 ± 0.09 | 20.29 ± 0.36 | 9.92 ± 0.10 | 24.47 ± 0.58 |
| bilinear layer | - | 8.57 ± 0.04 | 25.27 ± 0.16 | 9.03 ± 0.03 | 26.72 ± 0.17 |

Table 6 reports the comparison between methods with different self-supervised tasks.
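To make the difference between the compared tasks concrete, the discriminator-side objectives of SS (Equation (19)), SS+ (Equation (20)), and ASS (Equation (5)) can be sketched in plain Python as below. This is a minimal illustration rather than the released implementation: `pred_real` and `pred_fake` stand for hypothetical outputs of the self-supervised discriminator on augmented real and generated samples, the function names are invented for this sketch, and the adversarial targets are assumed to be ω+ = ω for real data and ω− = −ω for generated data.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def regression_loss(preds, targets):
    """Batch mean of squared distances between predictions and targets."""
    dists = [sq_dist(p, t) for p, t in zip(preds, targets)]
    return sum(dists) / len(dists)


def negate(batch):
    """Element-wise negation of a batch of augmentation-parameter vectors."""
    return [[-w for w in vec] for vec in batch]


def d_loss_ss(pred_real, omega_real):
    """SS, Eq. (19): predict omega on augmented real data only."""
    return regression_loss(pred_real, omega_real)


def d_loss_ss_plus(pred_real, pred_fake, omega_real, omega_fake):
    """SS+, Eq. (20): predict omega on both real and generated data."""
    return (regression_loss(pred_real, omega_real)
            + regression_loss(pred_fake, omega_fake))


def d_loss_ass(pred_real, pred_fake, omega_real, omega_fake):
    """ASS, Eq. (5): real target is omega+, generated target is omega-.

    Assumption of this sketch: omega+ = omega and omega- = -omega.
    """
    return (regression_loss(pred_real, omega_real)
            + regression_loss(pred_fake, negate(omega_fake)))
```

With these definitions, a discriminator that predicts ω perfectly on generated data incurs zero SS+ loss on that term but a positive ASS loss, which is the sense in which the ASS targets separate real from generated data.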
We also conducted generator-free self-supervised learning experiments by setting the hyper-parameter λg = 0 for each kind of self-supervised task. According to the FID scores, ASS is significantly superior to SS and SS+. Even without the self-supervised task of the generator, ASS (λg = 0) outperforms SS and SS+, which include generator self-supervised tasks, in most limited-data (10% and 20%) settings (the exception in Table 6 is SS on CIFAR-10 with 20% data), and the introduction of generator self-supervised tasks further expands this advantage.

Self-supervised signals. We empirically analyze the role of different augmentation parameters as self-supervised signals for AugSelf-GAN. All methods employ three types of data augmentation (color, translation, and cutout), with the difference lying in the predicted self-supervised signals. According to the FID scores in Table 7, predicting any augmentation parameters can significantly improve the final generative performance compared to the baseline. Although there is no significant difference among them, we still choose to predict all augmentation parameters as the default setting for AugSelf-GAN. This experiment demonstrates that the self-supervised task itself plays a decisive role in training AugSelf-GAN, rather than the specific predicted augmentations. Therefore, we believe that our method is generalizable and can be extended to other advanced data augmentations.

Network architectures. We investigate the impact of different network architectures (two-layer multi-layer perceptron (MLP), linear layer, and bilinear layer) and input approaches (concatenation $[\phi(x), \phi(\hat{x})]$, subtraction $\phi(\hat{x}) - \phi(x)$, and augmented samples only $\phi(\hat{x})$) on AugSelf-GAN. As reported in Table 8, we found that more complicated architectures of the prediction head $\varphi$, such as a two-layer MLP with input concatenation ($\hat{D}(T(x;\omega), x) = \mathrm{MLP}([\phi(T(x;\omega)), \phi(x)])$) or a bilinear layer ($\hat{D}(T(x;\omega), x) = \phi(T(x;\omega))^\top W \phi(x)$, $W \in \mathbb{R}^{d \times d}$), perform worse than the simple linear layer with input subtraction ($\hat{D}(T(x;\omega), x) = \varphi(\phi(T(x;\omega)) - \phi(x))$), which is the design of AugSelf-GAN. The reason might be that complicated architectures of the head $\varphi$ actually discourage the backbone $\phi$ from capturing rich and linear representations, resulting in poor generalization of discriminators.

Table 9: IS and FID of AugSelf-BigGAN with different loss functions on CIFAR-10 and CIFAR-100 with full and limited training data. The best result is bold and the second best is underlined.

| Dataset | Method | 100% data: IS (↑) | 100% data: FID (↓) | 20% data: IS (↑) | 20% data: FID (↓) | 10% data: IS (↑) | 10% data: FID (↓) |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | λd = 0, λg = 0 | 9.29 ± .02 | 8.48 ± .13 | 8.84 ± .12 | 15.14 ± .47 | 8.80 ± .01 | 20.60 ± .13 |
| CIFAR-10 | λd = 1, λg = 0 | 9.12 ± .05 | 8.90 ± .07 | 8.82 ± .03 | 13.65 ± .70 | 8.52 ± .04 | 18.14 ± .33 |
| CIFAR-10 | saturating | 9.27 ± .06 | 8.13 ± .14 | 8.83 ± .04 | 12.76 ± .29 | 8.42 ± .05 | 18.43 ± .21 |
| CIFAR-10 | non-saturating | 9.37 ± .07 | 7.60 ± .08 | 8.91 ± .09 | 11.44 ± .16 | 8.63 ± .10 | 15.86 ± .26 |
| CIFAR-10 | combination | 9.43 ± .14 | 7.68 ± .06 | 8.98 ± .09 | 10.97 ± .09 | 8.76 ± .05 | 15.68 ± .26 |
| CIFAR-100 | λd = 0, λg = 0 | 11.02 ± .07 | 11.49 ± .21 | 9.45 ± .05 | 24.98 ± .48 | 8.50 ± .09 | 34.92 ± .63 |
| CIFAR-100 | λd = 1, λg = 0 | 10.82 ± .10 | 11.29 ± .12 | 9.96 ± .11 | 18.90 ± .41 | 9.45 ± .05 | 25.77 ± .95 |
| CIFAR-100 | saturating | 10.90 ± .07 | 10.53 ± .21 | 9.71 ± .08 | 18.78 ± .43 | 9.30 ± .09 | 25.92 ± .12 |
| CIFAR-100 | non-saturating | 11.22 ± .14 | 9.90 ± .09 | 10.07 ± .02 | 16.12 ± .26 | 9.81 ± .10 | 21.99 ± .87 |
| CIFAR-100 | combination | 11.19 ± .09 | 9.88 ± .07 | 10.25 ± .06 | 16.11 ± .25 | 9.78 ± .08 | 21.30 ± .15 |

Generator loss functions. We study the effects of different generator loss functions on AugSelf-GAN. In Table 9, the hyper-parameters λd = 0, λg = 0 represent the baseline BigGAN + DiffAugment. The generation performance is improved with only augmentation-aware self-supervision for the discriminator (λd = 1, λg = 0). The saturating version of the self-supervised generator loss shows no significant further improvement. The non-saturating version yields performance comparable to the combination one (AugSelf-GAN), and both substantially surpass the others.
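The generator loss variants compared in Table 9 can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes ω+ = ω and ω− = −ω, takes the non-saturating variant to be the first term of Equation (6) (pull predictions on generated data toward the real target ω+), the saturating variant to be the second term (push predictions away from the fake target ω−), and the combination to be their sum, which is Equation (6). The function name `g_loss` and its `variant` argument are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def g_loss(pred_fake, omega, variant="combination"):
    """Self-supervised generator loss on a batch of generated samples.

    pred_fake: hypothetical discriminator predictions on augmented fakes.
    omega:     augmentation parameters that were actually applied.
    Assumption of this sketch: omega+ = omega and omega- = -omega.
    """
    n = len(pred_fake)
    # Distance to the "real" target omega+.
    to_pos = sum(sq_dist(p, w) for p, w in zip(pred_fake, omega)) / n
    # Distance to the "fake" target omega-.
    to_neg = sum(sq_dist(p, [-x for x in w])
                 for p, w in zip(pred_fake, omega)) / n

    if variant == "non-saturating":
        return to_pos
    if variant == "saturating":
        return -to_neg
    if variant == "combination":
        return to_pos - to_neg
    raise ValueError(f"unknown variant: {variant}")
```

Under this reading, the non-saturating term is minimized exactly when the prediction equals ω+, whereas the saturating term only rewards moving away from ω− and is unbounded below.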
The reason is that the non-saturating loss directly encourages the generator to generate augmentation-predictable real data, providing more informative guidance than the saturating one.

D Additional Results

Figure 7: FID plots comparison of AugSelf-GAN and baselines on CIFAR-10 and CIFAR-100. Panels: 100% training data, 20% training data, 10% training data.

Table 10: FID comparison on AFHQ. The number within the parentheses represents the percentage of improvement (↓) in AugSelf-StyleGAN2 compared to the baseline StyleGAN2 + DiffAugment.

| Method | Cat | Dog | Wild |
|---|---|---|---|
| StyleGAN2 [19] | 5.13 | 19.4 | 3.48 |
| + DiffAugment [47] | 3.49 | 8.75 | 2.69 |
| AugSelf-StyleGAN2 | 3.23 (↓ 7.45%) | 8.17 (↓ 6.63%) | 2.48 (↓ 7.81%) |

Figure 8: Visual comparison between StyleGAN2 + DiffAugment and AugSelf-StyleGAN2 on AFHQ. Panels: StyleGAN2 + DiffAugment, AugSelf-StyleGAN2.

Training stability. Figure 7 shows the FID plots during training of our method and baselines on CIFAR-10 and CIFAR-100. Overall, AugSelf-BigGAN demonstrates more stable convergence than the baselines, while AugSelf-BigGAN+ goes even further.

AFHQ. We also conduct experiments on the AFHQ dataset [6] to compare our proposed method with DiffAugment, employing StyleGAN2 as the backbone model. The hyper-parameters are set as λd = 0.1, λd = 0.5, and λd = 0.2 on the Cat, Dog, and Wild domains, respectively, and λg = 0.1 on all three domains. The quantitative results in Table 10 suggest that AugSelf-StyleGAN2 yields an FID improvement of approximately 7% across all three domains compared to DiffAugment. Figure 8 presents a qualitative comparison between these two methods, demonstrating a significant improvement in image quality for our method.

FFHQ and LSUN-Cat. Figures 9 and 10 present randomly generated images produced by StyleGAN2 + DiffAugment and AugSelf-StyleGAN2+ on the FFHQ [17] and LSUN-Cat [45] datasets with varying amounts of training data, respectively.
Our AugSelf-StyleGAN2+ demonstrates significant superiority over StyleGAN2 + DiffAugment in terms of the visual quality of generated images.

Figure 9: Comparison of randomly generated images between StyleGAN2 + DiffAugment and AugSelf-StyleGAN2+ on FFHQ with different amounts of training data. Panels: StyleGAN2 + DiffAugment and AugSelf-StyleGAN2+ with 1k, 5k, 10k, and 30k training samples.

Figure 10: Comparison of randomly generated images between StyleGAN2 + DiffAugment and AugSelf-StyleGAN2+ on LSUN-Cat with different amounts of training data. Panels: StyleGAN2 + DiffAugment and AugSelf-StyleGAN2+ with 1k, 5k, 10k, and 30k training samples.
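As a numerical sanity check of Theorem 1 and Corollary 1, the identities can be verified on small discrete distributions, with sums replacing the integrals. The sketch below is only an illustration under stated assumptions: the distributions are toy probability vectors, the augmentation parameter is a scalar constant `c` (so the constant in the theorem is c squared), and all function names are invented for this example.

```python
def m_ah(p, q):
    """Discrete M_AH(p || q) = sum_i q_i * (q_i - p_i) / (p_i + q_i)."""
    return sum(qi * (qi - pi) / (pi + qi)
               for pi, qi in zip(p, q) if pi + qi > 0)


def le_cam(p, q):
    """Discrete Le Cam divergence: sum_i (p_i - q_i)^2 / (p_i + q_i)."""
    return sum((pi - qi) ** 2 / (pi + qi)
               for pi, qi in zip(p, q) if pi + qi > 0)


def harmonic_mean(p, q):
    """Discrete harmonic mean divergence: sum_i 2 p_i q_i / (p_i + q_i)."""
    return sum(2 * pi * qi / (pi + qi)
               for pi, qi in zip(p, q) if pi + qi > 0)


def generator_objective(p_data, p_g, c):
    """Generator self-supervised objective under the optimal discriminator.

    Assumes scalar targets omega+ = c and omega- = -c, so that the optimal
    prediction is d_star = c * (p_data - p_g) / (p_data + p_g).
    """
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd + pg == 0:
            continue
        d_star = c * (pd - pg) / (pd + pg)
        total += pg * ((d_star - c) ** 2 - (d_star + c) ** 2)
    return total


if __name__ == "__main__":
    p_data = [0.5, 0.5]
    p_g = [0.9, 0.1]
    print("M_AH          :", m_ah(p_data, p_g))
    print("1 - W         :", 1 - harmonic_mean(p_data, p_g))
    print("Le Cam        :", le_cam(p_data, p_g))
    print("objective / 4c:", generator_objective(p_data, p_g, 3.0) / (4 * 9.0))
```

Running the script prints matching values for `M_AH`, `1 - W`, and the rescaled generator objective, and the Le Cam value equals the sum of the two directed `M_AH` terms; all quantities vanish when the two distributions coincide.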