# Diffusion-GAN: Training GANs with Diffusion

Published as a conference paper at ICLR 2023.

Zhendong Wang¹,², Huangjie Zheng¹,², Pengcheng He², Weizhu Chen², Mingyuan Zhou¹
¹The University of Texas at Austin, ²Microsoft Azure AI
{zhendong.wang, huangjie.zheng}@utexas.edu, {penhe,wzchen}@microsoft.com, mingyuan.zhou@mccombs.utexas.edu

## ABSTRACT

Generative adversarial networks (GANs) are challenging to train stably, and a promising remedy of injecting instance noise into the discriminator input has not been very effective in practice. In this paper, we propose Diffusion-GAN, a novel GAN framework that leverages a forward diffusion chain to generate Gaussian-mixture distributed instance noise. Diffusion-GAN consists of three components, including an adaptive diffusion process, a diffusion timestep-dependent discriminator, and a generator. Both the observed and generated data are diffused by the same adaptive diffusion process. At each diffusion timestep, there is a different noise-to-data ratio and the timestep-dependent discriminator learns to distinguish the diffused real data from the diffused generated data. The generator learns from the discriminator's feedback by backpropagating through the forward diffusion chain, whose length is adaptively adjusted to balance the noise and data levels. We theoretically show that the discriminator's timestep-dependent strategy gives consistent and helpful guidance to the generator, enabling it to match the true data distribution. We demonstrate the advantages of Diffusion-GAN over strong GAN baselines on various datasets, showing that it can produce more realistic images with higher stability and data efficiency than state-of-the-art GANs.

## 1 INTRODUCTION

Generative adversarial networks (GANs) (Goodfellow et al., 2014) and their variants (Brock et al., 2018; Karras et al., 2019; 2020a; Zhao et al., 2020) have achieved great success in synthesizing photo-realistic high-resolution images. GANs in practice, however, are known to suffer from a variety of issues ranging from non-convergence and training instability to mode collapse (Arjovsky and Bottou, 2017; Mescheder et al., 2018). As a result, a wide array of analyses and modifications has been proposed for GANs, including improving the network architectures (Karras et al., 2019; Radford et al., 2016; Sauer et al., 2021; Zhang et al., 2019), gaining theoretical understanding of GAN training (Arjovsky and Bottou, 2017; Heusel et al., 2017; Mescheder et al., 2017; 2018), changing the objective functions (Arjovsky et al., 2017; Bellemare et al., 2017; Deshpande et al., 2018; Li et al., 2017a; Nowozin et al., 2016; Zheng and Zhou, 2021; Yang et al., 2021), regularizing the weights and/or gradients (Arjovsky et al., 2017; Fedus et al., 2018; Mescheder et al., 2018; Miyato et al., 2018a; Roth et al., 2017; Salimans et al., 2016), utilizing side information (Wang et al., 2018; Zhang et al., 2017; 2020b), adding a mapping from the data to latent representation (Donahue et al., 2016; Dumoulin et al., 2016; Li et al., 2017b), and applying differentiable data augmentation (Karras et al., 2020a; Zhang et al., 2020a; Zhao et al., 2020).

A simple technique to stabilize GAN training is to inject instance noise, i.e., to add noise to the discriminator input, which can widen the support of both the generator and discriminator distributions and prevent the discriminator from overfitting (Arjovsky and Bottou, 2017; Sønderby et al., 2017).
However, this technique is hard to implement in practice, as finding a suitable noise distribution is challenging (Arjovsky and Bottou, 2017). Roth et al. (2017) show that adding instance noise to the high-dimensional discriminator input does not work well, and propose to approximate it by adding a zero-centered gradient penalty on the discriminator. This approach is theoretically and empirically shown to converge by Mescheder et al. (2018), who also demonstrate that adding zero-centered gradient penalties to non-saturating GANs can result in stable training and better or comparable generation quality compared to WGAN-GP (Arjovsky et al., 2017). However, Brock et al. (2018) caution that zero-centered gradient penalties and other similar regularization methods may stabilize training at the cost of generation performance. To the best of our knowledge, no existing work has empirically demonstrated the success of using instance noise for GAN training on high-dimensional image data.

Figure 1: Flowchart for Diffusion-GAN. The top-row images represent the forward diffusion process of a real image, while the bottom-row images represent the forward diffusion process of a generated fake image. The timestep-dependent discriminator $D(y, t)$ learns to distinguish a diffused real image from a diffused fake image at all diffusion steps, and the maximum step $T$ is adaptively adjusted.

To inject proper instance noise that can facilitate GAN training, we introduce Diffusion-GAN, which uses a diffusion process to generate Gaussian-mixture distributed instance noise. We show a graphical representation of Diffusion-GAN in Figure 1. In Diffusion-GAN, the input to the diffusion process is either a real or a generated image, and the diffusion process consists of a series of steps that gradually add noise to the image. The number of diffusion steps is not fixed, but depends on the data and the generator. We also design the diffusion process to be differentiable, which means that we can compute the derivative of the output with respect to the input. This allows us to propagate the gradient from the discriminator to the generator through the diffusion process, and update the generator accordingly. Unlike vanilla GANs, which compare the real and generated images directly, Diffusion-GAN compares their noisy versions, which are obtained by sampling from the Gaussian mixture distribution over the diffusion steps, with the help of our timestep-dependent discriminator. This distribution has the property that its components have different noise-to-data ratios, which means that some components add more noise than others. By sampling from this distribution, we achieve two benefits: first, we stabilize training by easing the vanishing-gradient problem, which occurs when the data and generator distributions are too different; second, we augment the data by creating different noisy versions of the same image, which improves data efficiency and the diversity of the generator. We provide a theoretical analysis to support our method, and show that the min-max objective function of Diffusion-GAN, which measures the difference between the data and generator distributions, is continuous and differentiable everywhere.
This means that the generator in theory can always receive a useful gradient from the discriminator, and improve its performance. Our main contributions include: 1) We show both theoretically and empirically how the diffusion process can be utilized to provide a model- and domain-agnostic differentiable augmentation, enabling data-efficient and leaking-free stable GAN training. 2) Extensive experiments show that Diffusion-GAN boosts the stability and generation performance of strong baselines, including StyleGAN2 (Karras et al., 2020b), Projected GAN (Sauer et al., 2021), and InsGen (Yang et al., 2021), achieving state-of-the-art results in synthesizing photo-realistic images, as measured by both the Fréchet Inception Distance (FID) (Heusel et al., 2017) and Recall score (Kynkäänniemi et al., 2019).

## 2 PRELIMINARIES: GANS AND DIFFUSION-BASED GENERATIVE MODELS

GANs (Goodfellow et al., 2014) are a class of generative models that aim to learn the data distribution $p(x)$ of a target dataset by setting up a min-max game between two neural networks: a generator and a discriminator. The generator $G$ takes as input a random noise vector $z$ sampled from a simple prior distribution $p(z)$, such as a standard normal or uniform distribution, and tries to produce realistic-looking samples $G(z)$ that resemble the data. The discriminator $D$ receives either a real data sample $x$ drawn from $p(x)$ or a fake sample $G(z)$ generated by $G$, and tries to correctly classify them as real or fake. The goal of $G$ is to fool $D$ into making mistakes, while the goal of $D$ is to accurately distinguish $G(z)$ from $x$. The min-max objective function of GANs is given by

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p(x)}[\log(D(x))] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))].$$

In practice, this vanilla objective function is often modified to improve the stability and performance of GANs (Goodfellow et al., 2014; Miyato et al., 2018a; Fedus et al., 2018), but the general idea of adversarial learning between $G$ and $D$ remains the same.

Diffusion-based generative models (Ho et al., 2020b; Sohl-Dickstein et al., 2015; Song and Ermon, 2019) assume $p_\theta(x_0) := \int p_\theta(x_{0:T})\, dx_{1:T}$, where $x_1, \ldots, x_T$ are latent variables of the same dimensionality as the data $x_0 \sim p(x_0)$. There is a forward diffusion chain that gradually adds noise to the data $x_0 \sim q(x_0)$ in $T$ steps with a pre-defined variance schedule $\beta_t$ and variance $\sigma^2$:

$$q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) := \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t \sigma^2 I).$$

A notable property is that $x_t$ at an arbitrary time-step $t$ can be sampled in closed form as

$$q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t)\sigma^2 I), \quad \text{where } \alpha_t := 1 - \beta_t,\ \bar{\alpha}_t := \textstyle\prod_{s=1}^{t} \alpha_s. \tag{1}$$

A variational lower bound (Blei et al., 2017) is then used to optimize the reverse diffusion chain as $p_\theta(x_{0:T}) := \mathcal{N}(x_T; 0, \sigma^2 I) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$.

## 3 DIFFUSION-GAN: METHOD AND THEORETICAL ANALYSIS

To construct Diffusion-GAN, we describe how to inject instance noise via diffusion, how to train the generator by backpropagating through the forward diffusion process, and how to adaptively adjust the diffusion intensity. We further provide theoretical analysis illustrated with a toy example.

### 3.1 INSTANCE NOISE INJECTION VIA DIFFUSION

We aim to generate realistic samples $x_g$ from a generator network $G$ that maps a latent variable $z$ sampled from a simple prior distribution $p(z)$ to a high-dimensional data space, such as images. The distribution of generator samples $x_g = G(z),\ z \sim p(z)$ is denoted by $p_g(x) = \int p(x_g \mid z)\, p(z)\, dz$.
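To make the closed-form sampling of Equation (1) concrete, here is a minimal NumPy sketch; the linear $\beta_t$ schedule and the value of $\sigma$ are illustrative assumptions rather than the configuration used in the paper.

```python
import numpy as np

def make_alpha_bars(T=1000, beta_min=1e-4, beta_max=0.02):
    """Cumulative products alpha_bar_t under an assumed linear beta_t schedule."""
    betas = np.linspace(beta_min, beta_max, T)   # beta_t, t = 1..T
    alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
    return np.cumprod(alphas)                    # alpha_bar_t = prod_{s<=t} alpha_s

def diffuse(x0, t, alpha_bars, sigma=1.0, rng=np.random):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) sigma^2 I)."""
    a_bar = alpha_bars[t - 1]                    # t is 1-indexed as in the paper
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * sigma * eps

# Usage: diffuse a batch of flattened images at a mid-range timestep.
alpha_bars = make_alpha_bars()
x0 = np.random.rand(8, 32 * 32 * 3)              # stand-in for real or generated images
xt = diffuse(x0, t=300, alpha_bars=alpha_bars)
```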
To make the generator more robust and diverse, we inject instance noise into the generated samples $x_g$ by applying a diffusion process that adds Gaussian noise at each step. The diffusion process can be seen as a Markov chain that starts from the original sample $x$ and gradually erases its information until reaching a noise level $\sigma^2$ after $T$ steps. We define a mixture distribution $q(y \mid x)$ that models the noisy samples $y$ obtained at any step of the diffusion process, with a mixture weight $\pi_t$ for each step $t$. The mixture components $q(y \mid x, t)$ are Gaussian distributions with mean proportional to $x$ and variance depending on the noise level at step $t$. We use the same diffusion process and mixture distribution for both the real samples $x \sim p(x)$ and the generated samples $x_g \sim p_g(x)$. More specifically, the diffusion-induced mixture distributions are expressed as

$$x \sim p(x),\ y \sim q(y \mid x),\quad q(y \mid x) := \sum_{t=1}^{T} \pi_t\, q(y \mid x, t), \qquad x_g \sim p_g(x),\ y_g \sim q(y_g \mid x_g),\quad q(y_g \mid x_g) := \sum_{t=1}^{T} \pi_t\, q(y_g \mid x_g, t),$$

where $q(y \mid x)$ is a $T$-component mixture distribution, the mixture weights $\pi_t$ are non-negative and sum to one, and the mixture components $q(y \mid x, t)$ are obtained via diffusion as in Equation (1), expressed as

$$q(y \mid x, t) = \mathcal{N}(y; \sqrt{\bar{\alpha}_t}\, x, (1-\bar{\alpha}_t)\sigma^2 I). \tag{2}$$

Samples from this mixture can be drawn as $t \sim p_\pi := \mathrm{Discrete}(\pi_1, \ldots, \pi_T),\ y \sim q(y \mid x, t)$. By sampling $y$ from this mixture distribution, we obtain noisy versions of both real and generated samples with varying degrees of noise. The more steps we take in the diffusion process, the more noise we add to $y$ and the less information we preserve from $x$. We can then use this diffusion-induced mixture distribution to train a timestep-dependent discriminator $D$ that distinguishes between real and generated noisy samples, and a generator $G$ that matches the distribution of generated noisy samples to the distribution of real noisy samples. Next we introduce Diffusion-GAN, which trains its discriminator and generator with the help of the diffusion-induced mixture distribution.

### 3.2 ADVERSARIAL TRAINING

Diffusion-GAN trains its generator and discriminator by solving the min-max game objective

$$V(G, D) = \mathbb{E}_{x \sim p(x),\, t \sim p_\pi,\, y \sim q(y \mid x, t)}[\log(D_\phi(y, t))] + \mathbb{E}_{z \sim p(z),\, t \sim p_\pi,\, y_g \sim q(y \mid G_\theta(z), t)}[\log(1 - D_\phi(y_g, t))]. \tag{3}$$

Here, $p(x)$ is the true data distribution, $p_\pi$ is a discrete distribution that assigns a weight $\pi_t$ to each diffusion step $t \in \{1, \ldots, T\}$, and $q(y \mid x, t)$ is the conditional distribution of the perturbed sample $y$ given the original data $x$ and the diffusion step $t$. By Equation (2), with Gaussian reparameterization, the perturbation function can be written as $y = \sqrt{\bar{\alpha}_t}\, x + \sqrt{1-\bar{\alpha}_t}\, \sigma \epsilon$, where $1 - \bar{\alpha}_t = 1 - \prod_{s=1}^{t} \alpha_s$ is the cumulative noise level at step $t$, $\sigma$ is a scale factor, and $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise. The objective function in Equation (3) encourages the discriminator to assign high probabilities to the perturbed real data and low probabilities to the perturbed generated data, for any diffusion step $t$. The generator, on the other hand, tries to produce samples that can deceive the discriminator at any diffusion step $t$. Note that the perturbed generated sample $y_g \sim q(y \mid G_\theta(z), t)$ can be rewritten as $y_g = \sqrt{\bar{\alpha}_t}\, G_\theta(z) + \sqrt{1-\bar{\alpha}_t}\, \sigma \epsilon,\ \epsilon \sim \mathcal{N}(0, I)$. This means that the objective function in Equation (3) is differentiable with respect to the generator parameters, and we can use gradient descent to optimize it with back-propagation.
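As a concrete, hedged illustration of Equation (3), the PyTorch-style sketch below applies the reparameterized perturbation $y = \sqrt{\bar{\alpha}_t}\,x + \sqrt{1-\bar{\alpha}_t}\,\sigma\epsilon$ to real and generated batches and evaluates the resulting discriminator loss together with the commonly used non-saturating generator loss (Equation (3) itself uses the saturating $\log(1-D)$ form). `D` and `G` are assumed to be timestep-conditioned PyTorch modules, and `alpha_bars` is assumed to be a 1-D torch tensor of cumulative products $\bar{\alpha}_t$ (e.g., `torch.from_numpy(make_alpha_bars()).float()` from the earlier sketch).

```python
import torch
import torch.nn.functional as F

def diffuse_batch(x, t, alpha_bars, sigma=1.0):
    """Eq. (2) via reparameterization: y = sqrt(a_bar_t) * x + sqrt(1 - a_bar_t) * sigma * eps.

    The transform is differentiable in x, so generator gradients flow through it.
    """
    a_bar = alpha_bars[t].view(-1, *([1] * (x.dim() - 1)))  # per-sample a_bar_t, broadcast to x's shape
    eps = torch.randn_like(x)
    return a_bar.sqrt() * x + (1.0 - a_bar).sqrt() * sigma * eps

def diffusion_gan_losses(D, G, x_real, z, t_probs, alpha_bars, sigma=1.0):
    """Discriminator and (non-saturating) generator losses for the game in Eq. (3).

    D(y, t) returns a real/fake logit per sample; t_probs holds the mixture weights pi_t.
    """
    t = torch.multinomial(t_probs, num_samples=x_real.size(0), replacement=True)  # t ~ p_pi (0-indexed)
    y_real = diffuse_batch(x_real, t, alpha_bars, sigma)
    y_fake = diffuse_batch(G(z), t, alpha_bars, sigma)

    logits_real = D(y_real, t)
    logits_fake = D(y_fake.detach(), t)                     # detach: the D step does not update G
    d_loss = F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real)) + \
             F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake))

    logits_gen = D(y_fake, t)                               # no detach: G learns through the diffusion
    g_loss = F.binary_cross_entropy_with_logits(logits_gen, torch.ones_like(logits_gen))
    return d_loss, g_loss
```

Backpropagating `g_loss` sends gradients through `diffuse_batch` into `G`, which is exactly the "learning through the forward diffusion chain" described above.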
The objective function in Equation (3) is similar to the one used by the original GAN (Goodfellow et al., 2014), except that it involves the diffusion steps and the perturbation functions. We can show that this objective function also minimizes an approximation of the Jensen-Shannon (JS) divergence between the true and the generated distributions, but with respect to the perturbed samples and the diffusion steps, as follows:

$$D_{JS}(p(y, t)\,\|\,p_g(y, t)) = \mathbb{E}_{t \sim p_\pi}\big[D_{JS}(p(y \mid t)\,\|\,p_g(y \mid t))\big]. \tag{4}$$

The JS divergence measures the dissimilarity between two probability distributions, and it reaches its minimum value of zero when the two distributions are identical. The proof of the equality in Equation (4) is given in Appendix C. A natural question that arises from this result is whether minimizing the JS divergence between the perturbed distributions implies minimizing the JS divergence between the original distributions, i.e., whether the optimal generator for Equation (3) is also the optimal generator for $D_{JS}(p(x)\,\|\,p_g(x))$. We will answer this question affirmatively and provide a theoretical justification in Section 3.4.

### 3.3 ADAPTIVE DIFFUSION

With the help of the perturbation function and timestep dependency, we have a new strategy to optimize the discriminator. We want the discriminator $D$ to face a challenging task, neither so easy that it overfits the data (Karras et al., 2020a; Zhao et al., 2020) nor so hard that learning is impeded. Therefore, we adjust the intensity of the diffusion process, which adds noise to both $y$ and $y_g$, depending on how well $D$ can distinguish them. When the diffusion step $t$ is larger, the noise-to-data ratio is higher and the task is harder. We use $1 - \bar{\alpha}_t$ to measure the intensity of the diffusion, which increases as $t$ grows. To control the diffusion intensity, we adaptively modify the maximum number of steps $T$.

Our strategy is to make the discriminator learn from the easiest samples first, which are the original data samples, and then gradually increase the difficulty by feeding it samples from larger $t$. To do this, we use a self-paced schedule for $T$, which depends on a metric $r_d$ that estimates how much the discriminator overfits to the data:

$$r_d = \mathbb{E}_{y, t \sim p(y, t)}\big[\mathrm{sign}(D_\phi(y, t) - 0.5)\big], \qquad T = T + \mathrm{sign}(r_d - d_{target}) \cdot C, \tag{5}$$

where $r_d$ is the same as in Karras et al. (2020a) and $C$ is a constant. We calculate $r_d$ and update $T$ every four minibatches. We have two options for the distribution $p_\pi$ that we use to sample $t$ for the diffusion process:

$$p_\pi = \begin{cases} \text{uniform:} & \mathrm{Discrete}\!\left(\dfrac{1}{T}, \dfrac{1}{T}, \ldots, \dfrac{1}{T}\right), \\[6pt] \text{priority:} & \mathrm{Discrete}\!\left(\dfrac{1}{\sum_{t=1}^{T} t}, \dfrac{2}{\sum_{t=1}^{T} t}, \ldots, \dfrac{T}{\sum_{t=1}^{T} t}\right). \end{cases}$$

The priority option gives more weight to larger $t$, which means the discriminator will see more new samples from the new steps when $T$ increases. This is because we want the discriminator to focus on the new and harder samples that it has not seen before, as this indicates that it is confident about the easier ones. Note that even with the priority option, the discriminator can still see samples from smaller $t$, because $q(y \mid x)$ is a mixture of Gaussians that covers all steps of the diffusion chain.

To avoid sudden changes in $T$ during training, we use an exploration list $t_{epl}$ that contains $t$ values sampled from $p_\pi$. We keep $t_{epl}$ fixed until we update $T$, and we sample $t$ from $t_{epl}$ to generate noisy samples for the discriminator. This way, the model can explore each $t$ sufficiently before moving to a higher $T$. We give the details of training Diffusion-GAN in Algorithm 1 in Appendix F.
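The sketch below shows one plausible way to implement the adaptive schedule of Equation (5) and the priority sampling option; the update cadence (every four minibatches) and the use of $r_d$ follow the text, while the values of `C`, `D_TARGET`, and the clipping bounds `T_MIN`/`T_MAX` are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

T_MIN, T_MAX = 8, 1000      # assumed clipping bounds for the adaptive maximum timestep T
C = 4                       # assumed value of the constant C in Eq. (5)
D_TARGET = 0.6              # assumed overfitting target d_target

def update_T(T, d_outputs_real, d_target=D_TARGET, C=C):
    """Eq. (5): r_d estimates discriminator overfitting from its outputs on diffused real samples."""
    r_d = np.mean(np.sign(np.asarray(d_outputs_real) - 0.5))
    T = T + int(np.sign(r_d - d_target)) * C
    return int(np.clip(T, T_MIN, T_MAX))

def priority_weights(T):
    """'priority' option of p_pi: pi_t proportional to t, so larger timesteps are sampled more often."""
    t = np.arange(1, T + 1)
    return t / t.sum()

def sample_exploration_list(T, size=64, rng=np.random.default_rng(0)):
    """Exploration list t_epl: timesteps drawn from p_pi and kept fixed until the next T update."""
    return rng.choice(np.arange(1, T + 1), size=size, p=priority_weights(T))
```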
### 3.4 THEORETICAL ANALYSIS WITH EXAMPLES

To better understand the theoretical properties of our proposed method, we present two theorems that address two important questions about the use of diffusion-based instance noise injection for training GANs. The proofs of these theorems are deferred to Appendix B. The first question, denoted as (a), is whether adding noise to the real and generated samples in a diffusion process can facilitate the learning. The second question, denoted as (b), is whether minimizing the JS divergence between the joint distributions of the noisy samples and the noise levels, $p(y, t)$ and $p_g(y, t)$, can lead to the same optimal generator as minimizing the JS divergence between the original distributions of the real and generated samples, $p(x)$ and $p_g(x)$.

To answer (a), we prove that for any choice of noise level $t$ and any choice of convex function $f$, the $f$-divergence (Nowozin et al., 2016) between the marginal distributions of the noisy real and generated samples, $q(y \mid t)$ and $q(y_g \mid t)$, is a smooth function that can be computed and optimized by the discriminator. This implies that the diffusion-based noise injection does not introduce any singularity or discontinuity into the objective function of the GAN. The JS divergence is a special case of the $f$-divergence, with $f(u) = u \log u - (u+1)\log\frac{u+1}{2}$.

**Theorem 1** (Valid gradients anywhere for GANs training). Let $p(x)$ be a fixed distribution over $\mathcal{X}$ and $z$ be a random noise over another space $\mathcal{Z}$. Denote $G_\theta: \mathcal{Z} \to \mathcal{X}$ as a function with parameter $\theta$ and input $z$, and $p_g(x)$ as the distribution of $G_\theta(z)$. Let $q(y \mid x, t) = \mathcal{N}(y; \sqrt{\bar{\alpha}_t}\, x, (1-\bar{\alpha}_t)\sigma^2 I)$, where $\bar{\alpha}_t \in (0, 1)$ and $\sigma > 0$. Let $q(y \mid t) = \int p(x)\, q(y \mid x, t)\, dx$ and $q_g(y \mid t) = \int p_g(x)\, q(y \mid x, t)\, dx$. Then, $\forall t$, if the function $G_\theta$ is continuous and differentiable, the $f$-divergence $D_f(q(y \mid t)\,\|\,q_g(y \mid t))$ is continuous and differentiable with respect to $\theta$.

Theorem 1 shows that with the help of diffusion noise injection by $q(y \mid x, t)$, for all $t$, $y$ and $y_g$ are defined on the same support, the whole space $\mathcal{X}$, and $D_f(q(y \mid t)\,\|\,q_g(y \mid t))$ is continuous and differentiable everywhere. A natural follow-up question is what happens if $D_f(q(y \mid t)\,\|\,q_g(y \mid t))$ keeps a near-constant value and hence provides little useful gradient. We empirically show, via the toy example below, that by injecting noise through a mixture defined over all steps of the diffusion chain, there is always a good chance that a sufficiently large $t$ is sampled to provide a useful gradient.

**Toy example.** We use the same simple example from Arjovsky et al. (2017) to illustrate our method. Let $x = (0, z)$ be the real data and $x_g = (\theta, z)$ be the data generated by a one-parameter generator, where $z$ is a uniform random variable in $[0, 1]$. The JS divergence between the real and the generated distributions, $D_{JS}(p(x)\,\|\,p(x_g))$, is discontinuous: it is $\log 2$ when $\theta \neq 0$ and zero otherwise, so it does not provide a useful gradient to guide $\theta$ towards zero. We introduce diffusion-based noise to both the real and the generated data, as shown in the first row of Figure 2. The noisy data, $y$ and $y_g$, have supports that cover the whole space $\mathbb{R}^2$ and their densities overlap more or less depending on the diffusion step $t$. In the second row, left, of Figure 2, we plot how the JS divergence between the noisy distributions, $D_{JS}(q(y \mid t)\,\|\,q_g(y \mid t))$, varies with $\theta$ for different $t$ values. The black line with $t = 0$ is the original JS divergence, which has a discontinuity at $\theta = 0$. As $t$ increases, the JS divergence curves become smoother and have nonzero gradients for a larger range of $\theta$.
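To see these curves numerically, note that (as derived in Appendix D) the diffused marginals in this toy example reduce, after factoring out the shared second coordinate, to a pair of one-dimensional Gaussians $\mathcal{N}(0, b_t)$ and $\mathcal{N}(a_t\theta, b_t)$ with $a_t = \sqrt{\bar{\alpha}_t}$ and $b_t = (1-\bar{\alpha}_t)\sigma^2$. The sketch below estimates their JS divergence by Monte Carlo; the linear $\beta_t$ schedule and the particular $t$ and $\theta$ values are illustrative assumptions.

```python
import numpy as np

def jsd_gaussians_mc(mu1, mu2, var, n=200_000, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of D_JS( N(mu1, var) || N(mu2, var) )."""
    def logpdf(x, mu):
        return -0.5 * ((x - mu) ** 2 / var + np.log(2 * np.pi * var))
    x1 = rng.normal(mu1, np.sqrt(var), n)
    x2 = rng.normal(mu2, np.sqrt(var), n)
    def kl_to_mixture(x, mu_self):
        log_p = logpdf(x, mu_self)
        log_m = np.logaddexp(logpdf(x, mu1), logpdf(x, mu2)) - np.log(2)
        return np.mean(log_p - log_m)
    return 0.5 * (kl_to_mixture(x1, mu1) + kl_to_mixture(x2, mu2))

# The diffused toy example reduces to N(0, b_t) versus N(a_t * theta, b_t).
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))   # assumed linear beta schedule
sigma = 1.0
for t in (200, 400, 800):
    a_t, b_t = np.sqrt(alpha_bars[t - 1]), (1.0 - alpha_bars[t - 1]) * sigma ** 2
    for theta in (0.5, 2.0):
        print(f"t={t:4d}  theta={theta:3.1f}  JSD~{jsd_gaussians_mc(0.0, a_t * theta, b_t):.4f}")
```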
However, some values of $t$, such as $t = 200$ in this example, still have flat regions where the JS divergence is nearly constant. To avoid this, we use a mixture over all steps to ensure that there is always a high chance of obtaining informative gradients. For the discriminator optimization, as shown in the second row, right, of Figure 2, the optimal discriminator under the original JS divergence is discontinuous and unattainable. With diffusion-based noise, the optimal discriminator changes with $t$: a smaller $t$ makes it more confident and a larger $t$ makes it more cautious. Thus the diffusion acts like a scale to balance the power of the discriminator. This suggests the use of a differentiable forward diffusion chain that can provide various levels of gradient smoothness to help the generator training.

Figure 2: The toy example inherited from Arjovsky et al. (2017). The first row plots the distributions of data with diffusion noise injected for different $t$. The second row shows the JS divergence and the optimal discriminator value with and without our noise injection.

**Theorem 2** (Non-leaking noise injection). Let $x \sim p(x),\ y \sim q(y \mid x)$ and $x_g \sim p_g(x),\ y_g \sim q(y_g \mid x_g)$, where $q(y \mid x)$ is the transition density. Given a certain $q(y \mid x)$, if $y$ can be reparameterized into $y = f(x) + h(\epsilon),\ \epsilon \sim p(\epsilon)$, where $p(\epsilon)$ is a known distribution and both $f$ and $h$ are one-to-one mapping functions, then we have $p(y) = p_g(y) \Leftrightarrow p(x) = p_g(x)$.

To answer question (b), we present Theorem 2, which gives a sufficient condition for the equality of the original and the augmented data distributions. In Theorem 2, the function $f$ maps each $x$ to a unique $y$, the function $h$ maps each $\epsilon$ to a unique noise term, and the distribution of $\epsilon$ is known and independent of $x$. Under these assumptions, the theorem proves that the distribution of $y$ is the same as the distribution of $y_g$ if and only if the distribution of $x$ is the same as the distribution of $x_g$. If we take $y \mid t$ as the $y$ introduced in the theorem, then for each $t$, Equation (2) satisfies the assumptions made. This means that by minimizing the divergence between $q(y \mid t)$ and $q_g(y \mid t)$, which by Theorem 2 is equivalent to matching $p(x)$ and $p_g(x)$ at each $t$, we are also minimizing the divergence between $p(x)$ and $p_g(x)$. This implies that the noise injection does not affect the quality of the generated samples, and we can safely use our noise injection to improve the training of the generative model.

### 3.5 RELATED WORK

The proposed Diffusion-GAN can be related to previous works on stabilizing GAN training, building diffusion-based generative models, and constructing differentiable augmentation for data-efficient GAN training. A detailed discussion of these related works is deferred to Appendix A.

## 4 EXPERIMENTS

We conduct extensive experiments to answer the following questions: (a) Will Diffusion-GAN outperform state-of-the-art GAN baselines on benchmark datasets? (b) Will the diffusion-based noise injection help the learning of GANs in domain-agnostic tasks? (c) Will our method improve the performance of data-efficient GANs trained with a very limited amount of data?
**Datasets.** We conduct experiments on image datasets ranging from low-resolution (e.g., 32×32) to high-resolution (e.g., 1024×1024) and from low-diversity to high-diversity: CIFAR-10 (Krizhevsky, 2009), STL-10 (Coates et al., 2011), LSUN-Bedroom (Yu et al., 2015), LSUN-Church (Yu et al., 2015), AFHQ (Cat/Dog/Wild) (Choi et al., 2020), and FFHQ (Karras et al., 2019). More details on these benchmark datasets are provided in Appendix E.

**Evaluation protocol.** We measure image quality using FID (Heusel et al., 2017). Following Karras et al. (2019; 2020b), we measure FID using 50k generated samples, with the full training set used as reference. We use the number of real images shown to the discriminator to evaluate convergence (Karras et al., 2020a; Sauer et al., 2021). Unless specified otherwise, all models are trained with 25 million images to ensure convergence (those trained with more or fewer images are specified in the table captions). We further report the improved Recall score introduced by Kynkäänniemi et al. (2019) to measure the sample diversity of generative models.

Table 1: Image generation results on benchmark datasets: CIFAR-10, CelebA, STL-10, LSUN-Bedroom, LSUN-Church, and FFHQ. We highlight the best and second best results in each column with bold and underline, respectively. Lower FIDs indicate better fidelity, while higher Recalls indicate better diversity.

| Model | CIFAR-10 (32×32) FID | Recall | CelebA (64×64) FID | Recall | STL-10 (64×64) FID | Recall | LSUN-Bedroom (256×256) FID | Recall | LSUN-Church (256×256) FID | Recall | FFHQ (1024×1024) FID | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| StyleGAN2 (Karras et al., 2020a) | 8.32 | 0.41 | 2.32 | 0.55 | 11.70 | 0.44 | 3.98 | 0.32 | 3.93 | 0.39 | 4.41 | 0.42 |
| StyleGAN2 + DiffAug (Zhao et al., 2020) | 5.79 | 0.42 | 2.75 | 0.52 | 12.97 | 0.39 | 4.25 | 0.19 | 4.66 | 0.33 | 4.46 | 0.41 |
| StyleGAN2 + ADA (Karras et al., 2020a) | 2.92 | 0.49 | 2.49 | 0.53 | 13.72 | 0.36 | 7.89 | 0.05 | 4.12 | 0.18 | 4.47 | 0.41 |
| Diffusion StyleGAN2 | 3.19 | 0.58 | 1.69 | 0.67 | 11.43 | 0.45 | 3.65 | 0.32 | 3.17 | 0.42 | 2.83 | 0.49 |

Figure 3: Randomly generated images from Diffusion StyleGAN2 trained on the CIFAR-10, CelebA, STL-10, LSUN-Bedroom, LSUN-Church, and FFHQ datasets.

**Implementations and resources.** We build Diffusion-GANs on the code of StyleGAN2 (Karras et al., 2020b), Projected GAN (Sauer et al., 2021), and InsGen (Yang et al., 2021) to answer questions (a), (b), and (c), respectively. Diffusion-GANs inherit from their corresponding base GANs all their network architectures and training hyperparameters, whose details are provided in Appendix G. Specifically for StyleGAN2 and InsGen, we construct the discriminator as $D_\phi(y, t)$, where $t$ is injected via their mapping network. For Projected GAN, we empirically find that $t$ in the discriminator can be ignored to simplify the implementation and minimize the modifications to Projected GAN. More implementation details are provided in Appendix H. With our diffusion-based noise injection applied, we denote our models as Diffusion StyleGAN2, Diffusion ProjectedGAN, and Diffusion InsGen. In the following experiments, we train related models with their official code if the results are unavailable; all other results are reported from the cited references and marked as such. We run all our experiments with either 4 or 8 NVIDIA V100 GPUs, depending on the demands of the inherited training configurations.
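The paper injects $t$ into the StyleGAN2 and InsGen discriminators through their mapping networks; as a generic, hedged illustration of timestep conditioning (not the authors' architecture), the toy module below adds a learned embedding of $t$ to an intermediate feature of a small discriminator.

```python
import torch
import torch.nn as nn

class TimestepDiscriminator(nn.Module):
    """Toy D(y, t): a feature extractor whose hidden features are modulated by an embedding of t.

    This is an illustrative stand-in, not the StyleGAN2 discriminator used in the paper.
    """
    def __init__(self, in_dim, hidden_dim=256, max_T=1000):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.LeakyReLU(0.2))
        self.t_embed = nn.Embedding(max_T + 1, hidden_dim)   # one embedding per diffusion timestep
        self.head = nn.Sequential(nn.LeakyReLU(0.2), nn.Linear(hidden_dim, 1))

    def forward(self, y, t):
        h = self.features(y.flatten(1))                      # flatten images to vectors for this toy model
        h = h + self.t_embed(t)                              # timestep conditioning
        return self.head(h).squeeze(1)                       # one real/fake logit per sample

# Example: score a batch of diffused 32x32 RGB images at random timesteps.
D = TimestepDiscriminator(in_dim=3 * 32 * 32)
logits = D(torch.randn(8, 3, 32, 32), torch.randint(0, 1000, (8,)))
```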
### 4.1 COMPARISON TO STATE-OF-THE-ART GANS

We compare Diffusion-GAN with its state-of-the-art GAN backbone, StyleGAN2 (Karras et al., 2020a), and, to evaluate its effectiveness from the data-augmentation perspective, with both StyleGAN2 + DiffAug (Zhao et al., 2020) and StyleGAN2 + ADA (Karras et al., 2020a), in terms of both sample fidelity (FID) and sample diversity (Recall) over extensive benchmark datasets. We present the quantitative and qualitative results in Table 1 and Figure 3. Qualitatively, the generated images from Diffusion StyleGAN2 are all photo-realistic and have good diversity, ranging from low-resolution (32×32) to high-resolution (1024×1024). Additional randomly generated images can be found in Appendix L. Quantitatively, Diffusion StyleGAN2 outperforms all the GAN baselines in generation diversity, as measured by Recall, on all 6 benchmark datasets, and outperforms them in FID by a clear margin on 5 out of the 6 benchmark datasets.

Figure 4: Plot of the adaptively adjusted maximum diffusion step $T$ (left: $T$ schedule of Diffusion StyleGAN2 and Diffusion ProjectedGAN on CIFAR-10 and STL-10) and the discriminator outputs of Diffusion-GANs (right: discriminator outputs on CIFAR-10 for real and generated images).

From the data-augmentation perspective, we observe that Diffusion StyleGAN2 always clearly outperforms the backbone model StyleGAN2 across various datasets, which empirically validates our Theorem 2. By contrast, both the ADA (Karras et al., 2020a) and DiffAug (Zhao et al., 2020) techniques can sometimes impair the generation performance on sufficiently large datasets, e.g., LSUN-Bedroom and LSUN-Church, which is also observed by Yang et al. (2021) on FFHQ. This is possibly because their risk of leaking augmentation overshadows the benefits of data augmentation.

To investigate how the adaptive diffusion process works during training, we illustrate in Figure 4 the convergence of the maximum timestep $T$ in our adaptive diffusion and the discriminator outputs. We see that $T$ is adaptively adjusted: the $T$ of Diffusion StyleGAN2 increases as training progresses, while the $T$ of Diffusion ProjectedGAN first goes up and then goes down. Note that $T$ is adjusted according to the overfitting status of the discriminator. The second panel shows that, trained with the diffusion-based mixture distribution, the discriminator is always well behaved and provides useful learning signals for the generator, which validates our analysis in Section 3.4 and Theorem 1.

**Memory and time costs.** Generally speaking, the memory and time costs of a Diffusion-GAN are comparable to those of the corresponding GAN baseline. More specifically, switching from ADA (Karras et al., 2020a) to our diffusion-based augmentation, the added memory cost is negative, the added training time cost is negative, and the added inference time cost is zero. For example, for CIFAR-10, with four NVIDIA V100 GPUs, the training time for every 4k images is around 8.0s for StyleGAN2, 9.8s for StyleGAN2-ADA, and 9.5s for Diffusion StyleGAN2.

### 4.2 EFFECTIVENESS OF DIFFUSION-GAN FOR DOMAIN-AGNOSTIC AUGMENTATION

To verify whether our method is domain-agnostic, we apply Diffusion-GAN to the input feature vectors of GANs.
We conduct experiments on both low-dimensional and high-dimensional feature vectors, for which commonly used image augmentation methods are no longer applicable.

**25-Gaussians example.** We conduct experiments on the popular 25-Gaussians generation task. The 25-Gaussians dataset is a 2-D toy dataset generated by a mixture of 25 two-dimensional Gaussian distributions; each data point is a 2-dimensional feature vector. We train a small GAN model whose generator and discriminator are both parameterized by multilayer perceptrons (MLPs), with two 128-unit hidden layers and LeakyReLU nonlinearities. The training results are shown in Figure 5. We observe that the vanilla GAN exhibits severe mode collapse, capturing only a few modes. Its discriminator outputs for real and fake samples depart from each other very quickly, which implies strong overfitting of the discriminator, so that it stops providing useful learning signals for the generator. By contrast, Diffusion-GAN successfully captures all 25 Gaussian modes, and its discriminator is kept under control so that it continuously provides useful learning signals. We interpret the improvement from two perspectives: first, non-leaking augmentation provides more information about the data space; second, the discriminator is well behaved given the adaptively adjusted diffusion-based noise injection.

Figure 5: The 25-Gaussians example. We show the true data samples, the generated samples from the vanilla GAN, the discriminator outputs of the vanilla GAN, the generated samples from our Diffusion-GAN, and the discriminator outputs of Diffusion-GAN.

**Projected GAN.** To verify that our adaptive diffusion-based noise injection can benefit the learning of GANs on high-dimensional feature vectors, we directly apply it to the discriminator feature space of Projected GAN (Sauer et al., 2021). Projected GANs generally leverage pre-trained neural networks to extract meaningful features for the adversarial learning of the discriminator and generator. Following Sauer et al. (2021), we adaptively diffuse the feature vectors extracted by EfficientNet-v0 and keep all the other training parts unchanged. We report the performance of Diffusion ProjectedGAN on several benchmark datasets in Table 2, which verifies that our augmentation method is domain-agnostic. Under the Projected GAN framework, we see that with noise properly injected into the high-dimensional feature space, Diffusion ProjectedGAN shows clear improvements in terms of both FID and Recall. We reach state-of-the-art FID results with Diffusion ProjectedGAN on the STL-10, LSUN-Bedroom, and LSUN-Church datasets.

Table 2: Domain-agnostic experiments on Projected GAN.

| Model | CIFAR-10 (32×32) FID | Recall | STL-10 (64×64) FID | Recall | LSUN-Bedroom (256×256) FID | Recall | LSUN-Church (256×256) FID | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Projected GAN (Sauer et al., 2021) | 3.10 | 0.45 | 7.76 | 0.35 | 2.25 | 0.55 | 3.42 | 0.56 |
| Diffusion ProjectedGAN | 2.54 | 0.45 | 6.91 | 0.35 | 1.43 | 0.58 | 1.85 | 0.65 |

### 4.3 EFFECTIVENESS OF DIFFUSION-GAN FOR LIMITED DATA

We evaluate whether Diffusion-GAN can provide data-efficient GAN training.
We first generate five FFHQ (1024×1024) dataset splits, consisting of 200, 500, 1k, 2k, and 5k images, respectively, where the 200- and 500-image splits are considered extremely limited data cases. We also consider AFHQ-Cat, -Dog, and -Wild (512×512), each with as few as around 5k images. Motivated by the success of InsGen (Yang et al., 2021) on small datasets, we build our Diffusion-GAN upon it. We note that on limited data, InsGen convincingly outperforms both StyleGAN2 + ADA and StyleGAN2 + DiffAug, and currently holds the state-of-the-art performance for data-efficient GAN training. The results in Table 3 show that our Diffusion-GAN method can further boost the performance of InsGen in limited data settings.

Table 3: FFHQ (1024×1024) FID results with 200, 500, 1k, 2k, and 5k training samples, and AFHQ (512×512) FID results. To ensure convergence, all models are trained across 10M images for FFHQ and 25M images for AFHQ. We bold the best number in each column.

| Model | FFHQ (200) | FFHQ (500) | FFHQ (1k) | FFHQ (2k) | FFHQ (5k) | Cat | Dog | Wild |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InsGen (Yang et al., 2021) | 102.58 | 54.762 | 34.90 | 18.21 | 9.89 | 2.60 | 5.44 | 1.77 |
| Diffusion InsGen | 63.34 | 50.39 | 30.91 | 16.43 | 8.48 | 2.40 | 4.83 | 1.51 |

## 5 CONCLUSION

We present Diffusion-GAN, a novel GAN framework that uses a variable-length forward diffusion chain with a Gaussian mixture distribution to generate instance noise for GAN training. This approach enables model- and domain-agnostic differentiable augmentation that leverages the advantages of diffusion without requiring a costly reverse diffusion chain. We prove theoretically and demonstrate empirically that Diffusion-GAN can prevent discriminator overfitting and provide non-leaking augmentation. We also demonstrate that Diffusion-GAN can produce high-resolution photo-realistic images with high fidelity and diversity, outperforming its corresponding state-of-the-art GAN baselines on standard benchmark datasets according to both FID and Recall.

## ACKNOWLEDGEMENTS

Z. Wang, H. Zheng, and M. Zhou acknowledge the support of NSF-IIS 2212418 and IFML.

## REFERENCES

Martín Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=Hk4_qw5xe.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 214-223, 2017.

Marc G Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The Cramer distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743, 2017.

David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.

Ashish Bora, Eric Price, and Alexandros G. Dimakis. AmbientGAN: Generative models from lossy measurements. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hy7fDog0b.

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image synthesis for multiple domains.
In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8185-8194, 2020.

Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215-223. JMLR Workshop and Conference Proceedings, 2011.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.

Ishan Deshpande, Ziyu Zhang, and Alexander G Schwing. Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3483-3491, 2018.

Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=AAWuCvzaVt.

Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.

Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M Dai, Shakir Mohamed, and Ian Goodfellow. Many paths to equilibrium: GANs do not need to decrease a divergence at every step. In International Conference on Learning Representations, 2018.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767-5777, 2017.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626-6637, 2017.

Jonathan Ho, Ajay Jain, and P. Abbeel. Denoising diffusion probabilistic models. arXiv, abs/2006.11239, 2020a.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020b.

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401-4410, 2019.

Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems, 33:12104-12114, 2020a.

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110-8119, 2020b.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2014.

Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models.
arXiv preprint arXiv:2106.00132, 2021.

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. Advances in Neural Information Processing Systems, 32, 2019.

Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. Advances in Neural Information Processing Systems, 30, 2017a.

Chunyuan Li, Hao Liu, Changyou Chen, Yuchen Pu, Liqun Chen, Ricardo Henao, and Lawrence Carin. ALICE: Towards understanding adversarial learning for joint distribution matching. Advances in Neural Information Processing Systems, 30, 2017b.

Bingchen Liu, Yizhe Zhu, Kunpeng Song, and Ahmed Elgammal. Towards faster and stabilized GAN training for high-fidelity few-shot image synthesis. In International Conference on Learning Representations, 2020.

Eric Luhman and Troy Luhman. Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388, 2021.

Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of GANs. Advances in Neural Information Processing Systems, 30, 2017.

Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning, pages 3481-3490. PMLR, 2018.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018a. URL https://openreview.net/forum?id=B1QRgziT-.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018b.

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.

Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, 2016.

Kushagra Pandey, Avideep Mukherjee, Piyush Rai, and Abhishek Kumar. DiffuseVAE: Efficient, controllable and high-fidelity generation from low-dimensional latents. arXiv preprint arXiv:2201.00308, 2022.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2016.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.

Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. Advances in Neural Information Processing Systems, 30, 2017.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234-2242, 2016.
Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise estimation for generative diffusion models. arXiv preprint arXiv:2104.02600, 2021.

Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected GANs converge faster. Advances in Neural Information Processing Systems, 34, 2021.

Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv, abs/1503.03585, 2015.

Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=S1RP6GLle.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=St1giarCHLP.

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pages 11918-11930, 2019.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum?id=PxTIG12RRHS.

Ngoc-Trung Tran, Viet-Hung Tran, Ngoc-Bao Nguyen, Trung-Kien Nguyen, and Ngai-Man Cheung. On data augmentation for GAN training. IEEE Transactions on Image Processing, 30:1882-1897, 2021.

Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8798-8807, 2018.

Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. arXiv preprint arXiv:2112.07804, 2021.

Ceyuan Yang, Yujun Shen, Yinghao Xu, and Bolei Zhou. Data-efficient instance generation from instance discrimination. Advances in Neural Information Processing Systems, 34:9378-9390, 2021.

Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907-5915, 2017.

Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In International Conference on Machine Learning, pages 7354-7363. PMLR, 2019.

Han Zhang, Zizhao Zhang, Augustus Odena, and Honglak Lee. Consistency regularization for generative adversarial networks. In International Conference on Learning Representations, 2020a. URL https://openreview.net/forum?id=S1lxKlSKPH.

Hao Zhang, Bo Chen, Long Tian, Zhengjue Wang, and Mingyuan Zhou. Variational hetero-encoder randomized GANs for joint image-text modeling.
In International Conference on Learning Representations, 2020b. URL https://openreview.net/forum?id=H1x5wRVtvS.

Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, and Song Han. Differentiable augmentation for data-efficient GAN training. Advances in Neural Information Processing Systems, 33:7559-7570, 2020.

Huangjie Zheng and Mingyuan Zhou. Exploiting chain rule and Bayes' theorem to compare probability distributions. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=f-ggKIDTu5D.

Huangjie Zheng, Pengcheng He, Weizhu Chen, and Mingyuan Zhou. Truncated diffusion probabilistic models. arXiv preprint arXiv:2202.09671, 2022.

## A RELATED WORK

**Stabilizing GAN training.** A root cause of training difficulties in GANs is often attributed to the JS divergence that GANs intend to minimize. This is because when the data and generator distributions have non-overlapping supports, which is often the case for high-dimensional data supported on low-dimensional manifolds, the gradient of the JS divergence may provide no useful guidance to optimize the generator (Arjovsky and Bottou, 2017; Arjovsky et al., 2017; Mescheder et al., 2018; Roth et al., 2017). For this reason, Arjovsky et al. (2017) propose to instead use the Wasserstein-1 distance, which in theory can provide a useful gradient for the generator even if the two distributions have disjoint supports. However, Wasserstein GANs often require the use of a critic function under a 1-Lipschitz constraint, which is difficult to satisfy in practice and hence is realized with heuristics such as weight clipping (Arjovsky et al., 2017), gradient penalty (Gulrajani et al., 2017), and spectral normalization (Miyato et al., 2018a). While the divergence minimization perspective has played an important role in motivating the construction of Wasserstein GANs and gradient penalty-based regularizations, caution should be exercised in relying purely on it to understand GAN training, due not only to the discrepancy between the divergence in theory and the actual min-max objective function used in practice, but also to the potential confounding between different divergences and different training and regularization strategies (Fedus et al., 2018; Mescheder et al., 2018). E.g., Mescheder et al. (2018) provide a simple example where in theory the Wasserstein GAN is predicted to succeed while the vanilla GAN is predicted to fail, but in practice the Wasserstein GAN with a finite number of discriminator updates per generator update fails to converge while the vanilla GAN with the non-saturating loss can slowly converge. Fedus et al. (2018) provide a rich set of empirical evidence to discourage viewing GANs purely from the perspective of minimizing a specific divergence at each training step, and emphasize the important role played by gradient penalties in stabilizing GAN training.

**Diffusion models.** Due to the use of a forward diffusion chain, the proposed Diffusion-GAN can be related to diffusion-based (or score-based) deep generative models (Ho et al., 2020b; Sohl-Dickstein et al., 2015; Song and Ermon, 2019) that employ both a forward (inference) and a reverse (generative) diffusion chain.
These diffusion-based generative models are stable to train and can generate high-fidelity photo-realistic images (Dhariwal and Nichol, 2021; Ho et al., 2020b; Nichol et al., 2021; Ramesh et al., 2022; Song and Ermon, 2019; Song et al., 2021b). However, they are notoriously slow in generation due to the need to traverse the reverse diffusion chain, which involves going through the same U-Net-based generator network hundreds or even thousands of times (Song et al., 2021a). For this reason, a variety of methods have been proposed to reduce the generation cost of diffusion-based generative models (Kong and Ping, 2021; Luhman and Luhman, 2021; Pandey et al., 2022; San-Roman et al., 2021; Song et al., 2021a; Xiao et al., 2021; Zheng et al., 2022). A key distinction is that Diffusion-GAN needs a reverse diffusion chain during neither training nor generation. More specifically, its generator maps the noise to a generated sample in a single step. Diffusion-GAN can train and generate as quickly as a vanilla GAN with the same generator size. For example, it takes around 20 hours to sample 50k images of size 32×32 from a DDPM (Ho et al., 2020b) on an NVIDIA 2080 Ti GPU, but less than a minute to do so from Diffusion-GAN.

**Differentiable augmentation.** As Diffusion-GAN transforms both the data and generated samples before sending them to the discriminator, we can also relate it to differentiable augmentation (Karras et al., 2020a; Zhao et al., 2020) proposed for data-efficient GAN training. Karras et al. (2020a) introduce a stochastic augmentation pipeline with 18 transformations and develop an adaptive mechanism for controlling the augmentation probability. Zhao et al. (2020) propose to use Color + Translation + Cutout as differentiable augmentations for both generated and real images. While providing good empirical results on some datasets, these augmentation methods are developed with domain-specific knowledge and carry the risk of leaking augmentation into generation (Karras et al., 2020a). As observed in our experiments, they sometimes worsen the results when applied to a new dataset, likely because the risk of augmentation leakage overpowers the benefits of enlarging the training set, which can happen especially if the training set is already sufficiently large. By contrast, Diffusion-GAN uses a differentiable forward diffusion process to stochastically transform the data and can be considered as both a domain-agnostic and a model-agnostic augmentation method. In other words, Diffusion-GAN can be applied to non-image data or even latent features, for which appropriate data augmentation is difficult to define, and can be easily plugged into an existing GAN to improve its generation performance. Moreover, we prove in theory and show in experiments that augmentation leakage is not a concern for Diffusion-GAN. Tran et al. (2021) provide a theoretical analysis for deterministic non-leaking transformations with differentiable and invertible mapping functions. Bora et al. (2018) show theorems similar to ours for specific stochastic transformations, such as Gaussian Projection, Convolve+Noise, and stochastic Block-Pixels, while our Theorem 2 covers more possibilities, as discussed in Appendix B.

## B PROOFS

**Proof of Theorem 1.** For simplicity, let $x \sim P_r$, $x_g \sim P_g$, $y \sim P_{r,t}$, $y_g \sim P_{g,t}$, $a_t = \sqrt{\bar{\alpha}_t}$, and $b_t = (1-\bar{\alpha}_t)\sigma^2$.
Then,

$$p_{r,t}(y) = \int_{\mathcal{X}} p_r(x)\, \mathcal{N}(y; a_t x, b_t I)\, dx, \qquad p_{g,t}(y) = \int_{\mathcal{X}} p_g(x)\, \mathcal{N}(y; a_t x, b_t I)\, dx,$$

$$z \sim p(z), \quad x_g = g_\theta(z), \quad y_g = a_t x_g + \sqrt{b_t}\, \epsilon, \quad \epsilon \sim p(\epsilon),$$

$$D_f(p_{r,t}(y)\,\|\,p_{g,t}(y)) = \int_{\mathcal{X}} p_{g,t}(y)\, f\!\left(\frac{p_{r,t}(y)}{p_{g,t}(y)}\right) dy = \mathbb{E}_{y \sim p_{g,t}(y)}\!\left[f\!\left(\frac{p_{r,t}(y)}{p_{g,t}(y)}\right)\right] = \mathbb{E}_{z \sim p(z),\, \epsilon \sim p(\epsilon)}\!\left[f\!\left(\frac{p_{r,t}(a_t g_\theta(z) + \sqrt{b_t}\,\epsilon)}{p_{g,t}(a_t g_\theta(z) + \sqrt{b_t}\,\epsilon)}\right)\right].$$

Since $\mathcal{N}(y; a_t x, b_t I)$ is assumed to be an isotropic Gaussian distribution, for simplicity, in what follows we show the proof for the univariate Gaussian case, which can easily be extended to the multivariate Gaussian case by the product rule. We first show that under mild conditions, $p_{r,t}(y)$ and $p_{g,t}(y)$ are continuous functions of $y$:

$$\lim_{\Delta y \to 0} p_{r,t}(y - \Delta y) = \lim_{\Delta y \to 0} \int_{\mathcal{X}} p_r(x)\, \mathcal{N}(y - \Delta y; a_t x, b_t)\, dx = \int_{\mathcal{X}} p_r(x) \lim_{\Delta y \to 0} \mathcal{N}(y - \Delta y; a_t x, b_t)\, dx = \int_{\mathcal{X}} p_r(x) \lim_{\Delta y \to 0} \frac{1}{C_1} \exp\!\left(-\frac{((y - \Delta y) - a_t x)^2}{C_2}\right) dx = \int_{\mathcal{X}} p_r(x)\, \mathcal{N}(y; a_t x, b_t)\, dx = p_{r,t}(y),$$

where $C_1$ and $C_2$ are constants. Hence, $p_{r,t}(y)$ is a continuous function of $y$. The proof of continuity for $p_{g,t}(y)$ is exactly the same. Then, given that $g_\theta$ is also a continuous function, it is clear that $D_f(p_{r,t}(y)\,\|\,p_{g,t}(y))$ is a continuous function of $\theta$.

Next, we show that $D_f(p_{r,t}(y)\,\|\,p_{g,t}(y))$ is differentiable. By the chain rule, showing that $D_f(p_{r,t}(y)\,\|\,p_{g,t}(y))$ is differentiable is equivalent to showing that $p_{r,t}(y)$, $p_{g,t}(y)$, and $f$ are differentiable. Usually, $f$ is defined to be differentiable (Nowozin et al., 2016). We have

$$\nabla_\theta\, p_{r,t}(a_t g_\theta(z) + \sqrt{b_t}\,\epsilon) = \nabla_\theta \int_{\mathcal{X}} p_r(x)\, \mathcal{N}(a_t g_\theta(z) + \sqrt{b_t}\,\epsilon;\ a_t x, b_t)\, dx = \int_{\mathcal{X}} p_r(x)\, \frac{1}{C_1} \nabla_\theta \exp\!\left(-\frac{\|a_t g_\theta(z) + \sqrt{b_t}\,\epsilon - a_t x\|_2^2}{C_2}\right) dx,$$

$$\nabla_\theta\, p_{g,t}(a_t g_\theta(z) + \sqrt{b_t}\,\epsilon) = \nabla_\theta \int_{\mathcal{X}} p_g(x)\, \mathcal{N}(a_t g_\theta(z) + \sqrt{b_t}\,\epsilon;\ a_t x, b_t)\, dx = \nabla_\theta\, \mathbb{E}_{z' \sim p(z')}\!\left[\mathcal{N}(a_t g_\theta(z) + \sqrt{b_t}\,\epsilon;\ a_t g_\theta(z'), b_t)\right] = \mathbb{E}_{z' \sim p(z')}\!\left[\frac{1}{C_1} \nabla_\theta \exp\!\left(-\frac{\|a_t g_\theta(z) + \sqrt{b_t}\,\epsilon - a_t g_\theta(z')\|_2^2}{C_2}\right)\right],$$

where $C_1$ and $C_2$ are constants. Hence, $p_{r,t}(y)$ and $p_{g,t}(y)$ are differentiable, which concludes the proof.

**Proof of Theorem 2.** We have $p(y) = \int p(x)\, q(y \mid x)\, dx$ and $p_g(y) = \int p_g(x)\, q(y \mid x)\, dx$. If $p(x) = p_g(x)$, then clearly $p(y) = p_g(y)$; it remains to show the converse. Let $y \sim p(y)$ and $y_g \sim p_g(y)$. Given the assumption on $q(y \mid x)$, we have

$$y = f(x) + h(\epsilon), \quad x \sim p(x),\ \epsilon \sim p(\epsilon), \qquad y_g = f(x_g) + h(\epsilon_g), \quad x_g \sim p_g(x),\ \epsilon_g \sim p(\epsilon).$$

Since $f$ and $h$ are one-to-one mapping functions, $f(x)$ and $h(\epsilon)$ are identifiable, which indicates $f(x) \overset{D}{=} f(x_g) \Leftrightarrow x \overset{D}{=} x_g$. By the property of moment-generating functions (MGFs), given that $f(x)$ is independent of $h(\epsilon)$, we have for all $s$

$$M_y(s) = M_{f(x)}(s)\, M_{h(\epsilon)}(s), \qquad M_{y_g}(s) = M_{f(x_g)}(s)\, M_{h(\epsilon_g)}(s),$$

where $M_y(s) = \mathbb{E}_{y \sim p(y)}[e^{s^\top y}]$ denotes the MGF of the random variable $y$ and the others follow the same form. By the moment-generating function uniqueness theorem, given $y \overset{D}{=} y_g$ and $h(\epsilon) \overset{D}{=} h(\epsilon_g)$, we have $M_y(s) = M_{y_g}(s)$ and $M_{h(\epsilon)}(s) = M_{h(\epsilon_g)}(s)$ for all $s$. Then, we obtain $M_{f(x)}(s) = M_{f(x_g)}(s)$ for all $s$. Thus, $M_{f(x)} = M_{f(x_g)} \Rightarrow f(x) \overset{D}{=} f(x_g) \Rightarrow p(x) = p_g(x)$, which concludes the proof.

**Discussion.** Next, we discuss which $q(y \mid x)$ fits the assumption we made on it. We follow the discussion of reparameterization of distributions as used in Kingma and Welling (2014). Three basic approaches are:

1. Tractable inverse CDF. In this case, let $\epsilon \sim U(0, I)$, and $\psi(\epsilon, y, x)$ be the inverse CDF of $q(y \mid x)$. From $\psi(\epsilon, y, x)$, if $y = f(x) + h(\epsilon)$, for example, $y \sim \mathrm{Cauchy}(x, \gamma)$ or $y \sim \mathrm{Logistic}(x, s)$, then Theorem 2 holds.
2. Analogous to the Gaussian example, $y \sim \mathcal{N}(x, \sigma^2 I) \Leftrightarrow y = x + \sigma \epsilon,\ \epsilon \sim \mathcal{N}(0, I)$. For any location-scale family of distributions we can choose the standard distribution (with location $= 0$, scale $= 1$) as the auxiliary variable $\epsilon$, and let $h(\cdot) = \text{location} + \text{scale} \cdot \epsilon$. Examples: Laplace, Elliptical, Student's t, Logistic, Uniform, Triangular, and Gaussian distributions (see the sketch after this list).
3. Implicit distributions. $q(y \mid x)$ can be modeled by neural networks, which implies $y = f(x) + h(\epsilon),\ \epsilon \sim p(\epsilon)$, where $f$ and $h$ are one-to-one nonlinear transformations.
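As a small, hedged illustration of the location-scale case in item 2 (not taken from the paper), the snippet below writes several choices of $q(y \mid x)$ in the reparameterized form $y = f(x) + h(\epsilon)$ with $f$ the identity and $h$ a fixed one-to-one scaling, which is exactly the structure Theorem 2 requires.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparam_gaussian(x, scale=1.0):
    """y ~ N(x, scale^2): f(x) = x, h(eps) = scale * eps with eps ~ N(0, 1)."""
    return x + scale * rng.standard_normal(np.shape(x))

def reparam_laplace(x, scale=1.0):
    """y ~ Laplace(x, scale): same additive structure with eps ~ Laplace(0, 1)."""
    return x + scale * rng.laplace(0.0, 1.0, np.shape(x))

def reparam_logistic(x, scale=1.0):
    """y ~ Logistic(x, scale): eps ~ Logistic(0, 1)."""
    return x + scale * rng.logistic(0.0, 1.0, np.shape(x))

# Each transformation adds the *same* known noise distribution to real and generated samples,
# so matching the noisy marginals forces the clean marginals to match (Theorem 2).
x_real = rng.normal(0.0, 1.0, 10_000)
y_real = reparam_gaussian(x_real, scale=0.5)
```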
C DERIVATIONS

Derivation of the equality in JSD:
\begin{align*}
\mathrm{JSD}(p(y, t),\, p_g(y, t))
&= \frac{1}{2} \mathbb{E}_{(y,t) \sim p(y,t)}\!\left[ \log \frac{2\, p(y, t)}{p(y, t) + p_g(y, t)} \right]
 + \frac{1}{2} \mathbb{E}_{(y,t) \sim p_g(y,t)}\!\left[ \log \frac{2\, p_g(y, t)}{p(y, t) + p_g(y, t)} \right] \\
&= \frac{1}{2} \mathbb{E}_{t \sim p_\pi(t),\, y \sim p(y \mid t)}\!\left[ \log \frac{2\, p(y \mid t)\, p_\pi(t)}{p(y \mid t)\, p_\pi(t) + p_g(y \mid t)\, p_\pi(t)} \right]
 + \frac{1}{2} \mathbb{E}_{t \sim p_\pi(t),\, y \sim p_g(y \mid t)}\!\left[ \log \frac{2\, p_g(y \mid t)\, p_\pi(t)}{p(y \mid t)\, p_\pi(t) + p_g(y \mid t)\, p_\pi(t)} \right] \\
&= \mathbb{E}_{t \sim p_\pi(t)}\!\left[ \frac{1}{2} \mathbb{E}_{y \sim p(y \mid t)}\!\left[ \log \frac{2\, p(y \mid t)}{p(y \mid t) + p_g(y \mid t)} \right]
 + \frac{1}{2} \mathbb{E}_{y \sim p_g(y \mid t)}\!\left[ \log \frac{2\, p_g(y \mid t)}{p(y \mid t) + p_g(y \mid t)} \right] \right] \\
&= \mathbb{E}_{t \sim p_\pi(t)}\!\left[ \mathrm{JSD}(p(y \mid t),\, p_g(y \mid t)) \right].
\end{align*}

D DETAILS OF TOY EXAMPLE

Here, we provide a detailed analysis of the JS divergence toy example.

Notation. Let $\mathcal{X}$ be a compact metric set (such as the space of images $[0, 1]^d$) and let $\mathrm{Prob}(\mathcal{X})$ denote the space of probability measures defined on $\mathcal{X}$. Let $P_r$ be the target data distribution and $P_g$ be the generator distribution.¹ The JSD between two distributions $P_r, P_g \in \mathrm{Prob}(\mathcal{X})$ is defined as
\begin{align*}
\mathcal{D}_{\mathrm{JS}}(P_r \,\|\, P_g) = \tfrac{1}{2} \mathcal{D}_{\mathrm{KL}}(P_r \,\|\, P_m) + \tfrac{1}{2} \mathcal{D}_{\mathrm{KL}}(P_g \,\|\, P_m), \tag{7}
\end{align*}
where $P_m$ is the mixture $(P_r + P_g)/2$ and $\mathcal{D}_{\mathrm{KL}}$ denotes the Kullback–Leibler divergence, i.e., $\mathcal{D}_{\mathrm{KL}}(P_r \,\|\, P_g) = \int_{\mathcal{X}} p_r(x) \log\!\big(\frac{p_r(x)}{p_g(x)}\big)\, dx$. More generally, the $f$-divergence (Nowozin et al., 2016) between $P_r$ and $P_g$ is defined as
\begin{align*}
\mathcal{D}_f(P_r \,\|\, P_g) = \int_{\mathcal{X}} p_g(x)\, f\!\left(\frac{p_r(x)}{p_g(x)}\right) dx,
\end{align*}
where the generator function $f: \mathbb{R}_+ \to \mathbb{R}$ is a convex and lower-semicontinuous function satisfying $f(1) = 0$. We refer to Nowozin et al. (2016) for more details.

We recall the typical example introduced in Arjovsky and Bottou (2017) and follow their notation.

Example. Let $Z \sim U[0, 1]$ be uniformly distributed on the unit interval. Let $X \sim P_r$ be the distribution of $(0, Z) \in \mathbb{R}^2$, which places a 0 on the x-axis and the random variable $Z$ on the y-axis. Let $X_g \sim P_g$ be the distribution of $(\theta, Z) \in \mathbb{R}^2$, where $\theta$ is a single real parameter. In this case, $\mathcal{D}_{\mathrm{JS}}(P_r \,\|\, P_g)$ is not continuous in $\theta$:
\begin{align*}
\mathcal{D}_{\mathrm{JS}}(P_r \,\|\, P_g) =
\begin{cases}
0 & \text{if } \theta = 0, \\
\log 2 & \text{if } \theta \neq 0,
\end{cases}
\end{align*}
which cannot provide a usable gradient for training.

¹For notational simplicity, $g$ and $G$ both denote the generator network in GANs in this paper.

Figure 6: We show the data distribution and $\mathcal{D}_{\mathrm{JS}}(P_r \,\|\, P_g)$. (Left panel: $\mathcal{D}_{\mathrm{JS}}(q(y \mid t) \,\|\, q_g(y \mid t))$; right panel: the optimal discriminator value $D^*(x)$; both shown for $t \in \{0, 200, 400, 500, 600, 800\}$.)

The derivation is as follows:
\begin{align*}
\mathcal{D}_{\mathrm{JS}}(P_r \,\|\, P_g)
&= \frac{1}{2} \mathbb{E}_{x \sim P_r}\!\left[ \log \frac{2\, p_r(x)}{p_r(x) + p_g(x)} \right]
 + \frac{1}{2} \mathbb{E}_{y \sim P_g}\!\left[ \log \frac{2\, p_g(y)}{p_r(y) + p_g(y)} \right] \\
&= \frac{1}{2} \mathbb{E}_{x_1 = 0,\, x_2 \sim U[0,1]}\!\left[ \log \frac{2 \cdot \mathbb{1}[x_1 = 0]\, U(x_2)}{\mathbb{1}[x_1 = 0]\, U(x_2) + \mathbb{1}[x_1 = \theta]\, U(x_2)} \right]
 + \frac{1}{2} \mathbb{E}_{y_1 = \theta,\, y_2 \sim U[0,1]}\!\left[ \log \frac{2 \cdot \mathbb{1}[y_1 = \theta]\, U(y_2)}{\mathbb{1}[y_1 = 0]\, U(y_2) + \mathbb{1}[y_1 = \theta]\, U(y_2)} \right] \\
&= \frac{1}{2} \mathbb{E}_{x_1 = 0}\!\left[ \log \frac{2 \cdot \mathbb{1}[x_1 = 0]}{\mathbb{1}[x_1 = 0] + \mathbb{1}[x_1 = \theta]} \right]
 + \frac{1}{2} \mathbb{E}_{y_1 = \theta}\!\left[ \log \frac{2 \cdot \mathbb{1}[y_1 = \theta]}{\mathbb{1}[y_1 = 0] + \mathbb{1}[y_1 = \theta]} \right] \\
&= \begin{cases}
0 & \text{if } \theta = 0, \\
\log 2 & \text{if } \theta \neq 0.
\end{cases}
\end{align*}
Although this simple example features distributions with disjoint supports, the same conclusion holds when the supports have a non-empty intersection contained in a set of measure zero (Arjovsky and Bottou, 2017). This happens to be the case when two low-dimensional manifolds intersect in general position (Arjovsky and Bottou, 2017). To avoid the potential issue caused by non-overlapping distribution supports, a common remedy is to use the Wasserstein-1 distance, which in theory can still provide a usable gradient (Arjovsky and Bottou, 2017; Arjovsky et al., 2017). In this case, the Wasserstein-1 distance is $|\theta|$.
Diffusion-based noise injection. In general, with our diffusion-based noise injected, we have
\begin{align*}
p_{r,t}(y) &= \int_{\mathcal{X}} p_r(x)\, \mathcal{N}\big(y;\, \sqrt{\bar{\alpha}_t}\, x,\, (1 - \bar{\alpha}_t)\sigma^2 I\big)\, dx, \\
p_{g,t}(y) &= \int_{\mathcal{X}} p_g(x)\, \mathcal{N}\big(y;\, \sqrt{\bar{\alpha}_t}\, x,\, (1 - \bar{\alpha}_t)\sigma^2 I\big)\, dx, \\
\mathcal{D}_{\mathrm{JS}}(p_{r,t} \,\|\, p_{g,t})
&= \frac{1}{2} \mathbb{E}_{y \sim p_{r,t}}\!\left[ \log \frac{2\, p_{r,t}(y)}{p_{r,t}(y) + p_{g,t}(y)} \right]
 + \frac{1}{2} \mathbb{E}_{y \sim p_{g,t}}\!\left[ \log \frac{2\, p_{g,t}(y)}{p_{r,t}(y) + p_{g,t}(y)} \right].
\end{align*}
For the previous example, we have $Y_t$ and $Y_{g,t}$ such that
\begin{align*}
Y_t = (y_1, y_2) \sim p_{r,t} = \mathcal{N}(y_1 \mid 0, b_t)\, f(y_2), \qquad
Y_{g,t} = (y_{g,1}, y_{g,2}) \sim p_{g,t} = \mathcal{N}(y_{g,1} \mid a_t \theta, b_t)\, f(y_{g,2}),
\end{align*}
where $f(\cdot) = \int_0^1 \mathcal{N}(\cdot \mid a_t \tilde{Z}, b_t)\, U(\tilde{Z})\, d\tilde{Z}$, and $a_t$ and $b_t$ are abbreviations for $\sqrt{\bar{\alpha}_t}$ and $(1 - \bar{\alpha}_t)\sigma^2$. The supports of $Y_t$ and $Y_{g,t}$ are both the whole metric space $\mathbb{R}^2$, and they overlap with each other to an extent that depends on $t$, as shown in Figure 2. As $t$ increases, the high-density regions of $Y_t$ and $Y_{g,t}$ move closer, since the weight $a_t$ decreases towards 0. Then, we derive the JS divergence:
\begin{align*}
\mathcal{D}_{\mathrm{JS}}(p_{r,t} \,\|\, p_{g,t})
&= \frac{1}{2} \mathbb{E}_{y_1 \sim \mathcal{N}(y_1 \mid 0, b_t),\, y_2 \sim f(y_2)}\!\left[ \log \frac{2\, \mathcal{N}(y_1 \mid 0, b_t)\, f(y_2)}{\mathcal{N}(y_1 \mid 0, b_t)\, f(y_2) + \mathcal{N}(y_1 \mid a_t \theta, b_t)\, f(y_2)} \right] \\
&\quad + \frac{1}{2} \mathbb{E}_{y_{g,1} \sim \mathcal{N}(y_{g,1} \mid a_t \theta, b_t),\, y_{g,2} \sim f(y_{g,2})}\!\left[ \log \frac{2\, \mathcal{N}(y_{g,1} \mid a_t \theta, b_t)\, f(y_{g,2})}{\mathcal{N}(y_{g,1} \mid 0, b_t)\, f(y_{g,2}) + \mathcal{N}(y_{g,1} \mid a_t \theta, b_t)\, f(y_{g,2})} \right] \\
&= \frac{1}{2} \mathbb{E}_{y_1 \sim \mathcal{N}(0, b_t)}\!\left[ \log \frac{2\, \mathcal{N}(y_1 \mid 0, b_t)}{\mathcal{N}(y_1 \mid 0, b_t) + \mathcal{N}(y_1 \mid a_t \theta, b_t)} \right]
 + \frac{1}{2} \mathbb{E}_{y_{g,1} \sim \mathcal{N}(a_t \theta, b_t)}\!\left[ \log \frac{2\, \mathcal{N}(y_{g,1} \mid a_t \theta, b_t)}{\mathcal{N}(y_{g,1} \mid 0, b_t) + \mathcal{N}(y_{g,1} \mid a_t \theta, b_t)} \right],
\end{align*}
which is clearly continuous and differentiable in $\theta$. We show this $\mathcal{D}_{\mathrm{JS}}(p_{r,t} \,\|\, p_{g,t})$ for increasing $t$ values over a grid of $\theta$ in the second row of Figure 2. As shown in the left panel, the black line with $t = 0$ is the original JSD, which is not even continuous, while as the diffusion level $t$ increases, the curves become smoother and flatter. These smooth curves provide good learning signals for $\theta$; recall that the Wasserstein-1 distance is $|\theta|$ in this case. Meanwhile, we observe that with intense diffusion, e.g., $t = 800$, the curve becomes flat, which indicates smaller gradients and a much slower learning process. This motivates the use of adaptive diffusion, which can provide different levels of gradient smoothness and is likely better for training. The right panel shows the optimal discriminator outputs over the space $\mathcal{X}$. With diffusion, the optimal discriminator is well defined over the whole space and its gradient is smooth, whereas without diffusion the optimal discriminator is only valid at the two star-marked points. Interestingly, we find that smaller $t$ drives the optimal discriminator to be more assertive, while larger $t$ makes it more neutral. The diffusion here works like a scale that balances the power of the discriminator.
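The smoothing effect described above is easy to reproduce numerically. The sketch below Monte-Carlo-estimates $\mathcal{D}_{\mathrm{JS}}(\mathcal{N}(0, b_t) \,\|\, \mathcal{N}(a_t\theta, b_t))$ on a grid of $\theta$ for several diffusion steps $t$, mirroring the curves in Figure 2. The linear $\beta_t$ schedule follows Appendix G, but $\sigma$ is set to 1 here purely so the smoothing is visible on this $\theta$ range, and the grid and sample sizes are illustrative choices, not the paper's plotting code.

```python
import numpy as np
from scipy.stats import norm

def jsd_mc(theta, a_t, b_t, n=200_000, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of JSD between N(0, b_t) and N(a_t * theta, b_t)."""
    s = np.sqrt(b_t)
    y_r = rng.normal(0.0, s, size=n)          # diffused "real" samples
    y_g = rng.normal(a_t * theta, s, size=n)  # diffused "generated" samples
    p = lambda y: norm.pdf(y, loc=0.0, scale=s)
    q = lambda y: norm.pdf(y, loc=a_t * theta, scale=s)
    term_r = np.log(2 * p(y_r) / (p(y_r) + q(y_r))).mean()
    term_g = np.log(2 * q(y_g) / (p(y_g) + q(y_g))).mean()
    return 0.5 * term_r + 0.5 * term_g

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
sigma = 1.0  # exaggerated noise scale for visibility; images use sigma = 0.05
for t in [200, 400, 800]:
    a_t = np.sqrt(alpha_bar[t - 1])
    b_t = (1.0 - alpha_bar[t - 1]) * sigma**2
    curve = [round(jsd_mc(th, a_t, b_t), 3) for th in np.linspace(-3, 3, 7)]
    print(f"t={t}: {curve}")
```

The printed curves flatten and smooth out as $t$ grows, matching the qualitative behavior described for Figure 2, whereas the $t = 0$ case is the discontinuous 0-versus-$\log 2$ step derived above.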
E DATASET DESCRIPTIONS

The CIFAR-10 dataset consists of 50k 32×32 training images in 10 categories. The STL-10 dataset, derived from ImageNet (Deng et al., 2009), consists of 100k unlabeled images in 10 categories, which we resize to 64×64 resolution. For the LSUN datasets, we sample 200k images from LSUN-Bedroom and use all 125k images from LSUN-Church, resizing them to 256×256 resolution for training. The AFHQ datasets include around 5k 512×512 images per category for dogs, cats, and wildlife; we train a separate network for each of them. FFHQ contains 70k images crawled from Flickr at 1024×1024 resolution, and we use all of them for training.

F ALGORITHM

We provide the Diffusion-GAN training procedure in Algorithm 1.

Algorithm 1: Diffusion-GAN
while i ≤ number of training iterations do
    Step I: Update discriminator
        Sample a minibatch of m noise samples {z1, ..., zm} ~ pz(z).
        Obtain generated samples {xg,1, ..., xg,m} via xg = G(z).
        Sample a minibatch of m data examples {x1, ..., xm} ~ p(x).
        Sample {t1, ..., tm} from the tepl list uniformly with replacement.
        For j in {1, ..., m}, sample yj ~ q(yj | xj, tj) and yg,j ~ q(yg,j | xg,j, tj).
        Update the discriminator by maximizing Equation (3).
    Step II: Update generator
        Sample a minibatch of m noise samples {z1, ..., zm} ~ pz(z).
        Obtain generated samples {xg,1, ..., xg,m} via xg = G(z).
        Sample {t1, ..., tm} from the tepl list with replacement.
        For j in {1, ..., m}, sample yg,j ~ q(yg,j | xg,j, tj).
        Update the generator by minimizing Equation (3).
    Step III: Update diffusion
        if i mod 4 == 0 then
            Update T by Equation (5).
            Sample tepl = [0, ..., 0, t1, ..., t32], where tk ~ pπ for k in {1, ..., 32} and pπ is given in Equation (6). (tepl has 64 entries.)
        end if
end while

G HYPERPARAMETERS

Diffusion-GAN is built on existing GAN backbones, so we keep the learning hyperparameters of the original backbones untouched. Diffusion-GAN introduces four new hyperparameters: the noise standard deviation σ, the maximum number of diffusion steps Tmax, the T-increase threshold dtarget, and the t sampling distribution pπ.

σ is fixed at 0.05 for images (with pixel values rescaled to [-1, 1]) in all our experiments and performs well. Tmax can be fixed at 500 or 1000, depending on the diversity of the dataset; we recommend a large Tmax for diverse datasets. dtarget is usually fixed at 0.6 and does not influence performance much. pπ has two choices, 'uniform' and 'priority'. Generally, (σ = 0.05, Tmax = 500, dtarget = 0.6, pπ = uniform) is a good starting point for a new dataset.

In our experiments, we find that StyleGAN2-based models are not sensitive to the value of dtarget, so we set dtarget = 0.6 for them across all datasets, except that we set dtarget = 0.8 for FFHQ, which is slightly better than 0.6 in FID. We report the dtarget values of Diffusion ProjectedGAN in Table 4. We also evaluated the two t sampling distributions pπ ∈ {priority, uniform} defined in Equation (6). In most cases, 'priority' works slightly better, while in some cases, such as FFHQ, 'uniform' is better. Overall, we did not modify the model architectures or the training hyperparameters, such as learning rate and batch size. The forward diffusion configuration and model training configurations are as follows.

Table 4: dtarget for Diffusion ProjectedGAN.
Dataset | dtarget
CIFAR-10 (32×32, 50k images) | 0.45
STL-10 (64×64, 100k images) | 0.6
LSUN-Church (256×256, 120k images) | 0.2
LSUN-Bedroom (256×256, 200k images) | 0.2

Diffusion config. For our diffusion-based noise injection, we use a linearly increasing schedule for βt, where t ∈ {1, 2, ..., T}. For pixel-level injection in StyleGAN2, we follow Ho et al. (2020b) and set β0 = 0.0001 and βT = 0.02. We adaptively adjust T within the range Tmin = 5 to Tmax = 1000. The image pixels are rescaled to [-1, 1], so we set the Gaussian noise standard deviation to σ = 0.05. For feature-level injection in Diffusion ProjectedGAN, we set β0 = 0.0001, βT = 0.01, Tmin = 5, Tmax = 500, and σ = 0.5. We list all these values in Table 5.

Table 5: Diffusion config.
Diffusion config for pixel, priority: β0 = 0.0001, βT = 0.02, Tmin = 5, Tmax = 1000, σ = 0.05
Diffusion config for pixel, uniform: β0 = 0.0001, βT = 0.02, Tmin = 5, Tmax = 500, σ = 0.05
Diffusion config for feature: β0 = 0.0001, βT = 0.01, Tmin = 5, Tmax = 500, σ = 0.5
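To make these settings concrete, the following sketch builds the tepl list of Algorithm 1 and adaptively adjusts T. Because Equations (5) and (6) are not reproduced in this appendix, the priority weighting (probability proportional to t) and the sign-based update with a fixed step size are our assumptions, marked as such in the comments; they are stand-ins for the paper's exact formulas.

```python
import numpy as np

rng = np.random.default_rng(0)
T_MIN, T_MAX, D_TARGET = 5, 1000, 0.6  # pixel-level config for Diffusion StyleGAN2

def sample_t_epl(T, p_pi="priority", n_explore=32, n_total=64):
    """Build the tepl list of Algorithm 1: zeros plus n_explore draws over {1, ..., T}.
    The 'priority' weighting p(t) proportional to t is an assumed form of Equation (6)."""
    w = np.arange(1, T + 1, dtype=float) if p_pi == "priority" else np.ones(T)
    draws = rng.choice(np.arange(1, T + 1), size=n_explore, p=w / w.sum())
    return np.concatenate([np.zeros(n_total - n_explore, dtype=int), draws])

def update_T(T, r_d, step=32):
    """Hypothetical stand-in for Equation (5): increase T when the discriminator
    overfitting estimate r_d exceeds d_target, decrease it otherwise, then clip."""
    T = T + int(np.sign(r_d - D_TARGET)) * step
    return int(np.clip(T, T_MIN, T_MAX))

# Step III of Algorithm 1, executed every fourth iteration:
T = 500
r_d = 0.72            # running estimate of discriminator overfitting (placeholder value)
T = update_T(T, r_d)  # r_d > d_target, so T grows
t_epl = sample_t_epl(T, p_pi="priority")
```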
Model config. For StyleGAN2-based models, we borrow the config settings provided by Karras et al. (2020a), which include ['auto', 'stylegan2', 'cifar', 'paper256', 'paper512', 'paper1024']. We create the 'stl' config based on 'cifar' with a small modification: we change the gamma term to 0.01. For ProjectedGAN models, we use the recommended default config (Sauer et al., 2021), which is based on FastGAN (Liu et al., 2020). We report the config settings used in our experiments in Table 6.

Table 6: The config settings of StyleGAN2-based and ProjectedGAN-based models.
Dataset | Model | Config | Specification
CIFAR-10 (32×32) | StyleGAN2 | cifar | -
CIFAR-10 (32×32) | Diffusion StyleGAN2 | cifar | diffusion-pixel, dtarget = 0.6, priority
CIFAR-10 (32×32) | ProjectedGAN | default | -
CIFAR-10 (32×32) | Diffusion ProjectedGAN | default | diffusion-feature
STL-10 (64×64) | StyleGAN2 | stl | -
STL-10 (64×64) | Diffusion StyleGAN2 | stl | diffusion-pixel, dtarget = 0.6, priority
STL-10 (64×64) | ProjectedGAN | default | -
STL-10 (64×64) | Diffusion ProjectedGAN | default | diffusion-feature
LSUN-Bedroom (256×256) | StyleGAN2 | paper256 | -
LSUN-Bedroom (256×256) | Diffusion StyleGAN2 | paper256 | diffusion-pixel, dtarget = 0.6, priority
LSUN-Bedroom (256×256) | ProjectedGAN | default | -
LSUN-Bedroom (256×256) | Diffusion ProjectedGAN | default | diffusion-feature
LSUN-Church (256×256) | StyleGAN2 | paper256 | -
LSUN-Church (256×256) | Diffusion StyleGAN2 | paper256 | diffusion-pixel, dtarget = 0.6, priority
LSUN-Church (256×256) | ProjectedGAN | default | -
LSUN-Church (256×256) | Diffusion ProjectedGAN | default | diffusion-feature
AFHQ-Cat/Dog/Wild (512×512) | StyleGAN2 | paper512 | -
AFHQ-Cat/Dog/Wild (512×512) | Diffusion StyleGAN2 | paper512 | diffusion-pixel, dtarget = 0.6, priority
AFHQ-Cat/Dog/Wild (512×512) | InsGen | default | -
AFHQ-Cat/Dog/Wild (512×512) | Diffusion InsGen | paper512 | diffusion-pixel, dtarget = 0.6, uniform
FFHQ (1024×1024) | StyleGAN2 | stylegan2 | -
FFHQ (1024×1024) | Diffusion StyleGAN2 | stylegan2 | diffusion-pixel, dtarget = 0.8, uniform
FFHQ (1024×1024) | InsGen | default | -
FFHQ (1024×1024) | Diffusion InsGen | stylegan2 | diffusion-pixel, dtarget = 0.6, uniform

H IMPLEMENTATION DETAILS

We implement an additional diffusion sampling pipeline, whose configurations are given in Appendix G. The T in the forward diffusion process is adaptively adjusted and clipped to [Tmin, Tmax]. As illustrated in Algorithm 1, at each update step we sample t from the tepl list for each data point x and then use the analytic Gaussian distribution at diffusion step t to sample y. We then use y and t, instead of x, for optimization.

Diffusion StyleGAN2. We inherit all network architectures from StyleGAN2 as implemented by Karras et al. (2020a). To inject t, we modify the mapping network inside the discriminator, which exists for label conditioning and is unused in unconditional image generation tasks. Specifically, we replace the original input of the mapping network, the class label c, with our discrete timestep t. We then train the generator and discriminator with the diffused samples y and t.

Diffusion ProjectedGAN. To simplify the implementation and minimize the modifications to ProjectedGAN, we construct the discriminator as Dϕ(y), where t is ignored. Our method is plugged in as a data augmentation method: the only change in the optimization stage is that the discriminator is fed with diffused images y instead of the original images x.
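To illustrate this plug-in view, below is a minimal PyTorch-style sketch of one discriminator update in which both real and generated images are diffused before being scored. The discriminator, generator, and non-saturating logistic loss are hypothetical stand-ins for whichever backbone is being wrapped (Equation (3) is not reproduced here), and the timestep argument can simply be dropped for a ProjectedGAN-style D(y).

```python
import torch
import torch.nn.functional as F

def diffuse(x, t, alpha_bar, sigma=0.05):
    """y = sqrt(alpha_bar_t) * x + sqrt(1 - alpha_bar_t) * sigma * eps.
    alpha_bar is a 1-D tensor indexed by t, with alpha_bar[0] = 1 (t = 0 means no diffusion)."""
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    b = ((1.0 - alpha_bar[t]) * sigma ** 2).view(-1, 1, 1, 1)
    return a * x + b.sqrt() * torch.randn_like(x)

def discriminator_step(D, G, x_real, z, t_epl, alpha_bar, opt_d):
    """One discriminator update with diffusion-based augmentation (backbone-agnostic sketch)."""
    t = t_epl[torch.randint(len(t_epl), (x_real.size(0),))]  # sample t from tepl with replacement
    y_real = diffuse(x_real, t, alpha_bar)
    y_fake = diffuse(G(z).detach(), t, alpha_bar)
    # Timestep-dependent discriminator D(y, t); a stand-in logistic loss replaces Equation (3).
    loss = F.softplus(-D(y_real, t)).mean() + F.softplus(D(y_fake, t)).mean()
    opt_d.zero_grad()
    loss.backward()
    opt_d.step()
    return loss.item()
```

The generator step mirrors this, minimizing the corresponding loss on freshly diffused fake samples, with gradients flowing back through the differentiable diffuse transform.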
Diffusion InsGen. To simplify the implementation and minimize the modifications to InsGen, we keep its contrastive learning part untouched. We modify the original discriminator network to inject t, similarly to Diffusion StyleGAN2, and then train the generator and discriminator with the diffused samples y and t.

I ABLATION ON THE MIXING PROCEDURE AND T ADAPTIVENESS

Note that the mixing procedure described in Equation (6), referred to as priority mixing in what follows, is designed based on our intuition. Here we conduct an ablation study on the mixing procedure by comparing priority mixing with uniform mixing on three representative datasets. We report the FID results in Table 7; they suggest that uniform mixing can work better than priority mixing on some datasets, and hence Diffusion-GAN may be further improved by optimizing its mixing procedure for the training data. While optimizing the mixing procedure is beyond the focus of this paper, it is worth further investigation in future studies.

Table 7: Ablation study on the mixing procedure. Priority mixing refers to the mixing procedure in Equation (6), and uniform mixing refers to sampling t uniformly at random.
Mixing | CIFAR-10 | STL-10 | FFHQ
Priority Mixing | 3.19 | 11.43 | 3.22
Uniform Mixing | 3.44 | 11.75 | 2.83

We further conduct an ablation study on whether T needs to be adaptively adjusted. As shown in Figure 7, with the adaptive diffusion strategy the FID training curves converge faster and reach lower final FIDs.

Figure 7: Ablation study on T adaptiveness. (FID versus training progress in millions of real images, comparing uniform and priority sampling with and without an adaptive T.)

J MORE GAN VARIANTS

To further validate our noise injection via diffusion-based mixtures, we add our diffusion-based training to two more representative GAN variants, DCGAN (Radford et al., 2015) and SNGAN (Miyato et al., 2018b), which have quite different architectures from StyleGAN2. We provide the FIDs for CIFAR-10 in Table 8. Both Diffusion-DCGAN and Diffusion-SNGAN clearly outperform their corresponding baseline GANs.

Table 8: FIDs on CIFAR-10 for DCGAN, Diffusion-DCGAN, SNGAN, and Diffusion-SNGAN.
Dataset | DCGAN (Radford et al., 2015) | Diffusion-DCGAN | SNGAN (Miyato et al., 2018b) | Diffusion-SNGAN
CIFAR-10 | 28.65 | 24.67 | 20.76 | 17.23

K INCEPTION SCORE FOR CIFAR-10

We report the Inception Score (IS) (Salimans et al., 2016) of Diffusion StyleGAN2 on CIFAR-10 in Table 9 and include other state-of-the-art GANs and diffusion models as baselines. Since CIFAR-10 is a well-known dataset tested by almost all baselines, we pick it here and reference the IS values reported in the original papers for a fair comparison.

Table 9: Inception Score for CIFAR-10. For sampling time, we use the number of function evaluations (NFE).
Method | IS | FID | Recall | NFE
DDPM (Ho et al., 2020a) | 9.46 | 3.21 | 0.57 | 1000
DDIM (Song et al., 2020) | 8.78 | 4.67 | 0.53 | 50
Denoising Diffusion GAN (Xiao et al., 2021) | 9.63 | 3.75 | 0.57 | 4
StyleGAN2 (Karras et al., 2020a) | 9.18 | 8.32 | 0.41 | 1
StyleGAN2 + DiffAug (Zhao et al., 2020) | 9.40 | 5.79 | 0.42 | 1
StyleGAN2 + ADA (Karras et al., 2020a) | 9.83 | 2.92 | 0.49 | 1
Diffusion StyleGAN2 | 9.94 | 3.19 | 0.58 | 1

L MORE GENERATED IMAGES

We provide more randomly generated images for the LSUN-Bedroom, LSUN-Church, AFHQ, and FFHQ datasets in Figure 8, Figure 9, and Figure 10.
Figure 8: More generated images for LSUN-Bedroom (FID 1.43, Recall 0.58) and LSUN-Church (FID 1.85, Recall 0.65) from Diffusion ProjectedGAN.

Figure 9: More generated images for AFHQ-Cat (FID 2.40), AFHQ-Dog (FID 4.83), and AFHQ-Wild (FID 1.51) from Diffusion InsGen.

Figure 10: More generated images for FFHQ from Diffusion StyleGAN2 (FID 3.71, Recall 0.43).