Published as a conference paper at ICLR 2024

# ELUCIDATING THE EXPOSURE BIAS IN DIFFUSION MODELS

Mang Ning, Utrecht University, m.ning@uu.nl
Mingxiao Li, KU Leuven, mingxiao.li@cs.kuleuven.be
Jianlin Su, Moonshot AI Ltd., bojone@spaces.ac.cn
Albert Ali Salah, Utrecht University, a.a.salah@uu.nl
Itir Onal Ertugrul, Utrecht University, i.onalertugrul@uu.nl

ABSTRACT

Diffusion models have demonstrated impressive generative capabilities, but their exposure bias problem, described as the input mismatch between training and sampling, lacks in-depth exploration. In this paper, we investigate the exposure bias problem in diffusion models by first analytically modelling the sampling distribution, based on which we then identify the prediction error at each sampling step as the root cause of the exposure bias issue. Furthermore, we discuss potential solutions to this issue and propose an intuitive metric for it. Along with the elucidation of exposure bias, we propose a simple yet effective, training-free method called Epsilon Scaling to alleviate the exposure bias. We show that Epsilon Scaling explicitly moves the sampling trajectory closer to the vector field learned in the training phase by scaling down the network output, mitigating the input mismatch between training and sampling. Experiments on various diffusion frameworks (ADM, DDIM, EDM, LDM, DiT, PFGM++) verify the effectiveness of our method. Remarkably, our ADM-ES, as a state-of-the-art stochastic sampler, obtains 2.17 FID on CIFAR-10 under 100-step unconditional generation. The code is available at https://github.com/forever208/ADM-ES

1 INTRODUCTION

Due to their outstanding generation quality and diversity, diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song & Ermon, 2019) have achieved unprecedented success in image generation (Dhariwal & Nichol, 2021; Nichol et al., 2022; Rombach et al., 2022; Saharia et al., 2022), audio synthesis (Kong et al., 2021; Chen et al., 2021) and video generation (Ho et al., 2022). Unlike generative adversarial networks (GANs) (Goodfellow et al., 2014), variational autoencoders (VAEs) (Kingma & Welling, 2014) and flow-based models (Dinh et al., 2014; 2017), diffusion models stably learn the data distribution through a noise/score prediction objective and progressively remove noise from random initial vectors in the iterative sampling stage.

A key feature of diffusion models is that good sample quality requires a long iterative sampling chain, since the Gaussian assumption of reverse diffusion only holds for small step sizes (Xiao et al., 2022). However, Ning et al. (2023) claim that the iterative sampling chain also leads to the exposure bias problem (Ranzato et al., 2016; Schmidt, 2019). Concretely, given the noise prediction network ε_θ(·), exposure bias refers to the input mismatch between training and inference: the former is always exposed to the ground truth training sample x_t, while the latter depends on the previously generated sample x̂_t. The difference between x_t and x̂_t causes a discrepancy between ε_θ(x_t) and ε_θ(x̂_t), which leads to error accumulation and sampling drift (Li et al., 2023a). We point out that the exposure bias problem in diffusion models lacks in-depth exploration. For example, there is no proper metric to quantify the exposure bias and no explicit error analysis for it.
To shed light on exposure bias, we conduct a systematic investigation by first modelling the sampling distribution with the prediction error taken into account. Based on our analysis, we find that the practical sampling distribution has a larger variance than the ground truth distribution at every single step, demonstrating the analytic difference between x_t in training and x̂_t in sampling. Along with the sampling distribution analysis, we propose a metric δ_t to evaluate exposure bias by comparing the variance difference between training and sampling. Finally, we discuss potential solutions to exposure bias and propose a simple yet effective, training-free, plug-in method called Epsilon Scaling to alleviate this issue. We test our approach on a wide range of diffusion frameworks using deterministic and stochastic sampling, and on conditional and unconditional generation tasks. Without affecting recall and precision (Kynkäänniemi et al., 2019), our method yields dramatic Fréchet Inception Distance (FID) (Heusel et al., 2017) improvements. We also illustrate that Epsilon Scaling effectively reduces the exposure bias by moving the sampling trajectory towards the training trajectory. Overall, our contributions to diffusion models are:

- We investigate the exposure bias problem in depth and propose a metric for it.
- We suggest potential solutions to the exposure bias issue and propose a training-free, plug-in method (Epsilon Scaling) which significantly improves the sample quality.
- Our extensive experiments demonstrate the generality of Epsilon Scaling and its wide applicability to different diffusion architectures.

2 RELATED WORK

Diffusion models were introduced by Sohl-Dickstein et al. (2015) and later improved by Song & Ermon (2019), Ho et al. (2020) and Nichol & Dhariwal (2021). Song et al. (2021b) unify score-based models and Denoising Diffusion Probabilistic Models (DDPMs) via stochastic differential equations. Furthermore, Karras et al. (2022) disentangle the design space of diffusion models and introduce the EDM model to further boost the performance in image generation. With the advances in diffusion theory, conditional generation (Ho & Salimans, 2022; Choi et al., 2021) also flourishes in various scenarios, including text-to-image generation (Nichol et al., 2022; Ramesh et al., 2022; Rombach et al., 2022; Saharia et al., 2022), controllable image synthesis (Zhang & Agrawala, 2023; Li et al., 2023b; Zheng et al., 2023), as well as the generation of other modalities, for instance audio (Chen et al., 2021; Kong et al., 2021), object shapes (Zhou et al., 2021) and time series (Rasul et al., 2021). In the meantime, accelerating the time-consuming reverse diffusion sampling has been extensively investigated (Song et al., 2021a; Lu et al., 2022; Liu et al., 2022). For example, distillation (Salimans & Ho, 2022), the Restart sampler (Xu et al., 2023a) and fast ODE samplers (Zhao et al., 2023) have been proposed to speed up the sampling.

The exposure bias in diffusion models was first identified by Ning et al. (2023), who introduced a training regularisation term to simulate the sampling prediction errors from the Lipschitz continuity perspective. Additionally, Li et al. (2023a) alleviated exposure bias without retraining by manipulating the time step during the backward generation process. More recently, Li & van der Schaar (2023) estimated the upper bound of the cumulative error and optimised it during training.
However, the exposure bias in diffusion models still lacks illuminating research in terms of the explicit sampling distribution, a metric, and the root cause, which is the objective of this paper. Besides, our solution to exposure bias is training-free and outperforms previous methods.

3 EXPOSURE BIAS IN DIFFUSION MODELS

3.1 SAMPLING DISTRIBUTION WITH PREDICTION ERROR

Given a sample x_0 from the data distribution q(x_0) and a noise schedule β_t, DDPM (Ho et al., 2020) defines the forward perturbation as $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$. The Gaussian forward process allows us to directly sample x_t conditioned on the input x_0:

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\, \sqrt{\bar\alpha_t}\,x_0,\, (1-\bar\alpha_t)I\big), \qquad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \tag{1}$$

where $\alpha_t = 1-\beta_t$, $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$ and $\epsilon \sim \mathcal{N}(0, I)$. The reverse diffusion is approximated by a neural network $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$ and the optimisation objective is $D_{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)$, in which the posterior $q(x_{t-1} \mid x_t, x_0)$ is tractable when conditioned on x_0 using Bayes' theorem:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\, \tilde\mu(x_t, x_0),\, \tilde\beta_t I\big) \tag{2}$$

$$\tilde\mu(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t \tag{3}$$

$$\tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t \tag{4}$$

Regarding the parametrisation of μ_θ(x_t, t), Ho et al. (2020) found that using a neural network to predict ε (Eq. 6) worked better than predicting x_0 (Eq. 5) in practice:

$$\mu_\theta(x_t, t) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_\theta(x_t, t) + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t \tag{5}$$

$$= \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\Big), \tag{6}$$

where x_θ(x_t, t) denotes the denoising model which predicts x_0 given x_t. For simplicity, we use x_θ^t as a short notation for x_θ(x_t, t) in the rest of this paper. Comparing Eq. 3 with Eq. 5, Song et al. (2021a) emphasise that the sampling distribution p_θ(x_{t−1} | x_t) is in fact parameterised as q(x_{t−1} | x_t, x_θ^t), where x_θ^t is the predicted x_0 given x_t. Therefore, the practical sampling paradigm is that we first predict ε using ε_θ(x_t, t), then derive the estimate x_θ^t of x_0 using Eq. 1, and finally generate x_{t−1} from q(x_{t−1} | x_t, x_θ^t), i.e. the posterior q(x_{t−1} | x_t, x_0) with x_0 replaced by x_θ^t. However, q(x_{t−1} | x_t, x_θ^t) = q(x_{t−1} | x_t, x_0) holds only if x_θ^t = x_0; this requires the network to make no prediction error about x_0 so that q(x_{t−1} | x_t, x_θ^t) shares the same variance as q(x_{t−1} | x_t, x_0). In practice, x_θ^t − x_0 is non-zero, and we claim that the prediction error of x_0 needs to be considered to derive the real sampling distribution. Following Analytic-DPM (Bao et al., 2022), we model x_θ^t as p_θ(x_0 | x_t) and approximate it by a Gaussian distribution:

$$p_\theta(x_0 \mid x_t) = \mathcal{N}\big(x_\theta^t;\, x_0,\, e_t^2 I\big), \qquad x_\theta^t = x_0 + e_t\,\epsilon_0, \quad \epsilon_0 \sim \mathcal{N}(0, I) \tag{7}$$

Taking the prediction error into account, we now compute q(x̂_t | x_{t+1}, x_θ^{t+1}), which shares the same functional form as q(x_{t−1} | x_t, x_θ^t), by substituting the index t+1 and using x̂_t to highlight that it is generated in the sampling stage. Based on Eq. 2, we know q(x̂_t | x_{t+1}, x_θ^{t+1}) = N(x̂_t; μ_θ(x_{t+1}, t+1), β̃_{t+1} I). Its mean and variance can be further derived according to Eq. 5 and Eq. 4, respectively. Thus, a sample from this distribution is x̂_t = μ_θ(x_{t+1}, t+1) + √β̃_{t+1} ε_1, namely:

$$\hat x_t = \frac{\sqrt{\bar\alpha_t}\,\beta_{t+1}}{1-\bar\alpha_{t+1}}\,x_\theta^{t+1} + \frac{\sqrt{\alpha_{t+1}}\,(1-\bar\alpha_t)}{1-\bar\alpha_{t+1}}\,x_{t+1} + \sqrt{\tilde\beta_{t+1}}\,\epsilon_1 \tag{8}$$

Plugging Eq. 7 and Eq. 1 (with index t+1) into Eq. 8, we derive the final analytical form of x̂_t (see Appendix A.1 for the full derivation):

$$\hat x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t + \Big(\frac{\sqrt{\bar\alpha_t}\,\beta_{t+1}}{1-\bar\alpha_{t+1}}\,e_{t+1}\Big)^2}\;\epsilon_3, \tag{9}$$

where ε_1, ε_3 ∼ N(0, I). From Eq. 9, we can obtain the mean and variance of q(x̂_t | x_{t+1}, x_θ^{t+1}). For simplicity, we denote q(x̂_t | x_{t+1}, x_θ^{t+1}) as q_θ(x̂_t | x_{t+1}) from now on.
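To make this sampling paradigm concrete, the following PyTorch-style sketch performs one ancestral DDPM step: predict ε, invert Eq. 1 to obtain the x_0 estimate x_θ^t, and then sample x_{t−1} from the posterior of Eqs. 2–4 with x_0 replaced by that estimate. This is a minimal illustration rather than the authors' implementation; the names `eps_model`, `alphas`, `alphas_bar` and `betas` (1-D tensors indexed by timestep) are assumed for the example.

```python
import torch

@torch.no_grad()
def ddpm_reverse_step(eps_model, x_t, t, alphas, alphas_bar, betas):
    """One ancestral DDPM step: predict eps, form the x0 estimate, then sample
    from the posterior q(x_{t-1} | x_t, x0_hat) as described in Section 3.1."""
    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
    eps = eps_model(x_t, t_batch)

    # x0 estimate obtained by inverting Eq. (1): x_t = sqrt(a_bar_t) x0 + sqrt(1 - a_bar_t) eps
    x0_hat = (x_t - (1 - alphas_bar[t]).sqrt() * eps) / alphas_bar[t].sqrt()
    if t == 0:
        return x0_hat  # at the final step the posterior mean reduces to the x0 estimate

    # Posterior mean and variance, Eqs. (3)-(4), with x0 replaced by its estimate x0_hat
    a_bar_prev = alphas_bar[t - 1]
    mean = (a_bar_prev.sqrt() * betas[t] / (1 - alphas_bar[t])) * x0_hat \
         + (alphas[t].sqrt() * (1 - a_bar_prev) / (1 - alphas_bar[t])) * x_t
    var = (1 - a_bar_prev) / (1 - alphas_bar[t]) * betas[t]
    return mean + var.sqrt() * torch.randn_like(x_t)
```

Because the posterior is evaluated at x_θ^t rather than the true x_0, the distribution actually sampled from is the q_θ(x̂_t | x_{t+1}) analysed above, not the training-time q(x_t | x_0).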
In Table 1, q(x_t | x_0) shows the x_t seen by the network during training, while q_θ(x̂_t | x_{t+1}) indicates the x̂_t exposed to the network during sampling. The method of solving q_θ(x̂_t | x_{t+1}) can be generalised to multi-step sampling (detailed in Appendix A.2). In the same spirit of modelling x_θ^t as a Gaussian, we also derive the sampling distribution q_θ(x̂_t | x_{t+1}) for DDIM (Song et al., 2021a) in Appendix A.3. Note that one can also model x_θ^t as a Gaussian distribution whose mean is not x_0, but this does not affect the variance gap presented in Table 1 or the explosion of the variance error (Section 3.2).

Table 1: The distribution q(x_t | x_0) during training and q_θ(x̂_t | x_{t+1}) during DDPM sampling.

| Distribution | Mean | Variance |
|---|---|---|
| q(x_t ∣ x_0) | √ᾱ_t x_0 | (1 − ᾱ_t) I |
| q_θ(x̂_t ∣ x_{t+1}) | √ᾱ_t x_0 | (1 − ᾱ_t + (√ᾱ_t β_{t+1} / (1 − ᾱ_{t+1}) · e_{t+1})²) I |

3.2 EXPOSURE BIAS DUE TO PREDICTION ERROR

It is clear from Table 1 that the variance of the sampling distribution q_θ(x̂_t | x_{t+1}) is always larger than the variance of the training distribution q(x_t | x_0), by the magnitude (√ᾱ_t β_{t+1} / (1 − ᾱ_{t+1}) · e_{t+1})². Note that this variance gap between training and sampling is produced in just a single reverse diffusion step, given that the network ε_θ(·) has access to the ground truth input x_{t+1}. What makes the situation worse is that the error of single-step sampling accumulates over multi-step sampling, resulting in an explosion of the sampling variance error. On CIFAR-10 (Krizhevsky et al., 2009), we designed an experiment to statistically measure both the single-step variance error of q_θ(x̂_t | x_{t+1}) and the multi-step variance error of q_θ(x̂_t | x_T) using 20-step sampling (see Appendix A.4). The experimental results in Fig. 1 indicate that the closer to t = 1 (the end of sampling), the larger the variance error of multi-step sampling. The explosion of the sampling variance error results in the sampling drift (exposure bias) problem, and we identify the prediction error x_θ^t − x_0 as the root cause of the exposure bias in diffusion models.

Figure 1: Variance error in single-step and multi-step sampling.

Intuitively, a possible solution to exposure bias is to use a sampling noise variance smaller than β̃_t, in order to counteract the extra variance term (√ᾱ_t β_{t+1} / (1 − ᾱ_{t+1}) · e_{t+1})² caused by the prediction error x_θ^t − x_0. Unfortunately, β̃_t is already the lower bound of the sampling noise schedule σ_t² ∈ [β̃_t, β_t], where the lower and upper bounds are the sampling variances obtained when q(x_0) is a delta function and an isotropic Gaussian, respectively (Nichol & Dhariwal, 2021). Therefore, we conclude that the exposure bias problem cannot be alleviated by manipulating the sampling noise schedule. Interestingly, Bao et al. (2022) analytically provide the optimal sampling noise schedule β*_t, which is larger than the lower bound β̃_t. Based on the discussion above, β*_t would cause a more severe exposure bias issue than β̃_t. A strange phenomenon, not explained by Bao et al. (2022), is that β*_t leads to a worse FID than β̃_t under 1000 sampling steps. We believe exposure bias accounts for this phenomenon: under long sampling chains, the negative impact of exposure bias exceeds the positive gain of the optimal variance β*_t.
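The size of the single-step variance gap can be inspected directly from the closed form in Table 1. The short sketch below evaluates the coefficient (√ᾱ_t β_{t+1} / (1 − ᾱ_{t+1}))² that multiplies the squared prediction error e²_{t+1}, assuming the standard linear β schedule of Ho et al. (2020); it only illustrates the analytic coefficient and says nothing about the magnitude of the network error itself.

```python
import numpy as np

# Standard linear DDPM schedule (Ho et al., 2020): beta from 1e-4 to 0.02 over T = 1000 steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

# Coefficient multiplying e_{t+1}^2 in the variance of q_theta(x_hat_t | x_{t+1}) (Table 1),
# printed next to the training variance 1 - a_bar_t for a few timesteps.
t = np.arange(T - 1)
coef = (np.sqrt(alphas_bar[t]) * betas[t + 1] / (1.0 - alphas_bar[t + 1])) ** 2

for step in (1, 10, 100, 500, 900):
    print(f"t={step:4d}  gap coefficient={coef[step]:.3e}  training variance={1 - alphas_bar[step]:.3e}")
```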
3.3 METRIC FOR EXPOSURE BIAS

Although some literature has already discussed the exposure bias problem in diffusion models (Ning et al., 2023; Li et al., 2023a), we still lack a well-defined and straightforward metric for this concept. We propose to use the variance error of q_θ(x̂_t | x_T) to quantify the exposure bias at timestep t under multi-step sampling. Specifically, our metric δ_t for exposure bias is defined as

$$\delta_t = \big(\sqrt{\hat\beta_t} - \sqrt{\bar\beta_t}\big)^2,$$

where β̄_t = 1 − ᾱ_t denotes the variance of q(x_t | x_0) during training and β̂_t denotes the variance of q_θ(x̂_t | x_T) measured in the regular sampling process. The metric δ_t is inspired by the Fréchet distance (Dowson & Landau, 1982) between q(x_t | x_0) and q_θ(x̂_t | x_T), which is d² = N(√β̂_t − √β̄_t)², where N is the dimensionality of x_t. In Appendix A.6, we empirically find that δ_t exhibits a strong correlation with FID given a trained model. Our method of measuring δ_t is described in Algorithm 3 (see Appendix A.5). The key step of Algorithm 3 is that we subtract the mean √ᾱ_{t−1} x_0, so that the remaining term x̂_{t−1} − √ᾱ_{t−1} x_0 corresponds to the stochastic term of q_θ(x̂_{t−1} | x_T).
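A compact way to read Algorithm 3 is the following single-batch PyTorch-style sketch: run the regular sampling chain starting from an x_T built on real data, subtract the mean √ᾱ_t x_0 at every level, and compare the variance of the residual against the training variance 1 − ᾱ_t. The helper names, the 0-indexed timesteps and the choice σ_t² = β_t for the sampling noise are assumptions made for illustration (the paper estimates the variance over 50k samples rather than one batch).

```python
import torch

@torch.no_grad()
def exposure_bias_delta(eps_model, x0, alphas_bar, betas):
    """Monte Carlo sketch of delta_t: sample with the regular DDPM chain from an x_T built on
    real data x0, and compare the measured variance with the training variance at each level."""
    T = len(betas)
    # Noise the real data up to the last level, as in Eq. (1)
    x = alphas_bar[-1].sqrt() * x0 + (1 - alphas_bar[-1]).sqrt() * torch.randn_like(x0)
    delta = {}
    for t in range(T - 1, 0, -1):
        t_batch = torch.full((x.shape[0],), t, device=x.device, dtype=torch.long)
        eps = eps_model(x, t_batch)
        mean = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / (1 - betas[t]).sqrt()
        x = mean + betas[t].sqrt() * torch.randn_like(x)       # x is now x_hat_{t-1}
        residual = x - alphas_bar[t - 1].sqrt() * x0           # stochastic term of q_theta(x_hat_{t-1} | x_T)
        var_hat = residual.var()                               # measured sampling variance
        var_train = 1 - alphas_bar[t - 1]                      # training variance of q(x_{t-1} | x0)
        delta[t - 1] = (var_hat.sqrt() - var_train.sqrt()) ** 2
    return delta
```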
3.4 SOLUTION DISCUSSION

We now discuss possible solutions to the exposure bias issue of diffusion models based on the analysis throughout Section 3. Recall that the prediction error x_θ^t − x_0 is the root cause of exposure bias. Thus, the most straightforward way of reducing exposure bias is to learn a more accurate ε or score-function (Song & Ermon, 2019) prediction network. For example, by delicately designing the network and hyper-parameters, EDM (Karras et al., 2022) significantly improves the FID from 3.01 to 2.51 on CIFAR-10. Secondly, we believe that data augmentation can reduce the risk of learning an inaccurate ε or score function for x̂_t by learning a denser vector field than vanilla diffusion models. For instance, Karras et al. (2022) have shown that geometric augmentation (Karras et al., 2020) benefits network training and sample quality. In the same spirit, Ning et al. (2023) augment each training sample x_t by a Gaussian term and achieve substantial improvements in FID. Additionally, Xu et al. (2022) claim that the Poisson Flow framework is more resistant to prediction errors because high mass is allocated across a wide spectrum of training sample norms. Our experiments verify the robustness of PFGM++ (Xu et al., 2023b) and show that our solution to exposure bias can further improve the generation quality of PFGM++ (more details in Appendix A.7). It is worth pointing out that the above-mentioned methods require retraining the network and expensive parameter searching during training. This naturally drives us to the question: can we alleviate the exposure bias in the sampling stage, without any retraining?

4.1 EPSILON SCALING

In Section 3.2, we concluded that the exposure bias issue cannot be solved by reducing the sampling noise variance, so another direction to explore in the sampling phase is the prediction of the network ε_θ(·). Since we already know from Table 1 that the x_t fed to ε_θ(·) in training differs from the x̂_t fed to ε_θ(·) in sampling, we are interested in understanding the difference in the output of ε_θ(·) between training and inference.

Figure 2: Expectation of ∥ε_θ(·)∥₂ during training and 20-step sampling on CIFAR-10. We report the L2-norm using 50k samples at each timestep.

For simplicity, we denote the output of ε_θ(·) as ε_θ^t in training and as ε_θ^s in sampling. Although the ground truth of ε_θ^s is not accessible during inference, we are still able to speculate about the behaviour of ε_θ^s from the L2-norm perspective. In Fig. 2, we plot the L2-norm of ε_θ^t and ε_θ^s at each timestep. In detail, given a trained, frozen model and the ground truth x_t, ε_θ^t is collected by ε_θ^t = ε_θ(x_t, t). In this way, we simulate the training stage and analyse its ε prediction. In contrast, ε_θ^s is gathered in the real sampling process, namely ε_θ^s = ε_θ(x̂_t, t). It is clear from Fig. 2 that the L2-norm of ε_θ^s is always larger than that of ε_θ^t. Since x̂_t lies around x_t with a larger variance (Section 3.1), we infer that the network learns an inaccurate vector field ε_θ(x̂_t, t) for each x̂_t in the vicinity of x_t, with a vector length longer than that of ε_θ(x_t, t). One can infer that the prediction ε_θ^s could be improved if we can move the input (x̂_t, t) from the inaccurate vector field (green curve in Fig. 2) towards the reliable vector field (red curve in Fig. 2).

To this end, we propose to scale down ε_θ^s by a factor λ(t) at sampling timestep t. Our solution is based on the following observation: ε_θ^t and ε_θ^s share the same input x_T ∼ N(0, I) at timestep t = T, but from timestep T−1 onwards, x̂_t (the input of ε_θ^s) starts to diverge from x_t (the input of ε_θ^t) due to the ε_θ(·) error made at the previous timestep. This iterative process continues along the sampling chain and results in exposure bias. Therefore, we can push x̂_t closer to x_t by scaling down the over-predicted magnitude of ε_θ^s. Compared with the regular sampling (Eq. 6), our sampling method only differs in the λ_t term and is expressed as

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\frac{\epsilon_\theta(x_t, t)}{\lambda_t}\Big).$$

Note that Epsilon Scaling, serving as a plug-in method, adds no computational load to the original sampling of diffusion models.

4.2 THE DESIGN OF THE SCALING SCHEDULE

Similar to the cumulative error analysed in Li & van der Schaar (2023), we emphasise that the L2-norm quotient ∥ε_θ^s∥₂ / ∥ε_θ^t∥₂, denoted as N(t), reflects the accumulated effect of the prediction errors made at the ancestral steps T, T−1, ..., t+1. Suppose the scale to be applied to ε_θ at timestep t is λ_t; then, given that the error propagates linearly over the sampling chain, we have

$$N(t) - 1 = \int_t^{T} (\lambda_T - 1)\,\mathrm{d}t + \int_t^{T-1} (\lambda_{T-1} - 1)\,\mathrm{d}t + \cdots + \int_t^{t+1} (\lambda_{t+1} - 1)\,\mathrm{d}t. \tag{10}$$

The first term on the right side of Eq. 10 corresponds to the error made at timestep T (the start of sampling) and propagated from T to t. Thus, after measuring N(t), one can derive the scaling schedule λ(t) by solving Eq. 10.

Figure 3: N(t) at each timestep under 20-step and 50-step sampling.

As shown by Nichol & Dhariwal (2021) and Benny & Wolf (2022), the ε_θ(·) predictions near t = 0 are very poor, with the loss larger than at other timesteps by several orders of magnitude. Therefore, we can ignore the area close to t = 0 when fitting N(t), because scaling a problematic ε_θ(·) does not lead to a better prediction. We plot the N(t) curve for 20-step and 50-step sampling on CIFAR-10 in Fig. 3, where N(t) can be fitted by a quadratic function in the interval t ∈ (5, T). Thus, the solution to Eq. 10 is a linear function λ(t) = kt + b, where k and b are constants. Ideally, one should first run simulations to measure ∥ε_θ^s∥₂ and ∥ε_θ^t∥₂ and then solve for λ(t) based on N(t). However, we propose to search for k and b directly, because we find that these parameters do not change significantly across networks. An added benefit of this proposal is that our approach becomes simulation-free. Moreover, in Section 5.1, we will see that k decays to roughly 0 around 50-step sampling.
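The scaled update is a one-line change to the standard sampler. The sketch below shows one DDPM step under Epsilon Scaling together with the linear schedule λ(t) = kt + b; setting k = 0 recovers the uniform schedule discussed next. This is a minimal illustration, not the authors' code: the helper names are assumed, σ_t² = β_t is one common variance choice, and b = 1.017 is simply the CIFAR-10 ADM-ES value listed in Appendix A.10.

```python
import torch

def lambda_schedule(t, k=0.0, b=1.017):
    """Scaling schedule lambda(t) = k * t + b; k = 0 gives the uniform schedule."""
    return k * t + b

@torch.no_grad()
def ddpm_step_epsilon_scaling(eps_model, x_t, t, alphas_bar, betas, k=0.0, b=1.017):
    """One ancestral DDPM step in which the network output is divided by lambda(t),
    i.e. Eq. (6) with eps_theta(x_t, t) replaced by eps_theta(x_t, t) / lambda_t."""
    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
    eps = eps_model(x_t, t_batch) / lambda_schedule(t, k, b)        # scaled-down prediction
    mean = (x_t - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / (1 - betas[t]).sqrt()
    if t == 0:
        return mean
    return mean + betas[t].sqrt() * torch.randn_like(x_t)           # sigma_t^2 = beta_t assumed
```

Since λ(t) only rescales a tensor that is computed anyway, the step costs exactly as much as regular sampling.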
Given that k decays towards 0, we opt for a uniform λ(t) = b for most experiments because of its effortless parameter search and near-optimal performance.

5 EXPERIMENTS

In this section, we evaluate the performance of Epsilon Scaling using FID (Heusel et al., 2017). To demonstrate that Epsilon Scaling is a generic solution to exposure bias, we test the approach on various diffusion frameworks, samplers and conditional settings. Following the fast sampling paradigm (Karras et al., 2022) in the diffusion community, we focus on T ≤ 100 sampling steps in this section and leave the FID results for T > 100 to Appendix A.8. Our FID computation is consistent with Dhariwal & Nichol (2021) for fair comparison. All FIDs are reported using 50k generated samples and the full training set as the reference batch. Lastly, Epsilon Scaling does not affect precision and recall, and we report these results in Appendix A.9.

5.1 MAIN RESULTS ON ADM

Since Epsilon Scaling is a training-free method, we utilise the pre-trained ADM model as the baseline and compare it against our ADM-ES (ADM with Epsilon Scaling) on CIFAR-10 (Krizhevsky et al., 2009), LSUN tower (Yu et al., 2015) and FFHQ (Karras et al., 2019) for unconditional generation, and on ImageNet 64×64 and ImageNet 128×128 (Chrabaszcz et al., 2017) for class-conditional generation. We employ the respacing sampling technique (Nichol & Dhariwal, 2021) to enable fast stochastic sampling.

Table 2: FID on the ADM baseline. We compare ADM with our ADM-ES (uniform λ(t)) and ADM-ES (linear λ(t)). ImageNet 64×64 results are reported without classifier guidance and ImageNet 128×128 is under classifier guidance with scale 0.5.

| T | Model | CIFAR-10 32×32 (uncond) | LSUN 64×64 (uncond) | FFHQ 128×128 (uncond) | ImageNet 64×64 (cond) | ImageNet 128×128 (cond) |
|---|---|---|---|---|---|---|
| 100 | ADM | 3.37 | 3.59 | 14.52 | 2.71 | 3.55 |
| 100 | ADM-ES | 2.17 | 2.91 | 6.77 | 2.39 | 3.37 |
| 50 | ADM | 4.43 | 7.28 | 26.15 | 3.75 | 5.15 |
| 50 | ADM-ES | 2.49 | 3.68 | 9.50 | 3.07 | 4.33 |
| 20 | ADM | 10.36 | 23.92 | 59.35 | 10.96 | 12.48 |
| 20 | ADM-ES | 5.15 | 8.22 | 26.14 | 7.52 | 9.95 |
| 20 | ADM-ES (linear λ(t)) | 4.31 | 7.60 | 24.83 | 7.37 | 9.86 |

Table 3: Comparison of ADM-ES with recent stochastic diffusion (SDE) samplers in terms of FID. We report the best FID achieved under T sampling steps.

| Model | T | Unconditional CIFAR-10 32×32 |
|---|---|---|
| EDM (VP) (Karras et al., 2022) | 511 | 2.27 |
| EDM (VE) (Karras et al., 2022) | 2047 | 2.23 |
| Improved SDE (Karras et al., 2022) | 1023 | 2.35 |
| Restart (VP) (Xu et al., 2023a) | 115 | 2.21 |
| SA-Solver (Xue et al., 2023) | 95 | 2.63 |
| ADM-IP (Ning et al., 2023) | 100 | 2.38 |
| ADM-ES (ours) | 50 | 2.49 |
| ADM-ES (ours) | 100 | 2.17 |

Table 2 shows that, independent of the dataset and the number of sampling steps T, our ADM-ES outperforms ADM by a large margin in terms of FID. For instance, on FFHQ 128×128, ADM-ES exhibits less than half the FID of ADM, with 7.75, 16.65 and 34.52 FID improvements under 100, 50 and 20 sampling steps, respectively. Moreover, when compared with the previous best stochastic samplers, ADM-ES outperforms EDM (Karras et al., 2022), Improved SDE (Karras et al., 2022), the Restart sampler (Xu et al., 2023a) and SA-Solver (Xue et al., 2023), making it a state-of-the-art stochastic sampler (SDE solver). For example, ADM-ES not only achieves a better FID (2.17) than EDM and Improved SDE, but also accelerates sampling by 5- to 20-fold (see Table 3). Even under 50-step sampling, Epsilon Scaling surpasses SA-Solver and obtains competitive FID against the other samplers. Note that ADM-ES uses the uniform schedule λ(t) = b, while ADM-ES (linear) applies the linear schedule λ(t) = kt + b in Table 2.
We find that the slope k approaches 0 as the number of sampling steps T increases. Therefore, we suggest the uniform schedule λ(t) for practical use. We present the complete parameters k, b used in all experiments and the details of the search for k, b in Appendix A.10. Overall, searching for the optimal uniform λ(t) is effortless and takes 6 to 10 trials. In Appendix A.11, we also demonstrate that the FID gain can be achieved within a wide range of λ(t), which indicates the insensitivity of λ(t).

5.2 EPSILON SCALING ALLEVIATES EXPOSURE BIAS

Apart from the FID improvements, we now show the reduction in exposure bias achieved by our method using the proposed metric δ_t, and we also demonstrate the sampling trajectory corrected by Epsilon Scaling. Using Algorithm 3, we measure δ_t on the CIFAR-10 dataset under 20-step sampling for the ADM and ADM-ES models. Fig. 4 shows that ADM-ES obtains a lower δ_t at the end of sampling (t = 1) than the baseline ADM, exhibiting a smaller variance error and sampling drift (see Appendix A.12 for results on other datasets). Based on Fig. 2, we apply the same method to measure the L2-norm of ε_θ(·) in the sampling phase with Epsilon Scaling. Fig. 5 indicates that our method explicitly moves the original sampling trajectory closer to the vector field learned in the training phase, under the condition that ∥ε_θ(x_t)∥₂ is locally monotonic around x_t. This condition is satisfied in denoising networks (Goodfellow et al., 2016; Song & Ermon, 2019) because of the monotonic score vectors around the local maximum of the probability density. We emphasise that Epsilon Scaling corrects the magnitude error of ε_θ(·), but not the direction error. Thus we cannot completely eliminate the exposure bias to achieve δ_t = 0 or push the sampling trajectory onto the exact training vector field.

Figure 4: Exposure bias measured by δ_t on LSUN 64×64. Epsilon Scaling achieves a smaller δ_t at the end of sampling (t = 1).

Figure 5: ∥ε_θ(·)∥₂ on LSUN 64×64. After applying Epsilon Scaling, the sampling ∥ε_θ∥₂ (blue) gets closer to the training ∥ε_θ∥₂ (red).

5.3 RESULTS ON DDIM/DDPM

To show the generality of our proposed method, we conduct experiments on the DDIM/DDPM framework across the CIFAR-10 and CelebA 64×64 (Liu et al., 2015) datasets. The results are detailed in Table 4, wherein the designations η = 0 and η = 1 correspond to the DDIM and DDPM samplers, respectively. The findings in Table 4 illustrate that our method can further boost the performance of both the DDIM and DDPM samplers on the CIFAR-10 and CelebA datasets. Specifically, our proposed Epsilon Scaling technique improves the performance of the DDPM sampler on the CelebA dataset by 47.7%, 63.1% and 60.7% with 20, 50 and 100 sampling steps, respectively. Similar performance improvements can also be observed on the CIFAR-10 dataset. We also notice that our method brings smaller performance improvements for the DDIM sampler. This could arise from the FID advantage of deterministic sampling under a short sampling chain, and from the fact that the noise term in the DDPM sampler can actively correct errors made in earlier sampling steps (Karras et al., 2022).
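For the DDIM sampler, the same idea applies to the deterministic update of Eq. 25 with σ_t = 0. The sketch below divides the noise prediction by λ(t) and reuses the scaled prediction both in the x_0 estimate and in the direction term; this is our reading of how Epsilon Scaling transfers to DDIM rather than the authors' exact implementation, and the default λ = 1.003 is only the CelebA 100-step example from Appendix A.10.

```python
import torch

@torch.no_grad()
def ddim_step_epsilon_scaling(eps_model, x_t, t, t_prev, alphas_bar, lam=1.003):
    """One deterministic DDIM step (eta = 0) with the noise prediction divided by lambda(t)."""
    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
    eps = eps_model(x_t, t_batch) / lam                             # scaled-down prediction
    a_t, a_prev = alphas_bar[t], alphas_bar[t_prev]
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()           # predicted x_0 (Eq. 1 inverted)
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps      # Eq. 25 with sigma_t = 0
```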
Table 4: FID on the DDIM baseline for unconditional generation.

| T | Model | CIFAR-10 32×32 (η=0) | CIFAR-10 32×32 (η=1) | CelebA 64×64 (η=0) | CelebA 64×64 (η=1) |
|---|---|---|---|---|---|
| 100 | DDIM | 4.06 | 6.73 | 5.67 | 11.33 |
| 100 | DDIM-ES (ours) | 3.38 | 4.01 | 5.05 | 4.45 |
| 50 | DDIM | 4.82 | 10.29 | 6.88 | 15.09 |
| 50 | DDIM-ES | 4.17 | 4.57 | 6.20 | 5.57 |
| 20 | DDIM | 8.21 | 20.15 | 10.43 | 22.61 |
| 20 | DDIM-ES | 6.54 | 7.78 | 10.38 | 11.83 |

Table 5: FID on the EDM baseline on the CIFAR-10 dataset (FID of EDM is reproduced).

| T | Model | Unconditional (Heun) | Unconditional (Euler) | Conditional (Heun) | Conditional (Euler) |
|---|---|---|---|---|---|
| 35 | EDM | 1.97 | 3.81 | 1.82 | 3.74 |
| 35 | EDM-ES (ours) | 1.95 | 2.80 | 1.80 | 2.59 |
| 21 | EDM | 2.33 | 6.29 | 2.17 | 5.91 |
| 21 | EDM-ES | 2.24 | 4.32 | 2.08 | 3.74 |
| 13 | EDM | 7.16 | 12.28 | 6.69 | 10.66 |
| 13 | EDM-ES | 6.54 | 8.39 | 6.16 | 6.59 |

5.4 RESULTS ON EDM

We test the effectiveness of Epsilon Scaling on EDM (Karras et al., 2022) because it achieves state-of-the-art image generation under a few sampling steps and provides a unified framework for diffusion models. Since the main advantage of EDM is its Ordinary Differential Equation (ODE) solver, we evaluate Epsilon Scaling using their Heun 2nd order ODE solver (Ascher & Petzold, 1998) and the Euler 1st order ODE solver, respectively. Although the network output of EDM is not ε, we can still extract the signal ε at each sampling step and then apply Epsilon Scaling to it. The experiments are implemented on the CIFAR-10 dataset and we report the FID results in Table 5 using the VP framework. The sampling step T in Table 5 is equivalent to the Neural Function Evaluations (NFE) used in the EDM paper.

Similar to the results on ADM and DDIM, Epsilon Scaling gains consistent FID improvements over the EDM baseline regardless of the conditional setting and the ODE solver type. For instance, EDM-ES improves the FID from 3.81 to 2.80 and from 3.74 to 2.59 in the unconditional and conditional groups using the 35-step Euler sampler. An interesting phenomenon in Table 5 is that the FID gain of Epsilon Scaling in the Euler sampler group is larger than that in the Heun sampler group. We believe two factors account for this phenomenon. On the one hand, higher-order ODE solvers (for example, Heun solvers) introduce less truncation error than Euler 1st order solvers. On the other hand, the correction steps in the Heun solver reduce the exposure bias by pulling the drifted sampling trajectory back to the accurate vector field. We illustrate these two factors through Fig. 6, which is plotted using the same method as Fig. 2. It is apparent from Fig. 6 that the Heun sampler exhibits a smaller gap between the training trajectory and the sampling trajectory when compared with the Euler sampler. This corresponds to the truncation error factor of these two ODE solvers. Furthermore, in the Heun 2nd order sampler, the prediction error (the cause of exposure bias) made in each Euler step is corrected in the subsequent correction step (Fig. 6(b)), resulting in a reduced exposure bias. This exposure bias perspective explains the superiority of the Heun solver in diffusion models beyond the truncation error viewpoint.
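Since EDM's denoiser D(x, σ) predicts the clean image rather than ε, the noise signal has to be recovered before scaling. The sketch below shows one Euler ODE step where ε is extracted as (x − D(x, σ))/σ and then divided by λ; the function names and the default λ = 1.0034 (the 35-step unconditional Euler value in Appendix A.10) are illustrative assumptions, not the authors' implementation.

```python
import torch

@torch.no_grad()
def edm_euler_step_epsilon_scaling(denoiser, x, sigma, sigma_next, lam=1.0034):
    """One Euler ODE step in EDM where the extracted epsilon is divided by lambda
    before being used as the step direction."""
    x0_pred = denoiser(x, sigma)            # D(x, sigma): EDM's denoiser predicts the clean image
    eps = (x - x0_pred) / sigma             # extract the epsilon signal from the denoiser output
    d = eps / lam                           # Epsilon Scaling applied to the extracted epsilon
    return x + (sigma_next - sigma) * d     # Euler step for dx/dsigma = (x - D(x, sigma)) / sigma
```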
Figure 6: ∥ε_θ(·)∥₂ during training and sampling on CIFAR-10, for (a) the EDM Euler 1st order sampler and (b) the EDM Heun 2nd order sampler. We use 21-step sampling and report the L2-norm using 50k samples at each timestep. The sampling proceeds from right to left in the figures.

5.5 RESULTS ON LDM

To further verify the generality of Epsilon Scaling, we adopt the Latent Diffusion Model (LDM) as the base model, which introduces an autoencoder and performs the diffusion process in the latent space (Rombach et al., 2022). We test the performance of Epsilon Scaling (LDM-ES) on the FFHQ 256×256 and CelebA-HQ 256×256 datasets using the T-step DDPM sampler. It is clear from Table 6 that Epsilon Scaling gains substantial FID improvements on the two high-resolution datasets, where LDM-ES achieves 15.68 FID under T = 20 on CelebA-HQ, almost half that of LDM. Epsilon Scaling also yields better FID under 50 and 100 sampling steps on CelebA-HQ, with 7.36 FID at T = 100. Similar FID improvements are obtained on the FFHQ dataset over different T. Finally, Epsilon Scaling is also effective on DiT (Peebles & Xie, 2023), which applies the ViT (Dosovitskiy et al., 2020) diffusion backbone in LDM. Please refer to Appendix A.13 for the FID results on the DiT baseline.

Table 6: FID on the LDM baseline using unconditional DDPM sampling.

| T | Model | FFHQ 256×256 | CelebA-HQ 256×256 |
|---|---|---|---|
| 100 | LDM | 10.90 | 9.31 |
| 100 | LDM-ES (ours) | 9.83 | 7.36 |
| 50 | LDM | 14.34 | 13.95 |
| 50 | LDM-ES | 11.57 | 9.16 |
| 20 | LDM | 33.13 | 29.62 |
| 20 | LDM-ES | 20.91 | 15.68 |

5.6 QUALITATIVE COMPARISON

Figure 7: Qualitative comparison between ADM (first row) and ADM-ES (second row).

In order to visually show the effect of Epsilon Scaling on image synthesis, we set the same random seed for the base model and our Epsilon Scaling model in the sampling phase to ensure a similar trajectory for both models. Fig. 7 displays samples generated using 100 steps on the FFHQ 128×128 dataset. It is clear that ADM-ES effectively fixes the sample issues of ADM, including overexposure, underexposure, coarse backgrounds and detail defects, from left to right in Fig. 7 (see Appendix A.14 for more qualitative comparisons). Besides, the qualitative comparison also empirically confirms that Epsilon Scaling guides the sampling trajectory of the base model to an adjacent but better probability path, because both models reach the same or similar modes given the common starting point x_T and the same random seed at each sampling step.

6 CONCLUSIONS

In this paper, we elucidate the exposure bias issue in diffusion models by analytically showing the difference between the training distribution and the sampling distribution. Moreover, we propose a training-free method to refine the deficient sampling trajectory by explicitly scaling the prediction vector. Through extensive experiments, we demonstrate that Epsilon Scaling is a generic solution to exposure bias and that its simplicity enables a wide range of diffusion applications.

REFERENCES

Uri M. Ascher and Linda R. Petzold. Computer methods for ordinary differential equations and differential-algebraic equations, volume 61. SIAM, 1998.

Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-DPM: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. ICLR, 2022.

Yaniv Benny and Lior Wolf. Dynamic dual-output diffusion models. In CVPR, pp. 11482–11491, 2022.

Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating gradients for waveform generation. ICLR, 2021.

Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. ILVR: Conditioning method for denoising diffusion probabilistic models. In ICCV, pp. 14367–14376, 2021.

Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of ImageNet as an alternative to the CIFAR datasets. arXiv:1707.08819, 2017.

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. NeurIPS, 34:8780–8794, 2021.
Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. ar Xiv preprint ar Xiv:1410.8516, 2014. Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. ICLR, 2017. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. ar Xiv preprint ar Xiv:2010.11929, 2020. DC Dowson and BV666017 Landau. The fr echet distance between multivariate normal distributions. Journal of multivariate analysis, 12(3):450 455, 1982. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Neur IPS, 27, 2014. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http: //www.deeplearningbook.org. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Neur IPS, 30, 2017. Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. ar Xiv preprint ar Xiv:2207.12598, 2022. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Neur IPS, 33: 6840 6851, 2020. Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. ar Xiv preprint ar Xiv:2210.02303, 2022. Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, pp. 4401 4410, 2019. Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. Neur IPS, 33:12104 12114, 2020. Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusionbased generative models. Neur IPS, 35:26565 26577, 2022. Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ICLR, 2014. Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diff Wave: A versatile diffusion model for audio synthesis. In ICLR, 2021. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. Published as a conference paper at ICLR 2024 Tuomas Kynk a anniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. Neur IPS, 32, 2019. Mingxiao Li, Tingyu Qu, Wei Sun, and Marie-Francine Moens. Alleviating exposure bias in diffusion models through sampling with shifted time steps. ar Xiv preprint ar Xiv:2305.15583, 2023a. Yangming Li and Mihaela van der Schaar. On error propagation of diffusion models. In ICLR, 2023. Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In CVPR, pp. 22511 22521, 2023b. Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. In ICLR, 2022. URL https://openreview.net/forum?id= Pl KWVd2y Bk Y. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, December 2015. Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. 
Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Neur IPS, 35:5775 5787, 2022. Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, pp. 8162 8171. PMLR, 2021. Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, pp. 16784 16804. PMLR, 2022. Mang Ning, Enver Sangineto, Angelo Porrello, Simone Calderara, and Rita Cucchiara. Input perturbation reduces exposure bias in diffusion models. ICML, 2023. William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pp. 4195 4205, 2023. Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical textconditional image generation with CLIP latents. ar Xiv:2204.06125, 2022. Marc Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016. Kashif Rasul, Calvin Seward, Ingmar Schuster, and Roland Vollgraf. Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting. In ICML, 2021. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj orn Ommer. Highresolution image synthesis with latent diffusion models. In CVPR, pp. 10684 10695, 2022. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pp. 234 241. Springer, 2015. Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Neur IPS, 35:36479 36494, 2022. Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In ICLR, 2022. Florian Schmidt. Generalization in generation: A closer look at exposure bias. In Proceedings of the 3rd Workshop on Neural Generation and Translation@EMNLP-IJCNLP, 2019. Published as a conference paper at ICLR 2024 Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015. Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021a. Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Neur IPS, 32, 2019. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021b. Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. In ICLR, 2022. Yilun Xu, Ziming Liu, Max Tegmark, and Tommi Jaakkola. Poisson flow generative models. Neur IPS, 35:16782 16795, 2022. Yilun Xu, Mingyang Deng, Xiang Cheng, Yonglong Tian, Ziming Liu, and Tommi Jaakkola. Restart sampling for improving generative processes. ar Xiv preprint ar Xiv:2306.14878, 2023a. Yilun Xu, Ziming Liu, Yonglong Tian, Shangyuan Tong, Max Tegmark, and Tommi Jaakkola. Pfgm++: Unlocking the potential of physics-inspired generative models. In ICML, pp. 38566 38591. 
PMLR, 2023b. Shuchen Xue, Mingyang Yi, Weijian Luo, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhi-Ming Ma. Sa-solver: Stochastic adams solver for fast sampling of diffusion models. ar Xiv preprint ar Xiv:2309.05019, 2023. Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. ar Xiv preprint ar Xiv:1506.03365, 2015. Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. ar Xiv preprint ar Xiv:2302.05543, 2023. Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. Unipc: A unified predictorcorrector framework for fast sampling of diffusion models. ar Xiv preprint ar Xiv:2302.04867, 2023. Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, and Xi Li. Layoutdiffusion: Controllable diffusion model for layout-to-image generation. In CVPR, pp. 22490 22499, 2023. Linqi Zhou, Yilun Du, and Jiajun Wu. 3d shape generation and completion through point-voxel diffusion. In ICCV, pp. 5826 5835, 2021. Published as a conference paper at ICLR 2024 A.1 DERIVATION OF qθ(ˆxt|xt+1) FOR DDPM We show the full derivation of Eq. 3 below. From Eq. 11 to Eq. 12, we plug in xt+1 θ = x0 + et+1ϵ0 (Eq. 7) and xt+1 = αt+1x0 + 1 αt+1ϵ (Eq. 1), thus a sample from qθ(ˆxt|xt+1) is: ˆxt = µθ(xt+1, t + 1) + q = αtβt+1 1 αt+1 xt+1 θ + αt+1(1 αt) 1 αt+1 xt+1 + q βt+1ϵ1 (11) = αtβt+1 1 αt+1 (x0 + et+1ϵ0) + αt+1(1 αt) 1 αt+1 ( αt+1x0 + p 1 αt+1ϵ) + q βt+1ϵ1 (12) = αtβt+1 1 αt+1 x0 + αt+1(1 αt) αt+1x0 + αtβt+1 1 αt+1 et+1ϵ0 + αt+1(1 αt) 1 αt+1ϵ + q = αtβt+1 + αt+1(1 αt) αt+1 1 αt+1 x0 + αtβt+1 1 αt+1 et+1ϵ0 + αt+1(1 αt) 1 αt+1ϵ + q = αt(1 αt+1) + αt+1(1 αt) αt+1 1 αt+1 x0 + αtβt+1 1 αt+1 et+1ϵ0 + αt+1(1 αt) 1 αt+1ϵ + q = αt(1 αt+1) + αt+1(1 αt) αt 1 αt+1 x0 + αtβt+1 1 αt+1 et+1ϵ0 + αt+1(1 αt) 1 αt+1ϵ + q = αt(1 αt+1 + αt+1 αt+1) 1 αt+1 x0 + αtβt+1 1 αt+1 et+1ϵ0 + αt+1(1 αt) 1 αt+1ϵ + q = αtx0 + αtβt+1 1 αt+1 et+1ϵ0 + αt+1(1 αt) 1 αt+1ϵ + q βt+1ϵ1 (13) where, ϵ0,ϵ,ϵ1 N(0,I). From Eq. 13, we know that the mean of qθ(ˆxt|xt+1) is αtx0. We now focus on the variance by looking at αtβt+1 1 αt+1 et+1ϵ0 + αt+1(1 αt) 1 αt+1 1 αt+1ϵ + q V ar(ˆxt) = ( αtβt+1 1 αt+1 et+1)2 + ( αt+1(1 αt) 1 αt+1)2 + βt+1 (14) = ( αtβt+1 1 αt+1 et+1)2 + ( αt+1(1 αt) 1 αt+1)2 + (1 αt)(1 αt+1) = ( αtβt+1 1 αt+1 et+1)2 + αt+1(1 αt)2 1 αt+1 + (1 αt)(1 αt+1) = ( αtβt+1 1 αt+1 et+1)2 + αt+1(1 αt)2 + (1 αt)(1 αt+1) = ( αtβt+1 1 αt+1 et+1)2 + (1 αt)[αt+1(1 αt) + (1 αt+1)] = ( αtβt+1 1 αt+1 et+1)2 + (1 αt)[αt+1 αt+1 + 1 αt+1] = ( αtβt+1 1 αt+1 et+1)2 + (1 αt)[1 αt+1] = ( αtβt+1 1 αt+1 et+1)2 + 1 αt (15) Published as a conference paper at ICLR 2024 A.2 DERIVATION OF qθ(ˆxt 1|xt+1) AND MORE FOR DDPM Since qθ(ˆxt 1|xt+1) contains two consecutive sampling steps: qθ(ˆxt|xt+1) and qθ(ˆxt 1|ˆxt), we can solve out qθ(ˆxt 1|xt+1) by iterative plugging-in. According to qθ(ˆxt|xt+1) = N(ˆxt; µθ(xt+1, t + 1), βt+1I) and Eq. 11, we know that qθ(ˆxt 1|ˆxt) = N(ˆxt 1; µθ(ˆxt, t), βt I) and a sample from qθ(ˆxt 1|ˆxt) is: ˆxt 1 = αt 1βt 1 αt xt θ + αt(1 αt 1) 1 αt ˆxt + q From Table 1, we know that qθ(ˆxt|xt+1) = N(ˆxt; αtx0, (1 αt + ( αtβt+1 1 αt+1 et+1)2)I), so plug in ˆxt = αtx0 + q 1 αt + ( αtβt+1 1 αt+1 et+1)2ϵ3 into Eq. 16, we know a sample from qθ(ˆxt 1|xt+1) is: ˆxt 1 = αt 1βt 1 αt xt θ + αt(1 αt 1) 1 αt ( αtx0+ 1 αt + ( αtβt+1 1 αt+1 et+1)2ϵ3)+ q By denoting ( αtβt+1 1 αt+1 et+1)2 as f(t) and plugging in xt θ = x0 + etϵ0 (Eq. 
7), we have: ˆxt 1 = αt 1βt 1 αt (x0 + etϵ0) + αt(1 αt 1) 1 αt ( αtx0 + p 1 αt + f(t)ϵ3) + q 1 αt (x0 + etϵ0) + αt(1 αt 1) 1 αt ( αtx0 + 1 αtϵ3 + 1 2 1 αt f(t)ϵ3) + q 1 αt x0 + αt 1βt 1 αt etϵ0 + αt(1 αt 1) + αt(1 αt 1) 1 αtϵ3 + 1 2 1 αt f(t)ϵ3) + q αt 1x0 + αt 1βt 1 αt etϵ0 + αt(1 αt 1) 1 αt + 1 2 1 αt f(t))ϵ3 + q Taylor s theorem is used from Eq. 18 to Eq. 19. The process from Eq. 20 to Eq. 21 is similar to the simplification from Eq. 12 to Eq. 13. From Eq. 21, we know that the mean of qθ(ˆxt 1|xt+1) is αt 1x0. We now focus on the variance: V ar(ˆxt 1) = ( αt 1βt 1 αt et)2 + ( αt(1 αt 1) 1 αt)2 + βt + ( αt(1 αt 1) 1 2 1 αt f(t))2 (22) = 1 αt 1 + ( αt 1βt 1 αt et)2 + αt(1 αt 1)2 4(1 αt)3 f(t)2 (23) The above derivation is similar to the progress from Eq. 14 to Eq. 15. Now we write the mean and variance of qθ(ˆxt 1|xt+1) in Table 7. In the same spirit of iterative plugging-in, we could derive (ˆxt|x T ) which has the mean αtx0 and variance larger than (1 αt)I. A.3 DERIVATION OF qθ(ˆxt|xt+1) FOR DDIM We first review the derivation of the reverse diffusion pθ(xt 1|xt) for DDIM. To keep the symbols consistent in this paper, we continue to use the notations of DDPM in the derivation of DDIM. Published as a conference paper at ICLR 2024 Table 7: The distribution q(xt 1|x0) during training and qθ(ˆxt 1|xt+1) during DDPM sampling. Mean Variance q(xt 1|x0) αt 1x0 (1 αt 1)I qθ(ˆxt 1|xt+1) αt 1x0 (1 αt 1 + ( αt 1βt 1 αt et)2 + αt(1 αt 1)2 4(1 αt)3 f(t)2)I Recall that DDIM and DDPM have the same loss function because they share the same marginal distribution q(xt|x0) = N(xt; αtx0, (1 αt)I). But the posterior q(xt 1|xt,x0) of DDIM is obtained under Non-Markovian diffusion process and is given by Song et al. (2021a): q(xt 1|xt,x0) = N( αt 1x0 + q 1 αt 1 σ2 t xt αtx0 1 αt , σ2 t I). (24) Similar to DDPM, the reverse distribution of DDIM is parameterized as pθ(xt 1|xt) = q(xt 1|xt,xt θ), where xt θ means the predicted x0 given xt. Based on Eq. 24, the reverse diffusion q(xt 1|xt,xt θ) is: q(xt 1|xt,xt θ) = N( αt 1xt θ + q 1 αt 1 σ2 t xt αtxt θ 1 αt , σ2 t I). (25) Again, we point out that q(xt 1|xt,xt θ) = q(xt 1|xt,x0) holds only if xt θ = x0, this requires the network to make no prediction error about x0. Theoretically, we need to consider the uncertainty of the prediction xt θ and model it as a probabilistic distribution pθ(x0|xt). Following Analytical DPM (Bao et al., 2022), we approximate it by a Gaussian distribution pθ(x0|xt) = N(xt θ;x0, e2 t I), namely xt θ = x0 + etϵ0. Thus, the practical reverse diffusion q(xt 1|xt,xt θ) is q(xt 1|xt,xt θ) = N( αt 1(x0 + etϵ0) + q 1 αt 1 σ2 t xt αt(x0 + etϵ0) 1 αt , σ2 t I). (26) Note that σt = 0 for DDIM sampler, so a sample xt 1 from q(xt 1|xt,xt θ) is: xt 1 = αt 1(x0 + etϵ0) + q 1 αt 1 σ2 t xt αt(x0 + etϵ0) 1 αt + σtϵ4 = αt 1x0 + αt 1etϵ0 + p 1 αt 1 xt αtx0 1 αt p 1 αt 1 αtetϵ0 1 αt (27) = αt 1x0 + αt 1etϵ0 + p 1 αt 1 αtetϵ0 1 αt (28) = αt 1x0 + p 1 αt 1ϵ5 + ( αt 1et p 1 αt 1 αtet 1 αt )ϵ0 (29) From Eq. 27 to Eq. 28, we plug in xt = αtx0 + 1 αtϵ5 where ϵ5 N(0,I). We now compute the sampling distribution q(ˆxt|xt+1,xt+1 θ ) which is the same distribution as q(xt 1|xt,xt θ) by replacing the index t with t + 1 and using ˆxt to highlight it is a generated sample. According to Eq. 29, a sample ˆxt from q(ˆxt|xt+1,xt+1 θ ) is: ˆxt = αtx0 + 1 αtϵ5 + ( αtet+1 1 αt αt+1et+1 1 αt+1 )ϵ0 (30) Published as a conference paper at ICLR 2024 From Eq. 30, we know the mean of q(ˆxt|xt+1,xt+1 θ ) is αtx0. 
We now calculate the variance by looking at 1 αtϵ5 + ( αtet+1 1 αt αt+1et+1 1 αt+1 )ϵ0: V ar(ˆxt) = ( 1 αt)2 + ( αtet+1 1 αt αt+1et+1 1 αt+1 )2 = 1 αt + ( αt 1 αt αt+1 1 αt+1 )2e2 t+1 = 1 αt + ( αt 1 αt αt αt+1 1 αt+1 )2e2 t+1 = 1 αt + ( αt(1 1 αt αt+1 1 αt+1 ))2e2 t+1 = 1 αt + αt(1 αt+1 αt+1 1 αt+1 )2e2 t+1 = 1 αt + (1 rαt+1 αt+1 1 αt+1 )2 αte2 t+1 (31) As a result, we can write the mean and variance of the sampling distribution q(ˆxt|xt+1,xt+1 θ ), i.e. qθ(ˆxt|xt+1), and compare it with the training distribution q(xt|x0) in Table 8. Table 8: The mean and variance of q(xt|x0) during training and qθ(ˆxt|xt+1) during DDIM sampling. Mean Variance q(xt|x0) αtx0 (1 αt)I qθ(ˆxt|xt+1) αtx0 (1 αt + (1 q 1 αt+1 )2 αte2 t+1)I Since αt+1 < 1, q 1 αt+1 < 1 and (1 q 1 αt+1 ) > 0 hold for any t in Eq. 31. Similar to DDPM sampler, the variance of q(ˆxt|xt+1,xt+1 θ ) is always larger than that of q(xt|x0) by the magnitude (1 q 1 αt+1 )2 αte2 t+1, indicating the exposure bias issue in DDIM sampler. A.4 PRACTICAL VARIANCE ERROR OF qθ(ˆxt|xt+1) AND qθ(ˆxt|x T ) We measure the single-step variance error of qθ(ˆxt|xt+1) and multi-step variance error of qθ(ˆxt|x T ) using Algorithm 1 and Algorithm 2, respectively. Note that, the multi-step variance error measurement is similar to the exposure bias δt evaluation and we denote the single-step variance error as t and represent the multi-step variance error as t. The experiments are implemented on CIFAR-10 (Krizhevsky et al., 2009) dataset and ADM model (Dhariwal & Nichol, 2021). The key difference between t and t measurement is that the former can get access to the ground truth input xt at each sampling step t, while the latter is only exposed to the predicted ˆxt in the iterative sampling process. Published as a conference paper at ICLR 2024 Algorithm 1 Variance error under single-step sampling 1: Initialize t = 0, nt = list() ( t {1, ..., T 1}) 2: for t := T, ..., 1 do 3: repeat 4: x0 q(x0), ϵ N(0,I) 5: xt = αtx0 + 1 αtϵ 6: ˆxt 1 = 1 αt (xt βt 1 αtϵθ(xt, t)) + q βtz (z N(0,I)) 7: nt 1.append(ˆxt 1 αt 1x0) 8: until 50k iterations 9: end for 10: for t := T, ..., 1 do 11: ˆβt = numpy.var(nt) 12: t = ˆβt βt 13: end for Algorithm 2 Variance error under multi-step sampling 1: Initialize δt = 0, nt = list() ( t {1, ..., T 1}) 2: repeat 3: x0 q(x0), ϵ N(0,I) 4: x T = αTx0 + 1 αTϵ 5: for t := T, ..., 1 do 6: if t == T then ˆxt = x T 7: ˆxt 1 = 1 αt (ˆxt βt 1 αtϵθ(ˆxt, t)) + q βtz (z N(0,I)) 8: nt 1.append(ˆxt 1 αt 1x0) 9: end for 10: until 50k iterations 11: for t := T, ..., 1 do 12: ˆβt = numpy.var(nt) 13: t = ˆβt βt 14: end for A.5 METRIC FOR EXPOSURE BIAS The key step of Algorithm 3 is that we subtract the mean αt 1x0 and the remaining term ˆxt 1 αt 1x0 corresponds to the stochastic term of q(ˆxt 1|xt,xt θ). In our experiments, we use N = 50, 000 samples to compute the variance ˆβt. Algorithm 3 Measurement of Exposure Bias δt 1: Initialize δt = 0, nt = list() ( t {1, ..., T 1}) 2: repeat 3: x0 q(x0), ϵ N(0,I) 4: compute x T using Eq. 1 5: for t := T, ..., 1 do 6: if t == T then ˆxt = x T 7: ˆxt 1 = 1 αt (ˆxt βt 1 αtϵθ(ˆxt, t)) + q βtz (z N(0,I)) 8: nt 1.append(ˆxt 1 αt 1x0) 9: end for 10: until N iterations 11: for t := T, ..., 1 do 12: ˆβt = numpy.var(nt) 13: δt = ( q 14: end for A.6 CORRELATION BETWEEN EXPOSURE BIAS METRIC AND FID 5 6 7 8 9 10 FID Figure 8: Correlation between FID - δ1. 
We define the exposure bias at timestep t as δt = ( p β t p βt)2, where βt = 1 αt denotes the variance of q(xt|x0) during training and β t presents the variance of qθ(ˆxt|x T ) in the regular sampling process. Although δt measures the discrepancy between network inputs and FID to evaluate the difference between training data and network outputs, we empirically find a strong correlation between δt and FID, which could arise from the benefit of defining δt from the Fr echet distance Dowson & Landau Published as a conference paper at ICLR 2024 (1982) perspective. In Fig. 8, we present the FID-δ1 relationships on CIFAR-10 and use 20-step sampling, wherein δ1 represents the exposure bias in the last sampling step t = 1. Additionally, δt has the advantage of indicating the network input quality at any intermediate timestep t. Taking Fig. 4 as an example, we can see that the input quality decreases dramatically near the end of sampling (t = 1) as δt increases significantly. A.7 FID RESULTS ON PFGM++ BASELINE In PFGM, Xu et al. (2022) claim that the strong norm-t correlation causes the sensitivity of prediction errors in diffusion frameworks, and PFGM is more resistant to prediction errors due to the greater range of training sample norms. Our experiments verify the robustness of Poisson Flow frameworks by observing that Epsilon Scaling enjoys a wider range of λ(t) on PFGM++ baseline (Xu et al., 2023b) than on EDM baseline. The rationale is simple: an inappropriate λ((t)) introduces errors that require the network to be robust to counteract. However, we observe that the exposure bias issue still exists in Poisson Flow frameworks even though they are less sensitive to prediction errors than diffusion frameworks. Therefore, Epsilon Scaling is applicable in Poisson Flow models. Table 9 and Table 10 show the significant improvements made by Epsilon Scaling (λ(t) = b) on PFGM++. Table 9: FID on CIFAR-10 using PFGM++ baseline (D=128, uncond). PFGM++ 37.82 7.55 2.34 1.92 PFGM++ with ES 17.91 (b=1.008) 4.51 (b=1.016) 2.31 (b=0.970) 1.91 (b=1.00045) Table 10: FID on CIFAR-10 using PFGM++ baseline (D=2048, uncond). PFGM++ 37.16 7.34 2.31 1.91 PFGM++ with ES 17.27 (b=1.007) 4.88 (b=1.015) 2.20 (b=0.996) 1.90 (b=1.0007) A.8 FID RESULTS UNDER T > 100 Although using sampling step T = 100 often achieves the near-optimal FID in diffusion models, we still report the performance of our Epsilon Scaling in the large T regions. We show the FID results on Analytic-DDPM (Bao et al., 2022) baseline (Table 11) and ADM baseline (Table 12), where we apply λ(t) = b for Epsilon Scaling. Table 11: FID on CIFAR-10 using Analytic-DDPM baseline (linear noise schedule). 20 50 100 200 400 1000 Analytic-DDPM 14.61 7.25 5.40 4.01 3.62 4.03 Analytic-DDPM with ES 11.02 5.03 4.09 3.39 3.14 3.42 Table 12: FID on CIFAR-10 using ADM baseline. 200 300 1000 ADM 3.04 2.95 3.01 ADM-ES 2.15 2.14 2.21 Published as a conference paper at ICLR 2024 A.9 RECALL AND PRECISION RESULTS Our method Epsilon Scaling does not affect the recall and precision of the base model. We present the complete recall and precision (Kynk a anniemi et al., 2019) results in Table 13 using the code provided by ADM (Dhariwal & Nichol, 2021). ADM-ES achieve higher recalls and slightly lower previsions across the five datasets. But the overall differences are minor. Table 13: Recall and precision of ADM and ADM-ES using 100-step sampling. 
Model CIFAR-10 32 32 LSUN tower 64 64 FFHQ 128 128 Image Net 64 64 Image Net 128 128 recall precision recall precision recall precision recall precision recall precision ADM 0.591 0.691 0.605 0.645 0.497 0.696 0.621 0.738 0.586 0.771 ADM-ES 0.613 0.684 0.606 0.641 0.545 0.683 0.632 0.726 0.592 0.771 A.10 EPSILON SCALING PARAMETERS: k, b We present the parameters k, b of Epsilon Scaling we used in all of our experiments in Table 14, Table 15 and Table 16 for reproducibility. Apart from that, we provide suggestions on finding the optimal parameters even though they are dependent on the dataset and how well the base model is trained. Our suggestions are: Search for the optimal uniform schedule λ(t) = b in a coarse-to-fine manner: use stride 0.001, 0.0005, 0.0001 progressively. In general, the optimal b will decrease as the sampling step T increases. After getting the optimal uniform schedule λ(t) = b , we search for k of the linear schedule λ(t) = kt+b by keeping Σλ(t) = Σλ(t) , thereby, b in the linear schedule is calculated rather than being searched. Instead of generating 50k samples, using 10k samples to compute FID for searching λ(t). Table 14: Epsilon Scaling schedule λ(t) = kt + b we used on ADM baseline. We keep the FID results in the table for comparisons and remark k, b underneath FIDs T Model Unconditional Conditional CIFAR-10 32 32 LSUN tower 64 64 FFHQ 128 128 Image Net 64 64 Image Net 128 128 100 ADM 3.37 3.59 14.52 2.71 3.55 ADM-ES 2.17 (b=1.017) 2.91 (b=1.006) 6.77 (b=1.005) 2.39 (b=1.006) 3.37 (b=1.004) 50 ADM 4.43 7.28 26.15 3,75 5.15 ADM-ES 2.49 (b=1.017) 3.68 (b=1.007) 9.50 (b=1.007) 3.07 (b=1.006) 4.33 (b=1.004) 20 ADM 10.36 23.92 59.35 10.96 12.48 ADM-ES 5.15 (b=1.017) 8.22 (b=1.011) 26.14 (b=1.008) 7.52 (b=1.006) 9.95 (b=1.005) ADM-ES 4.31 (k=0.0025, b=1.0) 7.60 (k=0.0008, b=1.0034) 24.83 (k=0.0004, b=1.0042) 7.37 (k=0.0002, b=1.0041) 9.86 (k=0.00022, b=1.00291) A.11 INSENSITIVITY OF λ(t) We empirically show that the FID gain can be achieved within a wide range of λ(t). Table 17 and Table 18 demonstrate the insensitivity of our hyperparameters, which ease the search of λ(t). Published as a conference paper at ICLR 2024 Table 15: Epsilon Scaling schedule λ(t) = kt + b, (k = 0) we used on DDIM/DDPM and LDM baseline. We keep the FID results in the table for comparisons and remark b underneath FIDs T Model CIFAR-10 32 32 Celeb A 64 64 T Model FFHQ 256 256 Celeb A-HQ 256 256 η = 0 η = 1 η = 0 η = 1 100 DDIM 4.06 6.73 5.67 11.33 100 LDM 10.90 9.31 DDIM-ES 3.38 (b=1.0014) 4.01 (b=1.03) 5.05 (b=1.003) 4.45 (b=1.04) LDM-ES 9.83 (b=1.00015) 7.36 (b=1.0009) 50 DDIM 4.82 10.29 6.88 15.09 50 LDM 14.34 13.95 DDIM-ES 4.17 (b=1.0030) 4.57 (b=1.04) 6.20 (b=1.004) 5.57 (b=1.05) LDM-ES 11.57 (b=1.0016) 9.16 (b=1.003) 20 DDIM 8.21 20.15 10.43 22.61 20 LDM 33.13 29.62 DDIM-ES 6.54 (b=1.0052) 7.78 (b=1.05) 10.38 (b=1.001) 11.83 (b=1.06) LDM-ES 20.91 (b=1.007) 15.68 (b=1.010) Table 16: Epsilon Scaling schedule λ(t) = kt + b, (k = 0) we used on EDM baseline. We keep the FID results in the table for comparisons and remark b underneath FIDs T Model Unconditional Conditional Heun Euler Heun Euler 35 EDM 1.97 3.81 1.82 3.74 EDM-ES 1.95 b=1.0005 2.80 b=1.0034 1.80 b=1.0006 2.59 b=1.0035 21 EDM 2.33 6.29 2.17 5.91 EDM-ES 2.24 b=0.9985 4.32 b=1.0043 2.08 b=0.9983 3.74 b=1.0045 13 EDM 7.16 12.28 6.69 10.66 EDM-ES 6.54 b=1.0060 8.39 b=1.0048 6.16 b=1.0070 6.59 b=1.0051 Table 17: FID of ADM-ES achieved on CIFAR-10 under different λ(t) (λ(t) = b) and unconditional sampling, b=1 represents ADM. 
b 1 (baseline) 1.015 1.016 1.017 1.018 1.019 T = 100 3.37 2.20 2.18 2.17 2.21 2.31 b 1 (baseline) 1.015 1.016 1.017 1.018 1.019 T = 50 4.43 2.53 2.51 2.49 2.53 2.55 Table 18: FID of EDM-ES achieved on CIFAR-10 under different λ(t) (λ(t) = b) and unconditional Heun sampler, b=1 represents EDM. b 1 (baseline) 1.0005 1.0006 1.0007 1.0008 T = 35 1.97 1.948 1.947 1.949 1.953 b 1 (baseline) 1.004 1.005 1.006 1.007 T = 13 7.16 6.60 6.55 6.54 6.55 Published as a conference paper at ICLR 2024 A.12 EPSILON SCALING ALLEVIATES EXPOSURE BIAS In Section 5.2, we have explicitly shown that Epsilon Scaling reduces the exposure bias of diffusion models via refining the sampling trajectory and achieves a lower δt on CIFAR-10 dataset. We now replicate these experiments on other datasets using the same base model ADM and 20-step sampling. Fig. 9 and Fig. 10 display the corresponding results on CIFAR-10 and FFHQ 128 128 datasets. Similar to the phenomenon on LSUN tower 64 64 (Fig. 5 and Fig. 4), Epsilon Scaling consistently obtains a smaller exposure bias δt and pushes the sampling trajectory to the vector field learned in the training stage. 2 4 6 8 10 12 14 16 18 timestep (a) Exposure bias measured by δt on CIFAR-10 0 3 6 9 12 15 18 21 timestep training sampling sampling after Epsilon Scaling (b) L2-norm of ϵθϵθϵθ( ) on CIFAR-10 Figure 9: Left: Epsilon Scaling achieves a smaller δt at the end of sampling (t = 1). Right: after applying Epsilon Scaling, the sampling ϵθ 2 (blue) gets closer to the training ϵθ 2 (red) 2 4 6 8 10 12 14 16 18 timestep (a) Exposure bias measured by δt on FFHQ 128 128 0 3 6 9 12 15 18 21 timestep training sampling sampling after Epsilon Scaling (b) L2-norm of ϵθϵθϵθ( ) on FFHQ 128 128 Figure 10: Left: Epsilon Scaling achieves a smaller δt at the end of sampling (t = 1). Right: after applying Epsilon Scaling, the sampling ϵθ 2 (blue) gets closer to the training ϵθ 2 (red). A.13 FID RESULTS ON DIT BASELINE In addition to UNet (Ronneberger et al., 2015) backbone, we also test Epsilon Scaling on diffusion models using Vision Transformers (Dosovitskiy et al., 2020). Table 19 presents the FID of Di T (Peebles & Xie, 2023) and our Di T-ES on Image Net 256 256 under uniform λ(t) = b. Again, Epsilon Scaling achieves consistent sample quality improvement over different sampling steps, indicating that the exposure bias exists in both UNet and Transformer backbones. Published as a conference paper at ICLR 2024 Table 19: FID on Image Net 256 256 using Di T baseline Di T 12.95 3.71 2.57 Di T-ES 10.00 (b=0.965) 3.30 (b=0.989) 2.52 (b=0.995) A.14 QUALITATIVE COMPARISON In Section 5.6, we have presented the sample quality comparison between the base model sampling and Epsilon Scaling sampling on FFHQ 128 128 dataset. Applying the same experimental settings, we show more qualitative contrasts between ADM and ADM-ES on the dataset CIFAR-10 32 32 (Fig. 11), LSUN tower 64 64 (Fig. 12), Image Net 64 64 (Fig. 13) and Image Net 128 128 (Fig. 14). Also, we provide the qualitative comparison between LDM and LDM-ES on the dataset Celeb A-HQ 256 256 (Fig. 15). These sample comparisons clearly state that Epsilon Scaling effectively improves the sample quality from various perspectives, including illumination, colour, object coherence, background details and so on. Figure 11: Qualitative comparison between ADM (first row) and ADM-ES (second row) on CIFAR10 32 32 using 100-step sampling. Figure 12: Qualitative comparison between ADM (first row) and ADM-ES (second row) on LSUN tower 64 64 using 100-step sampling. 
Figure 13: Qualitative comparison between ADM (first row) and ADM-ES (second row) on ImageNet 64×64 using 100-step sampling.

Figure 14: Qualitative comparison between ADM (first row) and ADM-ES (second row) on ImageNet 128×128 using 100-step sampling.

Figure 15: Qualitative comparison between LDM (first row) and LDM-ES (second row) on CelebA-HQ 256×256 using 100-step sampling.