# ON ERROR PROPAGATION OF DIFFUSION MODELS

Published as a conference paper at ICLR 2024

Yangming Li, Mihaela van der Schaar
Department of Applied Mathematics and Theoretical Physics, University of Cambridge
yl874@cam.ac.uk

Although diffusion models (DMs) have shown promising performance in a number of tasks (e.g., speech synthesis and image generation), they might suffer from error propagation because of their sequential structure. However, this is not certain, because some sequential models, such as Conditional Random Fields (CRFs), are free from this problem. To address this issue, we develop a theoretical framework to mathematically formulate error propagation in the architecture of DMs. The framework contains three elements: modular error, cumulative error, and propagation equation. The modular and cumulative errors are related by the propagation equation, which shows that DMs are indeed affected by error propagation. Our theoretical study also suggests that the cumulative error is closely related to the generation quality of DMs. Based on this finding, we apply the cumulative error as a regularization term to reduce error propagation. Because the term is computationally intractable, we derive its upper bound and design a bootstrap algorithm to efficiently estimate the bound for optimization. We have conducted extensive experiments on multiple image datasets, showing that our proposed regularization reduces error propagation, significantly improves vanilla DMs, and outperforms previous baselines.

1 INTRODUCTION

DMs potentially suffer from error propagation. While diffusion models (DMs) have shown impressive performance in a number of tasks (e.g., image synthesis (Rombach et al., 2022), speech processing (Kong et al., 2021), and natural language generation (Li et al., 2022)), they might still suffer from error propagation (Motter & Lai, 2002), a classical problem in engineering practice (e.g., communication networks) (Fu et al., 2020; Motter & Lai, 2002; Crucitti et al., 2004). The problem mainly affects chain models that consist of many end-to-end connected modules: the output error of one module spreads to the subsequent modules, so errors accumulate along the chain. Since DMs are exactly of a chain structure and have a large number of modules (e.g., 1000 for DDPM (Ho et al., 2020)), the effect of error propagation on diffusion models is potentially notable and worth a careful study.

Current works lack reliable explanations. Some recent works (Ning et al., 2023; Li et al., 2023; Daras et al., 2023) have noticed this problem and named it exposure bias or sampling drift. However, those works claim that error propagation happens to DMs simply because the models are of a cascade structure. In fact, many sequential models (e.g., CRF (Lafferty et al., 2001) and CTC (Graves et al., 2006)) are free from the problem. Therefore, a solid explanation is expected to answer whether (or even why) error propagation affects DMs.

Our theory for the error propagation of DMs. One main focus of this work is to develop a theoretical framework for analyzing the error propagation of DMs. With this framework, we can clearly understand how this problem is mathematically formulated in the architecture of DMs and easily see whether it has a significant impact on the models. Our framework contains three elements: modular error, cumulative error, and propagation equation.
The first two elements respectively measure the prediction error of one single module and the accumulated error of multiple modules, while the last one tells how these errors are related. We first properly define the errors and then derive the equation, which will be applied to explain why error propagation happens to DMs. We will see that a term called the amplification factor in the propagation equation determines whether error accumulation happens. Besides the theoretical study, we also perform empirical experiments to verify whether our theory applies to reality.

Why and how to reduce error propagation. In our theoretical study, we will see that the cumulative error actually reflects the generation quality of DMs. Hence, if error propagation is serious in DMs, then we can improve generation performance by reducing it. A simple way to reduce error propagation is to treat the cumulative error as a regularization term and directly minimize it at training time. However, we will see that this error is computationally infeasible, in terms of both density estimation and the inefficient backward process. Therefore, we first introduce an upper bound of the error, which avoids estimating densities. Then, we design a bootstrap algorithm (inspired by TD learning (Tesauro et al., 1995; Osband et al., 2016)) to efficiently estimate the bound, though with a certain bias.

Contributions. The contributions of our paper are as follows:

- We develop a theoretical framework for analyzing the error propagation of DMs, which explains whether the problem affects DMs and why it should be solved. Compared with our framework, previous explanations are just based on an unreliable intuition;
- In light of our theoretical study, we introduce a regularization term to reduce error propagation. Since the term is computationally infeasible, we derive its upper bound and design a bootstrap algorithm to efficiently estimate the bound for optimization;
- We have conducted extensive experiments on multiple image datasets, demonstrating that our proposed regularization reduces error propagation, significantly improves the performances of vanilla DMs, and outperforms previous baselines.

2 BACKGROUND: DISCRETE-TIME DMS

In this section, we briefly review the mainstream architecture of DMs (i.e., DDPM). The notations and terminologies introduced below will be used in our subsequent sections.

A DM consists of two Markov chains of $T$ steps. One is the forward process, which incrementally adds Gaussian noise to a real sample $x_0 \sim q(x_0)$, a $K$-dimensional vector. In this process, a sequence of latent variables $x_{1:T} = [x_1, x_2, \ldots, x_T]$ is generated in order, and the last one $x_T$ approximately follows a standard Gaussian:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\big), \tag{1}$$

where $\mathcal{N}$ denotes a Gaussian distribution, $I$ is an identity matrix, and $\beta_t$, $1 \le t \le T$, is a predefined variance schedule. The other is the backward process, a reverse version of the forward process, which first draws an initial sample $x_T$ from a standard Gaussian $p(x_T) = \mathcal{N}(0, I)$ and then gradually denoises it into a sequence of latent variables $x_{T-1:0} = [x_{T-1}, x_{T-2}, \ldots, x_0]$ in reverse order:

$$p_\theta(x_{T:0}) = p(x_T) \prod_{t=T}^{1} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\big), \tag{2}$$

where $p_\theta(x_{t-1} \mid x_t)$ is a denoising module with the parameter $\theta$ shared across different iterations, and $\sigma_t^2 I$ is a predefined covariance matrix.
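To make the two processes concrete, here is a minimal PyTorch-style sketch (not from the paper) of the closed-form forward noising $q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar\alpha_t}\, x_0, (1 - \bar\alpha_t) I)$ used by DDPM, with $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{t' \le t} \alpha_{t'}$ (these quantities reappear in Eq. (4) below), and of one reverse step of Eq. (2). The function `eps_theta` stands for any noise-prediction network; the linear $\beta_t$ schedule and $\sigma_t^2 = \beta_t$ are assumed DDPM-style choices.

```python
import torch

# Linear variance schedule beta_1..beta_T and derived quantities (0-based tensors).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # assumed DDPM-style schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def q_sample(x0, t, eps=None):
    """Forward process in closed form: x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    if eps is None:
        eps = torch.randn_like(x0)
    ab = alpha_bars[t]
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

def p_sample_step(eps_theta, xt, t):
    """One reverse step of Eq. (2), using the noise-prediction mean of Eq. (4) and sigma_t^2 = beta_t.
    Here t is the 0-based index, so t == 0 is the final step that outputs x_0."""
    beta, alpha, ab = betas[t], alphas[t], alpha_bars[t]
    mean = (xt - beta / (1.0 - ab).sqrt() * eps_theta(xt, t)) / alpha.sqrt()
    if t == 0:
        return mean
    return mean + beta.sqrt() * torch.randn_like(xt)
```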
Since the negative log-likelihood $\mathbb{E}[-\log p_\theta(x_0)]$ is computationally intractable, common practices apply Jensen's inequality to derive its upper bound for optimization:

$$\mathbb{E}_{x_0 \sim q(x_0)}[-\log p_\theta(x_0)] \le \mathbb{E}_q\Big[ D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\, p(x_T)\big) - \log p_\theta(x_0 \mid x_1) + \sum_{1 < t \le T} D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\, p_\theta(x_{t-1} \mid x_t)\big) \Big] = \mathcal{L}^{\mathrm{nll}}, \tag{3}$$

where $D_{\mathrm{KL}}$ denotes the KL divergence and each term has an analytic form for feasible computation.

Figure 1: A toy example (with $T = 3$) illustrating our theoretical framework for the error propagation of diffusion models. We use dashed lines to indicate that the impact of the cumulative error $\mathcal{E}^{\mathrm{cumu}}_{t+1}$ on the denoising module $p_\theta(x_{t-1} \mid x_t)$ is defined at the distributional (not sample) level.

Ho et al. (2020) further applied some reparameterization tricks to the loss $\mathcal{L}^{\mathrm{nll}}$ to reduce its estimation variance. For example, the mean $\mu_\theta$ is formulated as

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t)\Big), \tag{4}$$

in which $\alpha_t = 1 - \beta_t$, $\bar\alpha_t = \prod_{t'=1}^{t} \alpha_{t'}$, and $\epsilon_\theta$ is parameterized by a neural network. Under this scheme, the loss $\mathcal{L}^{\mathrm{nll}}$ can be simplified as

$$\mathcal{L}^{\mathrm{nll}} = \sum_{t=1}^{T} \underbrace{\mathbb{E}_{x_0 \sim q(x_0),\, \epsilon \sim \mathcal{N}(0, I)}\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\ t\big)\big\|^2\Big]}_{\mathcal{L}^{\mathrm{nll}}_t}, \tag{5}$$

where the neural network $\epsilon_\theta$ is tasked to fit the Gaussian noise $\epsilon$.

3 THEORETICAL STUDY

In this section, we first introduce our theoretical framework and discuss why current explanations (for why error propagation happens to DMs) are not reliable. Then, we seek to define (or derive) the elements of our framework and apply them to give a solid explanation.

3.1 ANALYSIS FRAMEWORK

Elements of the framework. Fig. 1 illustrates a preview of our framework. We can see from the preview that we have to find the following three elements from DMs:

- Modular error $\mathcal{E}^{\mathrm{mod}}_t$, which measures how accurately every module maps its input to its output. It only relies on the module at iteration $t$ and is independent of the others. Intuitively, the error is non-negative: $\mathcal{E}^{\mathrm{mod}}_t \ge 0$, $\forall t \in [1, T]$;
- Cumulative error $\mathcal{E}^{\mathrm{cumu}}_t$, which is also non-negative and measures the amount of error accumulated by sequentially running the first $T - t + 1$ denoising modules. Unlike the modular error, its value is influenced by more than one module;
- Propagation equation $f(\mathcal{E}^{\mathrm{cumu}}_1, \mathcal{E}^{\mathrm{cumu}}_2, \ldots, \mathcal{E}^{\mathrm{cumu}}_T, \mathcal{E}^{\mathrm{mod}}_1, \mathcal{E}^{\mathrm{mod}}_2, \ldots, \mathcal{E}^{\mathrm{mod}}_T) = 0$, which tells how the modular and cumulative errors of different iterations are related. It is generally very hard to find such an equation because the model is non-linear.

Interpretation of the propagation equation. In a DM, the input error to every module $p_\theta(x_{t-1} \mid x_t)$, $t \in [1, T]$, can be regarded as $\mathcal{E}^{\mathrm{cumu}}_{t+1}$. Since module $p_\theta(x_{t-1} \mid x_t)$ is parameterized by a nonlinear neural network, it is not certain whether its input error $\mathcal{E}^{\mathrm{cumu}}_{t+1}$ is magnified or reduced in the error contained in its output, $\mathcal{E}^{\mathrm{cumu}}_t$. Therefore, we expect the propagation equation to include, or simply take the form of, the following recursion:

$$\mathcal{E}^{\mathrm{cumu}}_t = \mathcal{E}^{\mathrm{mod}}_t + \mu_t\, \mathcal{E}^{\mathrm{cumu}}_{t+1}, \tag{6}$$

which specifies how much of the input error $\mathcal{E}^{\mathrm{cumu}}_{t+1}$ the module $p_\theta(x_{t-1} \mid x_t)$ spreads to the subsequent modules. For example, $\mu_t < 1$ means the module can discount the input cumulative error $\mathcal{E}^{\mathrm{cumu}}_{t+1}$, which helps avoid error accumulation. We term $\mu_t$ the amplification factor; it satisfies $\mu_t \ge 0$ because the cumulative error $\mathcal{E}^{\mathrm{cumu}}_t$ at least contains the prediction error of the denoising module $p_\theta(x_{t-1} \mid x_t)$. With this equation, if we have $\mu_t \ge 1$, $\forall t \in [1, T]$, then we can assert that the DM is affected by error propagation, because in that case every module $p_\theta(x_{t-1} \mid x_t)$ either fully spreads or even magnifies its input error to the subsequent modules.
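As a quick numerical illustration (not from the paper), the sketch below unrolls the recursion in Eq. (6) for a constant modular error and two hypothetical amplification factors: with $\mu_t \ge 1$ the cumulative error grows monotonically along the chain, whereas with $\mu_t < 1$ it stays bounded.

```python
def unroll_cumulative_error(T, modular_error, mu):
    """Unroll Eq. (6), E^cumu_t = E^mod_t + mu_t * E^cumu_{t+1}, from t = T down to t = 1.
    Returns the cumulative errors indexed by t, with E^cumu_{T+1} defined as 0."""
    e_cumu = {T + 1: 0.0}
    for t in range(T, 0, -1):                # backward process runs t = T, ..., 1
        e_cumu[t] = modular_error + mu * e_cumu[t + 1]
    return e_cumu

# Hypothetical numbers, only to illustrate the two regimes of the amplification factor.
print(unroll_cumulative_error(T=5, modular_error=0.1, mu=1.0)[1])   # 0.5: errors add up
print(unroll_cumulative_error(T=5, modular_error=0.1, mu=0.5)[1])   # ~0.194: stays below 0.1/(1-0.5) = 0.2
```

The question, answered in Sec. 3.3, is which regime standard DMs actually fall into.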
Why current explanations are not solid. Current works (Ning et al., 2023; Li et al., 2023; Daras et al., 2023) claim that DMs are affected by error propagation just because they are of a chain structure. Based on our theoretical framework, we can see that this intuition is not reliable. For example, if $\mu_t = 0$, $\forall t \in [1, T]$, every module $p_\theta(x_{t-1} \mid x_t)$ of the DM will fully discount its input cumulative error $\mathcal{E}^{\mathrm{cumu}}_{t+1}$, which avoids error propagation.

3.2 ERROR DEFINITIONS

In this part, we formally define the modular and cumulative errors, which respectively measure the prediction error of one module and the accumulated error of multiple successive modules.

Derivation of the modular error. In common practices (Ho et al., 2020; Song et al., 2021b), every denoising module $p_\theta(x_{t-1} \mid x_t)$, $t \in [1, T]$, in a diffusion model is optimized to exactly match the posterior probability $q(x_{t-1} \mid x_t)$. Considering this fact and the form of the loss terms in Eq. (3), we apply the KL divergence to measure the discrepancy between those two conditional distributions, $D_{\mathrm{KL}}(p_\theta(\cdot) \,\|\, q(\cdot))$, and treat it as the modular error. To further get rid of the dependence on the variable $x_t$, we apply an expectation operation $\mathbb{E}_{x_t \sim p_\theta(x_t)}$ to the error.

Definition 3.1 (Modular Error). For the forward process defined in Eq. (1) and the backward process defined in Eq. (2), the modular error $\mathcal{E}^{\mathrm{mod}}_t$ of every denoising module $p_\theta(x_{t-1} \mid x_t)$, $t \in [1, T]$, in a diffusion model is measured as

$$\mathcal{E}^{\mathrm{mod}}_t = \mathbb{E}_{x_t \sim p_\theta(x_t)}\big[ D_{\mathrm{KL}}\big(p_\theta(x_{t-1} \mid x_t)\,\|\, q(x_{t-1} \mid x_t)\big)\big]. \tag{7}$$

The inequality $\mathcal{E}^{\mathrm{mod}}_t \ge 0$ always holds since the KL divergence is non-negative.

Derivation of the cumulative error. Considering the fact (Ho et al., 2020) that $q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar\alpha_t}\, x_0, (1 - \bar\alpha_t) I)$, we can reparameterize the variable $x_t$ as $\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$. Therefore, the loss term $\mathcal{L}^{\mathrm{nll}}_t$ in Eq. (5) can be reformulated as

$$\mathcal{L}^{\mathrm{nll}}_t = \mathbb{E}_{x_0 \sim q(x_0),\, x_t \sim q(x_t \mid x_0)}\bigg[\Big\|\frac{x_t - \sqrt{\bar\alpha_t}\, x_0}{\sqrt{1 - \bar\alpha_t}} - \epsilon_\theta(x_t, t)\Big\|^2\bigg].$$

From this equality, we can see that the input $x_t$ to the module $p_\theta(x_{t-1} \mid x_t)$ at training time follows a fixed distribution specified by the forward process $q(x_{T:0})$:

$$\int_{x_0} q(x_0)\, q(x_t \mid x_0)\, dx_0 = \int_{x_0} q(x_t, x_0)\, dx_0 = q(x_t), \tag{8}$$

which is inconsistent with the input distribution that the module $p_\theta(x_{t-1} \mid x_t)$ receives from its preceding denoising modules $\{p_\theta(x_{t'-1} \mid x_{t'}) \mid t' \in [t+1, T]\}$ during evaluation:

$$\int_{x_{T:t+1}} p(x_T) \prod_{i=T}^{t+1} p_\theta(x_{i-1} \mid x_i)\, dx_{T:t+1} = \int_{x_{T:t+1}} p_\theta(x_t, x_{T:t+1})\, dx_{T:t+1} = p_\theta(x_t). \tag{9}$$

The discrepancy between these two distributions, $q(x_t)$ and $p_\theta(x_t)$, is essentially the input error to the denoising module $p_\theta(x_{t-1} \mid x_t)$. As with the modular error, we apply the KL divergence to measure this discrepancy, $D_{\mathrm{KL}}(p_\theta(x_t)\,\|\, q(x_t))$. Because the input error of module $p_\theta(x_{t-1} \mid x_t)$ can be regarded as the accumulated output error of all its previous modules, we resort to that input error to define the cumulative error.

Definition 3.2 (Cumulative Error). Given the forward and backward processes respectively defined in Eq. (1) and Eq. (2), the cumulative error $\mathcal{E}^{\mathrm{cumu}}_t$ of the diffusion model at iteration $t \in [1, T]$ (caused by the first $T - t + 1$ modules) is measured as

$$\mathcal{E}^{\mathrm{cumu}}_t = D_{\mathrm{KL}}\big(p_\theta(x_{t-1})\,\|\, q(x_{t-1})\big). \tag{10}$$

The inequality $\mathcal{E}^{\mathrm{cumu}}_t \ge 0$ always holds because the KL divergence is non-negative.
Remark 3.1. The error $\mathcal{E}^{\mathrm{cumu}}_t$ in fact reflects the performance of DMs in data generation. For instance, when $t = 1$, the cumulative error $\mathcal{E}^{\mathrm{cumu}}_1 = D_{\mathrm{KL}}(p_\theta(x_0)\,\|\, q(x_0))$ indicates whether the generated samples are distributionally consistent with real data. Therefore, a method for reducing error propagation is expected to improve the performance of DMs.

Remark 3.2. In Appendix A, we prove that $\mathcal{E}^{\mathrm{cumu}}_t = 0$ is achievable in an ideal case: the diffusion model is perfectly optimized (i.e., $p_\theta(x_{t'-1} \mid x_{t'}) = q(x_{t'-1} \mid x_{t'})$, $\forall t' \in [1, T]$).

3.3 PROPAGATION EQUATION

The propagation equation describes how the modular and cumulative errors of different iterations are related. We derive that equation for diffusion models as below.

Theorem 3.1 (Propagation Equation). For the forward and backward processes respectively defined in Eq. (1) and Eq. (2), suppose that the output of the neural network $\epsilon_\theta(\cdot)$ (as defined in Eq. (4)) follows a standard Gaussian regardless of the input distribution and that the entropy of the distribution $p_\theta(x_t)$ is non-increasing with decreasing iteration $t$; then the following inequality holds:

$$\mathcal{E}^{\mathrm{cumu}}_t \ge \mathcal{E}^{\mathrm{cumu}}_{t+1} + \mathcal{E}^{\mathrm{mod}}_t, \tag{11}$$

where $t \in [1, T]$ and we specially set $\mathcal{E}^{\mathrm{cumu}}_{T+1} = 0$.

Remark 3.3. The theorem is based on two main intuitive assumptions: 1) From Eq. (5), we can see that the neural network $\epsilon_\theta$ aims to fit Gaussian noise $\epsilon$ and is shared by all denoising iterations, which means it takes inputs from various distributions $[q(x_T), q(x_{T-1}), \ldots, q(x_1)]$. Therefore, it is reasonable to assume that the output of the neural network $\epsilon_\theta$ follows a standard Gaussian distribution, independent of the input distribution; 2) The backward process is designed to incrementally denoise Gaussian noise $x_T \sim \mathcal{N}(0, I)$ into a real sample $x_0$. Ideally, the uncertainty (i.e., noise level) of the distribution $p_\theta(x_t)$ gradually decreases in that denoising process. From this view, it makes sense to assume that the differential entropy $H(p_\theta(x_t))$ decreases with decreasing iteration $t$.

Proof. We provide a complete proof of this theorem in Appendix C.

Are DMs affected by error propagation? By comparing Eq. (11) and Eq. (6), we can see that $\mu_t \ge 1$, $\forall t \in [1, T]$. Based on our discussion in Sec. 3.1, we can assert that error propagation happens to standard DMs. The basic idea: $\mu_t \ge 1$ means that the module $p_\theta(x_{t-1} \mid x_t)$ at least fully spreads the input cumulative error $\mathcal{E}^{\mathrm{cumu}}_{t+1}$ to its subsequent modules.

4 EMPIRICAL STUDY: CUMULATIVE ERROR ESTIMATION

Besides the theoretical study, it is also important to empirically verify whether diffusion models are affected by error propagation. Hence, we aim to numerically compute the cumulative error $\mathcal{E}^{\mathrm{cumu}}_t$ and show how it changes with decreasing iteration $t \in [1, T]$. By doing so, we can also verify whether our theory (e.g., the propagation equation, Eq. (11)) applies to reality.

Because the marginal distributions $p_\theta(x_{t-1})$ and $q(x_{t-1})$ have no closed-form solutions, the KL divergence between them (i.e., the cumulative error $\mathcal{E}^{\mathrm{cumu}}_t$) is computationally infeasible. To address this problem, we adopt the maximum mean discrepancy (MMD) (Gretton et al., 2012) to estimate the error $\mathcal{E}^{\mathrm{cumu}}_t$ and prove that it is in fact tightly bounded by MMD. With MMD, we can define another type of cumulative error $\mathcal{D}^{\mathrm{cumu}}_t$, $t \in [1, T]$:

$$\mathcal{D}^{\mathrm{cumu}}_t = \big\|\, \mathbb{E}_{x_{t-1} \sim p_\theta(x_{t-1})}[\phi(x_{t-1})] - \mathbb{E}_{x_{t-1} \sim q(x_{t-1})}[\phi(x_{t-1})] \,\big\|_{\mathcal{H}}^2, \tag{12}$$

which also measures the gap between $p_\theta(x_{t-1})$ and $q(x_{t-1})$. Here $\mathcal{H}$ denotes a reproducing kernel Hilbert space and $\phi: \mathbb{R}^K \to \mathcal{H}$ is a feature map.
Following common practices (Dziugaite et al., 2015), we adopt an unbiased estimate of the error $\mathcal{D}^{\mathrm{cumu}}_t$:

$$\mathcal{D}^{\mathrm{cumu}}_t \approx \frac{1}{N^2} \sum_{1 \le i, j \le N} \mathcal{K}\big(x^{\mathrm{back},i}_{t-1}, x^{\mathrm{back},j}_{t-1}\big) + \frac{1}{M^2} \sum_{1 \le i, j \le M} \mathcal{K}\big(x^{\mathrm{forw},i}_{t-1}, x^{\mathrm{forw},j}_{t-1}\big) - \frac{2}{NM} \sum_{1 \le i \le N,\, 1 \le j \le M} \mathcal{K}\big(x^{\mathrm{back},i}_{t-1}, x^{\mathrm{forw},j}_{t-1}\big), \tag{13}$$

where $x^{\mathrm{back},i}_{t-1} \sim p_\theta(x_{t-1})$, $1 \le i \le N$, is sampled with Eq. (2), $x^{\mathrm{forw},j}_{t-1} \sim q(x_{t-1})$, $1 \le j \le M$, is sampled with Eq. (1), and $\mathcal{K}(\cdot, \cdot)$ is the kernel function used to compute the inner products among $\{\phi(x^{\mathrm{back},1}_{t-1}), \phi(x^{\mathrm{back},2}_{t-1}), \ldots, \phi(x^{\mathrm{back},N}_{t-1}), \phi(x^{\mathrm{forw},1}_{t-1}), \phi(x^{\mathrm{forw},2}_{t-1}), \ldots, \phi(x^{\mathrm{forw},M}_{t-1})\}$. Importantly, we show in the following that the cumulative error $\mathcal{E}^{\mathrm{cumu}}_t$ is tightly bounded by the MMD error $\mathcal{D}^{\mathrm{cumu}}_t$ from below and above.

Proposition 4.1 (Bounds of the Cumulative Error). Suppose that $\mathcal{H} \subseteq C_\infty(\mathbb{R}^K)$ and that the two continuous marginal distributions $p_\theta(x_{t-1})$, $q(x_{t-1})$ are non-zero everywhere; then the cumulative error $\mathcal{E}^{\mathrm{cumu}}_t$ is linearly bounded by the MMD error $\mathcal{D}^{\mathrm{cumu}}_t$ from below and above:

$$\frac{1}{4}\, \mathcal{D}^{\mathrm{cumu}}_t \le \mathcal{E}^{\mathrm{cumu}}_t \le \mathcal{D}^{\mathrm{cumu}}_t. \tag{14}$$

Here $C_\infty(\mathbb{R}^K)$ is the set of all continuous functions (with finite uniform norms) over $\mathbb{R}^K$.

Proof. A complete proof of this proposition is provided in Appendix E.

Figure 2: Uptrend dynamics of the MMD error $\mathcal{D}^{\mathrm{cumu}}_t$ w.r.t. decreasing iteration $t$ (rows: CIFAR-10, ImageNet; columns: Gaussian, Laplace, and Logistic kernels; x-axis: denoising iteration). The cumulative error $\mathcal{E}^{\mathrm{cumu}}_t$ might show similar behaviors since it is tightly bounded by the MMD error.

Experiment setup and results. We train standard diffusion models (Ho et al., 2020) on two datasets: CIFAR-10 (32×32) (Krizhevsky et al., 2009) and ImageNet (32×32) (Deng et al., 2009). Three different kernel functions (e.g., the Laplace kernel) are adopted to compute the error $\mathcal{D}^{\mathrm{cumu}}_t$ in terms of Eq. (13). Both $N$ and $M$ are set to 1000. As shown in Fig. 2, the MMD error $\mathcal{D}^{\mathrm{cumu}}_t$ rapidly grows with respect to decreasing iteration $t$ in all settings, implying that the cumulative error $\mathcal{E}^{\mathrm{cumu}}_t$ has similar trends. The results not only empirically verify that diffusion models are affected by error propagation, but also indicate that our theory (though built under some assumptions) applies to reality.

According to Remark 3.1, the cumulative error in fact reflects the generation quality of DMs. Therefore, we might improve the performances of DMs by reducing error propagation. In light of this finding, we first introduce a basic version of our approach to reduce error propagation, which covers our core idea but is not practical in terms of running time. Then, we extend this approach to practical use through bootstrapping.
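Both the dynamics in Fig. 2 and the regularization term introduced next rely on evaluating Eq. (13) from two batches of samples. The following sketch (not the paper's released code; the Gaussian-kernel bandwidth is an assumed choice) computes the estimate for flattened samples drawn from $p_\theta(x_{t-1})$ and $q(x_{t-1})$.

```python
import torch

def gaussian_kernel(a, b, bandwidth=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)); one of several kernels usable in Eq. (13)."""
    sq_dists = torch.cdist(a, b, p=2.0) ** 2
    return torch.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_estimate(x_back, x_forw, kernel=gaussian_kernel):
    """Estimate D^cumu_t via Eq. (13).
    x_back: [N, K] samples of x_{t-1} produced by the backward process (Eq. (2)).
    x_forw: [M, K] samples of x_{t-1} produced by the forward process (Eq. (1))."""
    n, m = x_back.shape[0], x_forw.shape[0]
    k_bb = kernel(x_back, x_back).sum() / n ** 2
    k_ff = kernel(x_forw, x_forw).sum() / m ** 2
    k_bf = kernel(x_back, x_forw).sum() * 2.0 / (n * m)
    return k_bb + k_ff - k_bf
```

In the experiments above, $N = M = 1000$, so the kernel sums are cheap compared with producing the backward samples themselves.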
5.1 REGULARIZATION WITH CUMULATIVE ERRORS

An obvious way to reduce error propagation is to treat the cumulative error $\mathcal{E}^{\mathrm{cumu}}_t$ as a regularization term when optimizing DMs. Since this term is computationally intractable, as mentioned in Sec. 4, we adopt its upper bound $\mathcal{D}^{\mathrm{cumu}}_t$ (based on Proposition 4.1):

$$\mathcal{L}^{\mathrm{reg}}_t = \mathcal{D}^{\mathrm{cumu}}_t, \qquad \mathcal{L}^{\mathrm{reg}} = \sum_{t=0}^{T-1} w_t\, \mathcal{L}^{\mathrm{reg}}_t, \tag{15}$$

where $w_t \propto \exp(\rho\,(T - t))$, $\rho \in \mathbb{R}$, and $\sum_{t=0}^{T-1} w_t = 1$. We exponentially schedule the weight $w_t$ because Fig. 2 suggests that error propagation is more severe as $t$ gets closer to 0. The term $\mathcal{L}^{\mathrm{reg}}_t$ is estimated by Eq. (13) in practice, with the backward variable $x^{\mathrm{back},i}_t$ denoised from Gaussian noise via Eq. (2) and the forward variable $x^{\mathrm{forw},j}_t$ converted from a real sample with Eq. (1). The new objective $\mathcal{L}^{\mathrm{reg}}$ is added to the main one $\mathcal{L}^{\mathrm{nll}}$ (previously defined in Eq. (5)) to regularize the optimization of diffusion models. We linearly combine them as $\mathcal{L} = \lambda_{\mathrm{nll}}\, \mathcal{L}^{\mathrm{nll}} + \lambda_{\mathrm{reg}}\, \mathcal{L}^{\mathrm{reg}}$, where $\lambda_{\mathrm{nll}}, \lambda_{\mathrm{reg}} \in (0, 1)$ and $\lambda_{\mathrm{nll}} + \lambda_{\mathrm{reg}} = 1$, for joint optimization.

5.2 BOOTSTRAPPING FOR EFFICIENCY

A challenge of applying our approach in practice is the inefficient backward process. Specifically, to sample the backward variable $x^{\mathrm{back},i}_t$ for estimating the loss $\mathcal{L}^{\mathrm{reg}}_t$, we have to iteratively apply a sequence of $T - t$ modules to turn a Gaussian noise into that variable. Inspired by temporal difference (TD) learning (Tesauro et al., 1995), we bootstrap the computation of the variable $x^{\mathrm{back},i}_t$ for run-time efficiency. To this end, we first sample a time point $s > t$ from the uniform distribution $\mathcal{U}\{t+1, \min(t+L, T)\}$ and apply the forward process to estimate the variable $x^{\mathrm{back},i}_s$ in a possibly biased manner:

$$\tilde{x}^{\mathrm{back},i}_s = \sqrt{\bar\alpha_s}\, x_0 + \sqrt{1 - \bar\alpha_s}\, \epsilon, \qquad x_0 \sim q(x_0),\ \epsilon \sim \mathcal{N}(0, I), \tag{16}$$

where the operation $x_0 \sim q(x_0)$ means sampling from the training data and $L \ll T$ is a predefined sampling length. The above equation is based on the fact (Ho et al., 2020) that $q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar\alpha_t}\, x_0, (1 - \bar\alpha_t) I)$. Then, we apply a chain of denoising modules $[p_\theta(x_{s-1} \mid x_s), p_\theta(x_{s-2} \mid x_{s-1}), \ldots, p_\theta(x_t \mid x_{t+1})]$ to iteratively update $\tilde{x}^{\mathrm{back},i}_s$ into $\tilde{x}^{\mathrm{back},i}_t$, which can be treated as an alternative to the variable $x^{\mathrm{back},i}_t$. Each update can be formulated as

$$\tilde{x}^{\mathrm{back},i}_{k-1} = \frac{1}{\sqrt{\alpha_k}}\Big(\tilde{x}^{\mathrm{back},i}_k - \frac{\beta_k}{\sqrt{1 - \bar\alpha_k}}\, \epsilon_\theta\big(\tilde{x}^{\mathrm{back},i}_k, k\big)\Big) + \sigma_k \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \tag{17}$$

where $t + 1 \le k \le s$. Finally, we use the alternative $\tilde{x}^{\mathrm{back},i}_t$ to compute the loss $\mathcal{L}^{\mathrm{reg}}_t$ via Eq. (13), with at most $L \ll T$ runs of the neural network $\epsilon_\theta$. In specifying the sampling length $L$, there is a trade-off between estimation accuracy and time efficiency: for example, a large $L$ reduces the negative impact of the biased initialization $\tilde{x}^{\mathrm{back},i}_s$ but incurs a high time cost. We study this trade-off in the experiments and give the details of our training procedure in Appendix F. The source code of this work is publicly available at a personal repository, https://github.com/louisli321/epdm, and at our lab repository, https://github.com/vanderschaarlab/epdm.
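As a concrete illustration of Eqs. (16)-(17), here is a minimal sketch (not the released implementation) of how the bootstrapped backward variable can be produced; `eps_theta`, `betas`, `alphas`, and `alpha_bars` are as in the earlier sketch, and $\sigma_k^2 = \beta_k$ is the assumed DDPM choice.

```python
import torch

def bootstrap_backward_sample(eps_theta, x0, t, L, betas, alphas, alpha_bars):
    """Warm-start at a step s in (t, t+L] via the forward process (Eq. (16)),
    then run at most L reverse steps (Eq. (17)) to obtain an alternative to x^back_t.
    Steps t, s, k follow the paper's 1-based convention; the schedule tensors are 0-based."""
    T = betas.shape[0]
    s = torch.randint(t + 1, min(t + L, T) + 1, ()).item()   # s ~ U{t+1, ..., min(t+L, T)}

    # Eq. (16): biased initialization from a real sample x0 ~ q(x0).
    eps = torch.randn_like(x0)
    x = alpha_bars[s - 1].sqrt() * x0 + (1.0 - alpha_bars[s - 1]).sqrt() * eps

    # Eq. (17): iteratively denoise from step s down to step t.
    for k in range(s, t, -1):
        beta, alpha, ab = betas[k - 1], alphas[k - 1], alpha_bars[k - 1]
        mean = (x - beta / (1.0 - ab).sqrt() * eps_theta(x, k)) / alpha.sqrt()
        x = mean + beta.sqrt() * torch.randn_like(x)          # sigma_k^2 = beta_k (assumed)
    return x                                                   # plays the role of x~^back_t
```

Feeding several such samples, together with forward samples of the same step from Eq. (1), into the MMD estimator above yields the regularization term of Eq. (15); the full training loop is given as Algorithm 1 in Appendix F.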
6 EXPERIMENTS

We have performed extensive experiments to verify the effectiveness of our proposed regularization. Besides special studies (e.g., the effect of certain hyper-parameters), our main experiments include: 1) by comparison with Fig. 2, we show that error propagation is reduced after applying our regularization; 2) on three image datasets (CIFAR-10, CelebA, and ImageNet), our approach notably improves diffusion models and outperforms the baselines.

Figure 3: Re-estimated dynamics of the MMD error $\mathcal{D}^{\mathrm{cumu}}_t$ with respect to decreasing iteration $t$ after applying our proposed regularization (rows: CIFAR-10, ImageNet; columns: Gaussian, Laplace, and Logistic kernels). These dynamics should be compared with those in Fig. 2, showing that we have well handled error propagation.

| Approach | CIFAR-10 | ImageNet | CelebA |
| --- | --- | --- | --- |
| ADM-IP (Ning et al., 2023) | 3.25 | 2.72 | 1.31 |
| DDPM (Ho et al., 2020) | 3.61 | 3.62 | 1.73 |
| DDPM w/ Consistent DM (Daras et al., 2023) | 3.31 | 3.16 | 1.38 |
| DDPM w/ FP-Diffusion (Lai et al., 2022) | 3.47 | 3.28 | 1.56 |
| DDPM w/ Our Proposed Regularization | 2.93 | 2.55 | 1.22 |

Table 1: FID scores of our model and the baselines on different image datasets. The improvements of our approach over the baselines are statistically significant with p < 0.01 under a t-test.

6.1 SETTINGS

We conduct experiments on three image datasets: CIFAR-10 (Krizhevsky et al., 2009), ImageNet (Deng et al., 2009), and CelebA (Liu et al., 2015), with image sizes of 32×32, 32×32, and 64×64, respectively. Following previous works (Ho et al., 2020; Dhariwal & Nichol, 2021), we adopt the Frechet Inception Distance (FID) (Heusel et al., 2017) as the evaluation metric. The configuration of our model follows common practices: we adopt U-Net (Ronneberger et al., 2015) as the backbone and set the hyper-parameters $T, \sigma_t^2, L, \lambda_{\mathrm{reg}}, \lambda_{\mathrm{nll}}, \rho$ to $1000, \beta_t, 5, 0.2, 0.8, 0.003$, respectively. All our models run on 2-4 Tesla V100 GPUs and are trained within two weeks.

Baselines. We compare our method with three baselines: ADM-IP (Ning et al., 2023), Consistent DM (Daras et al., 2023), and FP-Diffusion (Lai et al., 2022). The first falls under the topic of exposure bias (which is related to our method), while the other two regularize DMs in terms of their inherent properties (e.g., the Fokker-Planck equation) and are less related to our method. The main idea of ADM-IP is to add an extra Gaussian noise to the model input, which is not an ideal solution because this simple perturbation cannot simulate the complex errors that arise at test time. Table 1 shows that our method significantly outperforms ADM-IP even though ADM (Dhariwal & Nichol, 2021) is a stronger backbone than DDPM.

6.2 REDUCED ERROR PROPAGATION

To show the effect of our regularization $\mathcal{L}^{\mathrm{reg}}$ on diffusion models, we re-estimate the dynamics of the MMD error $\mathcal{D}^{\mathrm{cumu}}_t$, with the same setup of datasets and kernel functions as in our empirical study in Sec. 4. The only difference is that the input distribution to module $p_\theta(x_{t-1} \mid x_t)$ at training time is no longer $q(x_t)$, but we can still correctly estimate the error $\mathcal{D}^{\mathrm{cumu}}_t$ with Eq. (13) by re-defining $x^{\mathrm{forw},j}_t$ as the training-time inputs to the module. Fig. 3 shows the results. Compared with the uptrend dynamics of vanilla diffusion models in Fig. 2, the new curves look like slightly fluctuating horizontal lines, indicating that: 1) every denoising module $p_\theta(x_{t-1} \mid x_t)$ has become robust to inaccurate inputs, since the error measure $\mathcal{D}^{\mathrm{cumu}}_t$ is no longer correlated with its location $t$ in the diffusion model; 2) with our regularization $\mathcal{L}^{\mathrm{reg}}$, the diffusion model is no longer affected by error propagation.

Figure 4: A trade-off study of the hyper-parameter $L$ (i.e., the number of bootstrapping steps) on two datasets (CIFAR-10 and ImageNet; x-axis: bootstrapping steps, 2-7). The results show that, as $L$ gets larger, the model performance improves to a limited extent while the time cost (measured in seconds) grows without bound.

| Approach | CIFAR-10 (32×32) | CelebA (64×64) |
| --- | --- | --- |
| DDPM w/ Our Regularization | 2.93 | 1.22 |
| Our Model w/o Exponential Schedule $w_t$ | 3.12 | 1.35 |
| Our Model w/o Warm-start $\tilde{x}^{\mathrm{back},i}_s$ (L = 5) | 3.38 | 1.45 |
| Our Model w/o Warm-start $\tilde{x}^{\mathrm{back},i}_s$ (L = 7) | 3.15 | 1.27 |

Table 2: Ablation studies of the two techniques (i.e., the schedule of the loss weight $w_t$ and the warm-start initialization of the alternative representation $\tilde{x}^{\mathrm{back},i}_s$) used in our approach.
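For reference, the exponential schedule $w_t$ ablated in Table 2 is just a normalized exponential over the $T$ iterations; a small sketch with the values reported in Sec. 6.1 (T = 1000, ρ = 0.003) makes its shape concrete.

```python
import torch

def weight_schedule(T=1000, rho=0.003):
    """w_t ∝ exp(rho * (T - t)) for t = 0, ..., T-1, normalized to sum to 1 (Eq. (15))."""
    t = torch.arange(T)
    w = torch.exp(rho * (T - t).float())
    return w / w.sum()

w = weight_schedule()
print(w[0] / w[-1])   # weight at t = 0 is roughly 20x the weight at t = T-1
```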
6.3 PERFORMANCES OF IMAGE GENERATION

The FID scores of our model and the baselines on the three datasets are reported in Table 1. The results in the first row are copied from Ning et al. (2023), and the others are obtained from our experiments. We draw two conclusions from the table: 1) Our regularization notably improves the performance of DDPM, a mainstream architecture of diffusion models, with FID reductions of 18.83% on CIFAR-10, 29.55% on ImageNet, and 29.47% on CelebA. Therefore, handling error propagation improves the generation quality of diffusion models; 2) Our model significantly outperforms ADM-IP, the baseline that reduces exposure bias by input perturbation, on all datasets. For example, our FID score on CIFAR-10 is lower than its score by 9.84%. An important factor contributing to our much better performance is that, unlike the baseline, our approach does not impose any assumption on the distributional form of the propagated errors.

6.4 TRADE-OFF STUDY

The number of bootstrapping steps $L$ is a key hyper-parameter in our proposed regularization. Choosing its value involves a trade-off between effectiveness (i.e., the quality of the alternative backward variable $\tilde{x}^{\mathrm{back},i}_t$) and run-time efficiency. Fig. 4 shows how the FID scores (i.e., model performance) and the average time cost of one training step change with respect to different numbers of bootstrapping steps $L$. For both CIFAR-10 and ImageNet, we can see that, as the number of steps $L$ increases, the FID scores decrease and tend to converge, while the time costs grow without bound. In practice, we set $L$ to 5 for our model, since this configuration leads to good enough performance at a relatively low time cost.

6.5 ABLATION EXPERIMENTS

To make our regularization $\mathcal{L}^{\mathrm{reg}}$ work in practice, we exponentially schedule the weight $w_t$, $0 \le t < T$, and specially initialize the alternative backward variable $\tilde{x}^{\mathrm{back},i}_s$ via Eq. (16). Table 2 presents the ablation studies of these two techniques. For the weighted schedule, we can see from the second row that model performance is notably degraded by replacing it with the equal weight $w_t = 1/T$. For the warm-start initialization, the last two rows show two things: first, adopting a random initialization instead of Eq. (16) drastically degrades model performance, implying that a warm start is necessary; second, the impact of the initialization can be reduced by using a larger number of bootstrapping steps $L$.

ACKNOWLEDGMENTS

We thank the anonymous ICLR reviewers for their kind and constructive reviews. Yangming Li also thanks Accenture for their sponsorship and support.

REFERENCES

Sitan Chen, Sinho Chewi, Holden Lee, Yuanzhi Li, Jianfeng Lu, and Adil Salim. The probability flow ode is provably fast. arXiv preprint arXiv:2305.11798, 2023a.

Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In The Eleventh International Conference on Learning Representations, 2023b. URL https://openreview.net/forum?id=zyLVMgsZ0U_.

Paolo Crucitti, Vito Latora, and Massimo Marchiori. Model for cascading failures in complex networks. Physical Review E, 69(4):045104, 2004.

Giannis Daras, Yuval Dagan, Alexandros G Dimakis, and Constantinos Daskalakis. Consistent diffusion models: Mitigating sampling drift by learning to be consistent. arXiv preprint arXiv:2302.09057, 2023.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database.
In 2009 IEEE conference on computer vision and pattern recognition, pp. 248 255. Ieee, 2009. Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 8780 8794. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/ 49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf. Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. ar Xiv preprint ar Xiv:1505.03906, 2015. Xiuwen Fu, Pasquale Pace, Gianluca Aloi, Lin Yang, and Giancarlo Fortino. Topology optimization against cascading failures on wireless sensor networks using a memetic algorithm. Computer Networks, 177:107327, 2020. Alex Graves, Santiago Fern andez, Faustino Gomez, and J urgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pp. 369 376, 2006. Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Sch olkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723 773, 2012. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840 6851, 2020. Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=a-x FK8Ymz5J. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. John Lafferty, Andrew Mc Callum, and Fernando CN Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001. Published as a conference paper at ICLR 2024 Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, and Stefano Ermon. Regularizing score-based models with score fokker-planck equations. In Neur IPS 2022 Workshop on Score-Based Methods, 2022. URL https://openreview.net/forum?id= Wq W7t C32v8N. Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Naoki Murata, Yuki Mitsufuji, and Stefano Ermon. On the equivalence of consistency-type models: Consistency models, consistent diffusion models, and fokker-planck regularization. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023. URL https://openreview.net/forum?id= wjt Gs Scv AO. Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence of score-based generative modeling for general data distributions. In Neur IPS 2022 Workshop on Score-Based Methods, 2022. URL https://openreview.net/forum?id=Sg19A8mu8sv. Mingxiao Li, Tingyu Qu, Wei Sun, and Marie-Francine Moens. Alleviating exposure bias in diffusion models through sampling with shifted time steps. ar Xiv preprint ar Xiv:2305.15583, 2023. Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori Hashimoto. Diffusion LM improves controllable text generation. In Alice H. 
Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=3s9Ir Esj Lyk. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730 3738, 2015. Adilson E Motter and Ying-Cheng Lai. Cascade-based attacks on complex networks. Physical Review E, 66(6):065102, 2002. Mang Ning, Enver Sangineto, Angelo Porrello, Simone Calderara, and Rita Cucchiara. Input perturbation reduces exposure bias in diffusion models. ar Xiv preprint ar Xiv:2301.11706, 2023. Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped dqn. Advances in neural information processing systems, 29, 2016. Marc Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. ICLR, 2016. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj orn Ommer. Highresolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684 10695, 2022. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp. 234 241. Springer, 2015. Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021a. URL https://openreview.net/ forum?id=St1giar CHLP. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b. URL https://openreview.net/ forum?id=Px TIG12RRHS. Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. 2023. Gerald Tesauro et al. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58 68, 1995. Chong Xiao Wang and Wee Peng Tay. Practical bounds of kullback-leibler divergence using maximum mean discrepancy. ar Xiv preprint ar Xiv:2204.02031, 2022. Published as a conference paper at ICLR 2024 Wen Zhang, Yang Feng, Fandong Meng, Di You, and Qun Liu. Bridging the gap between training and inference for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4334 4343, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1426. URL https: //aclanthology.org/P19-1426. Yufeng Zhang, Wanwei Liu, Zhenbang Chen, Kenli Li, and Ji Wang. On the properties of kullbackleibler divergence between gaussians. ar Xiv preprint ar Xiv:2102.05485, 2021. Published as a conference paper at ICLR 2024 A IDEAL SCENARIO: ZERO CUMULATIVE ERRORS The error Ecumu t , t P r1, Ts defined by KL divergence can be expanded as Ecumu t DKLppθpxt 1q || qpxt 1qq ż xt 1 pθpxt 1q ln pθpxt 1q qpxt 1q dxt 1, (18) which achieves its minimum 0 exclusively at pθpxt 1q qpxt 1q. We show that Ecumu t 0, @t P r0, Ts holds in an ideal situation (i.e., diffusion models are perfectly optimized). Proposition A.1 (Zero Cumulative Errors). A diffusion model is perfectly optimized if pθpxt 1 | xtq qpxt 1 | xtq, @t P r1, Ts. 
In that case, we have Ecumu t 0, @t P r1, Ts. Proof. It suffices to show pθpxtq qpxtq, @t P r0, Ts for perfectly optimized diffusion models. We apply mathematical induction to prove this assertion. Initially, let s check whether this is true at t T. Based on a nice property of the forward process (Ho et al., 2020): qpxt | x0q Npxt; ?sαtx0, p1 sαtq Iq, we can see that term qpx T | x0q converges to a normal Gaussian Np0, Iq as T Ñ 8 because lim T Ñ8 ?sαT 0. Also note that pθpx T q ppx T q Npx T ; 0, Iq, we have x0 qpx T | x0qqpx0qdx0 Npx T ; 0, Iq ż x0 qpx0qdx0 pθpx T q 1. (19) Therefore, the initial case is proved. Now, we assume the conclusion holds for t. Considering the precondition pθpxt 1 | xtq qpxt 1 | xtq, we have xt pθpxt 1, xtqdxt ż xt pθpxt 1 | xtqpθpxtqdxt xt qpxt 1 | xtqqpxtqdxt ż xt qpxt 1, xtqdxt qpxt 1q , (20) which proves the case for t 1. Finally, the assertion pθpxtq qpxtq is true for every t P r0, Ts, and therefore the whole conclusion is proved. In our proof to this proposition, the use of that precondition T Ñ 8, is to let qpx T |x0q Ñ Npx T ; 0, Iq. Considering the fact that qpxt|x0q Npxt; ? αtx0, ? we can achieve the goal with a much weaker assumption: limtÑT αt Ñ 0. Notably, this new assumption is a standard configuration in current diffusion models (e.g., DDPM (Ho et al., 2020) and SGM (Song et al., 2021b)), which lets x T contain no information about x0. B DEFINITION OF THE MODULAR ERROR The KL divergence is not symmetric, so the modular error can have another definition as Emod t Ext qpxtqr DKLpqpxt 1 | xtq || pθpxt 1 | xtqqs. However, the error propagation happens to the learnable backward process pθ (instead of the predefined forward process q). Hence, it makes more sense to define the module error as Ext pθpxtq DKL pθpxt 1 | xtq || qpxt 1 | xtq ı ż pθpxtq ż pθpxt 1 | xtq ln pθpxt 1 | xtq qpxt 1 | xtq Ext pθpxtq Ext pθpxt 1|xtq ln pθpxt 1 | xtq qpxt 1 | xtq such that expectation integrals are operated on the distributions pθpxtq, pθpxt 1 | xtq of the backward process to average the distribution gap lnp q. For the reversed case, the expectation operations will be mistakenly applied to the distributions qpxtq, qpxt 1 | xtq of the forward process. Published as a conference paper at ICLR 2024 C PROOF TO THEOREM 3.1 The cumulative error Ecumu t can be decomposed into two terms: xt 1 pθpxt 1q ln pθpxt 1q qpxt 1q dxt 1 Hpθpxt 1q Ext 1r ln qpxt 1qs, (21) where the second last term is the entropy of distribution pθpxtq. We denote the last term as Tt 1. According to the law of total expectation, we have Tt 1 Ext pθpxtqr Ext 1 pθpxt 1|xtqr ln qpxt 1qss. (22) Note that qpxt 1qqpxt | xt 1q qpxt 1, xtq qpxtqqpxt 1 | xtq, we have Tt 1 Ext pθpxtq Ext 1 pθpxt 1|xtq ln qpxtq ln qpxt 1 | xtq qpxt | xt 1q Ext pθpxtq ln qpxtq Ext 1 pθpxt 1|xtq ln qpxt 1 | xtq qpxt | xt 1q Tt Ext pθpxtq Ext 1 pθpxt 1|xtq ln qpxt | xt 1q qpxt 1 | xtq Now, we focus on the iterated expectations Ext 1r Extr ss of the above equality. Based on the definition of KL divergence, it s obvious that Ext pθpxtq Ext 1 pθpxt 1|xtq ln pθpxt 1 | xtq qpxt 1 | xtq qpxt | xt 1q pθpxt 1 | xtq Ext pθpxtq Ext 1 pθpxt 1|xtq ln pθpxt 1 | xtq qpxt 1 | xtq ı Ext 1 ln qpxt | xt 1q pθpxt 1 | xtq Ext pθpxtqr DKLp qs Ext pθpxtq Ext 1 pθpxt 1|xtq ln qpxt | xt 1q pθpxt 1 | xtq We denote term Extr Ext 1 pθpxt 1|xtqr ss here as It and prove that it is a constant. 
Since the distribution qpxt | xt 1q is a predefined multivariate Gaussian, we have qpxt | xt 1q p2πβtq K 2 exp }xt ?1 βtxt 1}2 2 p2πβt{p1 βtqq K 2 exp }xt 1 pxt{?1 βtq}2 2 Npxt 1; xt{ a 1 βt, βt{p1 βtq Iq With this result and the definition of learnable backward probability pθpxt 1 | xtq, we can convert the constant It into the following form: It Ext Ext 1 ln Npxt 1; µθpxt, tq, σt Iq Npxt 1; xt{?1 βt, βt{p1 βtq Iq 2 lnp1 βtq. (26) Note that term Ext 1r s essentially represents the KL divergence between two Gaussian distributions, which is said to have a closed-form solution (Zhang et al., 2021): DKL N xt 1; µθpxt, tq, σt I || N xt 1; xt ?1 βt , βt 1 βt I K ln βt p1 βtqσt K p1 βtqσt xt ?1 βt µθp q 2 . Considering this result, the fact that variance σt is primarily set as βt in DDPM (Ho et al., 2020), and the definition of mean µθ, we can simplify term It as 2βt Ext pθpxtq xt ?1 βt µθpxt, tq 2ı βt βt 2Kp1 sαtq Ext pθpxtqr}ϵθpxt, tq}2s βt Published as a conference paper at ICLR 2024 Considering our assumption that neural network ϵθ behaves as a standard Gaussian irrespective of the input distribution, we can interpret term Ext pθpxtqr s as the total variance of all dimensions of a multivariate normal distribution. Therefore, its value is K ˆ 1, the sum of K unit variances. With this result, term It can be reduced as 1 1 sαt 1 βtsαt 1 sαt ą 0. (28) Combining this inequality with Eq. (23) and Eq. (24), we have Tt 1 Tt Ext pθpxtqr DKLppθpxt 1 | xtq || qpxt 1 | xtqqs It Tt Emod t It, (29) and its derivation Tt 1 ě Tt Emod t are both true. Considering the assumption that the entropy of distribution pθpxtq decreases during the backward process (i.e., Hpθpxt 1q ă Hpθpxtq, 1 ď t ď T), we have Ecumu t ě Hpθpxtq Tt Emod t Ecumu t 1 Emod t , (30) which is exactly the propagation equation. D WEAKENED ASSUMPTION TO THEOREM 3.1 Since neural network ϵθpxt, tq is designed to fit Gaussian noise ϵ in the loss function Lnll, , we previously supposed that its output distribution follows a standard Gaussian. In our proof to Theorem 3.1, that assumption is applied to indicate that Er|ϵθpxt, tq|2s K. (K is the vector dimension). However, considering Eq. (27) and Eq. (28) in the proof, our theorem still holds for Er|ϵθpxt, tq|2s ě K. To derive a new assumption, we first set a term rt Er|ϵ ϵθp? αtx0 ? 1 αtϵ, tq|2s, (31) which indicates the prediction error of neural network ϵθpxt, tq. The term will vanish to 0 if the neural network is fully accurate in backward denoising. According to the Triangle Inequality, we then have rt ě Er|ϵ|2 |ϵθp q|2s Er|ϵ|2s Er|ϵθp q|2s. (32) Since the second moment of Gaussian distribution Er|ϵ|2s is K, we further have p1{Kq Er|ϵθp? αtx0 ? 1 αtϵ, tq|2s ě 1 prt{Kq. (33) Finally, consider the fact (Ho et al., 2020) that xt can be reparameterized as ? αtx0 ?1 αtϵ, ϵ Np0, Iq and let the prediction error rt vanish, the above equation motivates us to make the following new assumption: p1{Kq Er|ϵθpxt, tq|2s ě 1, (34) which is much weaker than the previous assumption but still makes our Theorem 3.1 hold. E PROOF TO PROPOSITION 4.1 Theorem 3 of Wang & Tay (2022) indicates that the following equation is true: 4MMD2p C8pΩq, P, Qqq ď DKLp P || Qq ď lnp MMD2p CpΩ, Qq, P, Qq 1q, (35) if with the Assumption 1 of Wang & Tay (2022) and d P d Q is continuous on Ω. By setting Ω RK, Pp q ş pθpxt 1qdxt 1, Qp q ş qpxt 1qdxt 1 and supposing that pθpxt 1q, qpxt 1q are smooth and non-zero everywhere, the two conditions of Eq. (35) are both met. 
Furthermore, Wang & Tay (2022) stated that C8pΩq is a subset of CpΩ, Qq. Considering the definitions of the MMD and cumulative errors, we can turn Eq. (35) into the following equation: 4Dcumu t q ď Ecumu t ď lnp Dcumu t 1q. (36) Published as a conference paper at ICLR 2024 Since lnpx 1q ď x, lnp1 1 4, we can get $ & Ecumu t ď lnp Dcumu t 1q ď Dcumu t Ecumu t ě lnp1 1 4Dcumu t q ě 1 4Dcumu t , (37) which are exactly our expected conclusion. Strictly speaking, we also have to verify that Dcumu t ă 4 such that the inequality lnp1 1 4 is properly applied. Recall that the definition of Dcumu t as |Epθpxt 1qrϕpxt 1qs Eqpxt 1qrϕpxt 1qs|2, where |ϕ| supx |ϕpxq| ă 1. According to the Triangle Inequality, we have, Dcumu t ď p|Epθpxt 1qrϕpxt 1qs| |Eqpxt 1qrϕpxt 1qs|q2 (38) Since |x| is a convex function, we can apply Jensen s inequality to the above equation: Dcumu t ď p Epθpxt 1qr|ϕpxt 1q|s Eqpxt 1qr|ϕpxt 1q|sq2 ă p Epθpxt 1qr1s Eqpxt 1qr1sq2 p1 1q2 4, (39) showing that our claim holds. F TRAINING ALGORITHM Compared with common practices (Ho et al., 2020; Song et al., 2021a), we additionally regularize the optimization of DMs with cumulative errors. The details are in Algo. 1. Algorithm 1: Optimization with Our Proposed Regularization Input: Batch size B, number of backward iterations T, sample range L ! T. while the model is not converged do Sample a batch of samples from the training set S0 txi 0 | xi 0 qpxi 0q, 1 ď i ď Tu. Sample a time point for vanilla training t P Ut1, Tu. Estimate training loss Lnll t for every real sample xi 0 P S0. Sample a time point for regularization s P Utminpt L, Tq, t 1u. Sample a new batch of samples from the training set S1 0 txj 0 | xj 0 qpxj 0q, 1 ď j ď Tu. Sample forward variables Sforw s txforw,j s | xforw,j s qpxforw,j s | xj 0q, xj 0 P S1 0u at step s. Sample forward variables Sforw t txforw,i t | xforw,i t qpxforw,i t | xi 0q, xi 0 P S0u at step t. Sample alternative backward variables at time step s as Sback s trxback,i s | rxback,i s pθprxback,i s | xforw,i t q, xi 0 P Sforw t u. Estimate cumulative error Lreg s based on Sforw s and Sback s . Update the model parameter θ with gradient θpλnll Lnll t λregws Lreg s q. G RELATED WORK Exposure bias. The topic of our paper is closely related to a problem called exposure bias that occurs to some sequence models (Ranzato et al., 2016; Zhang et al., 2019), which means that a model only exposed to ground truth inputs might not be robust to errors during evaluation. (Ning et al., 2023; Li et al., 2023) studied this problem for diffusion models due to their sequential structure. However, they lack a solid explanation of why the models are not robust to exposure bias, which is very important because many sequence models (e.g., CRF) is free from this problem. Our analysis in Sec. 3 actually answers that question with empirical evidence and theoretical analysis. More importantly, we argue that ADM-IP (Ning et al., 2023), which adds an extra Gaussian noise into the model input, is not an ideal solution, since this simple perturbation certainly can not simulate the complex errors at test time. We have also treated their approach as a baseline and shown that our regularization significantly outperforms it in experiments (Section 6). Published as a conference paper at ICLR 2024 Consistency regularizations. Some papers (Lai et al., 2022; Daras et al., 2023; Song et al., 2023; Lai et al., 2023), which aim to improve the estimation of score functions. 
One big difference between these papers and our work is that they assert without proof that diffusion models are affected by error propagation, while we have empirically and theoretically verified whether this phenomenon happens. Notably, this assertion is not trivial because many sequential models (e.g., CRF and HMM) are free from error propagation. Besides, these papers proposed regularisation methods in light of some expected consistencies. While these methods are similar to our regularisation in name, they were actually to improve the score estimation at every time step (i.e., the prediction accuracy of every component in a sequential model), which differs much from our approach that aims to improve the robustness of score functions to input errors. The key to solving error propagation is to make score functions insensitive to the input cumulative errors (rather than pursuing higher accuracy) (Motter & Lai, 2002; Crucitti et al., 2004), because it s very hard to have perfect score functions with limited training data. Therefore, these methods are orthogonal to us in reducing the effect of error propagation and do not constitute ideal solutions to the problem. Convergence guarantees. Another group of seemingly related papers (Lee et al., 2022; Chen et al., 2023b;a) aim to derive convergence guarantees based on the assumption of bounded score estimation errors. Specifically speaking, these papers analysed how generated samples converge (in distribution) to real data with respect to increasing discretization levels. However, studies (Motter & Lai, 2002; Crucitti et al., 2004) on error propagation generally focus on analysing the error dynamics of a chain model over time (i.e., how input errors to the score functions of decreasing time steps evolve). Therefore, these papers are of a very different research theme from our paper. More notably, these papers either assumed bounded score estimation errors or just ignored them, which are actually not appropriate to analyse error propagation for diffusion models. There are two reasons. Firstly, the error dynamics shown in our paper (Fig. 2) have exponentially-like increasing trends, implying that the score functions close to time step 0 are of very poor estimation. Secondly, if all components of a chain model are very accurate (i.e., bounded estimation errors), then the effect of error propagation will be insignificant regardless of the values of amplification factor. Therefore, it s better to set the estimation errors as uncertain variables for studying error propagation