# gddim_generalized_denoising_diffusion_implicit_models__bfab55c2.pdf Published as a conference paper at ICLR 2023 GDDIM: GENERALIZED DENOISING DIFFUSION IMPLICIT MODELS Qinsheng Zhang Georgia Institute of Technology qzhang419@gatech.edu Molei Tao Georgia Institute of Technology mtao@gatech.edu Yongxin Chen Georgia Institute of Technology yongchen@gatech.edu Our goal is to extend the denoising diffusion implicit model (DDIM) to general diffusion models (DMs) besides isotropic diffusions. Instead of constructing a non-Markov noising process as in the original DDIM, we examine the mechanism of DDIM from a numerical perspective. We discover that the DDIM can be obtained by using some specific approximations of the score when solving the corresponding stochastic differential equation. We present an interpretation of the accelerating effects of DDIM that also explains the advantages of a deterministic sampling scheme over the stochastic one for fast sampling. Building on this insight, we extend DDIM to general DMs, coined generalized DDIM (g DDIM), with a small but delicate modification in parameterizing the score network. We validate g DDIM in two non-isotropic DMs: Blurring diffusion model (BDM) and Critically-damped Langevin diffusion model (CLD). We observe more than 20 times acceleration in BDM. In the CLD, a diffusion model by augmenting the diffusion process with velocity, our algorithm achieves an FID score of 2.26, on CIFAR10, with only 50 number of score function evaluations (NFEs) and an FID score of 2.86 with only 27 NFEs. Project page and code: https://github.com/qshzh/g DDIM. 1 INTRODUCTION Generative models based on diffusion models (DMs) have experienced rapid developments in the past few years and show competitive sample quality compared with generative adversarial networks (GANs) (Dhariwal & Nichol, 2021; Ramesh et al.; Rombach et al., 2021), competitive negative log likelihood compared with autoregressive models in various domains and tasks (Song et al., 2021; Kawar et al., 2021). Besides, DMs enjoy other merits such as stable and scalable training, and mode-collapsing resiliency (Song et al., 2021; Nichol & Dhariwal, 2021). However, slow and expensive sampling prevents DMs from further application in more complex and higher dimension tasks. Once trained, GANs only forward pass neural networks once to generate samples, but the vanilla sampling method of DMs needs 1000 or even 4000 steps (Nichol & Dhariwal, 2021; Ho et al., 2020; Song et al., 2020b) to pull noise back to the data distribution, which means thousands of neural networks forward evaluations. Therefore, the generation process of DMs is several orders of magnitude slower than GANs. How to speed up sampling of DMs has received significant attention. Building on the seminal work by Song et al. (2020b) on the connection between stochastic differential equations (SDEs) and diffusion models, a promising strategy based on probability flows (Song et al., 2020b) has been developed. The probability flows are ordinary differential equations (ODE) associated with DMs that share equivalent marginal with SDE. Simple plug-in of off-the-shelf ODE solvers can already achieve significant acceleration compared to SDEs-based methods (Song et al., 2020b). The arguably most popular sampling method is denoising diffusion implicit model (DDIM) (Song et al., 2020a), which includes both deterministic and stochastic samplers, and both show tremendous im- Published as a conference paper at ICLR 2023 provement in sampling quality compared with previous methods when only a small number of steps is used for the generation. Although significant improvements of the DDIM in sampling efficiency have been observed empirically, the understanding of the mechanism of the DDIM is still lacking. First, why does solving probability flow ODE provide much higher sample quality than solving SDEs, when the number of steps is small? Second, it is shown that stochastic DDIM reduces to marginal-equivalent SDE (Zhang & Chen, 2022), but its discretization scheme and mechanism of acceleration are still unclear. Finally, can we generalize DDIMs to other DMs and achieve similar or even better acceleration results? In this work, we conduct a comprehensive study to answer the above questions, so that we can generalize and improve DDIM. We start with an interesting observation that the DDIM can solve corresponding SDEs/ODE exactly without any discretization error in finite or even one step when the training dataset consists of only one data point. For deterministic DDIM, we find that the added noise in perturbed data along the diffusion is constant along an exact solution of probability flow ODE (see Prop 1). Besides, provided only one evaluation of log density gradient (a.k.a. score), we are already able to recover accurate score information for any datapoints, and this explains the acceleration of stochastic DDIM for SDEs (see Prop 3). Based on this observation, together with the manifold hypothesis, we present one possible interpretation to explain why the discretization scheme used in DDIMs is effective on realistic datasets (see Fig. 2). Equipped with this new interpretation, we extend DDIM to general DMs, which we coin generalized DDIM (g DDIM). With only a small but delicate change of the score model parameterization during sampling, g DDIM can accelerate DMs based on general diffusion processes. Specifically, we verify the sampling quality of g DDIM on Blurring diffusion models (BDM) (Hoogeboom & Salimans, 2022; Rissanen et al., 2022) and critically-damped Langevin diffusion (CLD) (Dockhorn et al., 2021) in terms of Fr echet inception distance (FID) (Heusel et al., 2017). To summarize, we have made the following contributions: 1) We provide an interpretation for the DDIM and unravel its mechanism. 2) The interpretation not only justifies the numerical discretization of DDIMs but also provides insights into why ODE-based samplers are preferred over SDEbased samplers when NFE is low. 3) We propose g DDIM, a generalized DDIM that can accelerate a large class of DMs deterministically and stochastically. 4) We show by extensive experiments that g DDIM can drastically improve sampling quality/efficiency almost for free. Specifically, when applied to CLD, g DDIM can achieve an FID score of 2.86 with only 27 steps and 2.26 with 50 steps. g DDIM has more than 20 times acceleration on BDM compared with the original samplers. The rest of this paper is organized as follows. In Sec. 2 we provide a brief inntroduction to diffusion models. In Sec. 3 we present an interpretation of the DDIM that explains its effectiveness in practice. Built on this interpretation, we generalize DDIM for general diffusion models in Sec. 4. 2 BACKGROUND In this section, we provide a brief introduction to diffusion models (DMs). Most DMs are built on two diffusion processes in continuous-time, one forward diffusion known as the noising process that drives any data distribution to a tractable distribution such as Gaussian by gradually adding noise to the data, and one backward diffusion known as the denoising process that sequentially removes noise from noised data to generate realistic samples. The continuous-time noising and denoising processes are modeled by stochastic differential equations (SDEs) (S arkk a & Solin, 2019). In particular, the forward diffusion is a linear SDE with state u(t) RD du = Ftudt + Gtdw, t [0, T] (1) where Ft, Gt RD D represent the linear drift coefficient and diffusion coefficient respectively, and w is a standard Wiener process. When the coefficients are piece-wise continuous, Eq. (1) admits a unique solution (Oksendal, 2013). Denote by pt(u) the distribution of the solutions {u(t)}0 t T (simulated trajectories) to Eq. (1) at time t, then p0 is determined by the data distribution and p T is a (approximate) Gaussian distribution. That is, the forward diffusion Eq. (1) starts as a data sample and ends as a Gaussian random variable. This can be achieved with properly chosen coefficients Ft, Gt. Thanks to linearity of Eq. (1), the transition probability pst(u(t)|u(s)) from u(s) to u(t) is a Gaussian distribution. For convenience, denote p0t(u(t)|u(0)) by N(µtu(0), Σt) where µt, Σt RD D. Published as a conference paper at ICLR 2023 Lt EI-Euler Rt EI-Euler Rt EI-Multistep Figure 1: Importance of Kt for score parameterization sθ(u, t) = K T t ϵθ(u, t) and acceleration of diffusion sampling with probability flow ODE. Trajectories of probability ODE for CLD (Dockhorn et al., 2021) at random pixel locations (Left). Pixel value and output of ϵθ in v channel with choice Kt = Lt (Dockhorn et al., 2021) along the trajectory (Mid). Output of ϵθ in x, v channels with our choice Rt (Right). The smooth network output along trajectories enables large stepsize and thus sampling acceleration. g DDIM based on the proper parameterization of Kt can accelerate more than 50 times compared with the naive Euler solver (Lower row). The backward process from u(T) to u(0) of Eq. (1) is the denoising process. It can be characterized by the backward SDE simulated in reverse-time direction (Song et al., 2020b; Anderson, 1982) du = [Ftudt Gt GT t log pt(u)]dt + Gtd w, (2) where w denotes a standard Wiener process running backward in time. Here log pt(u) is known as the score function. When Eq. (2) is initialized with u(T) p T , the distribution of the simulated trajectories coincides with that of the forward diffusion Eq. (1). Thus, u(0) of these trajectories are unbiased samples from p0; the backward diffusion Eq. (2) is an ideal generative model. In general, the score function log pt(u) is not accessible. In diffusion-based generative models, a time-dependent network sθ(u, t), known as the score network, is used to fit the score log pt(u). One effective approach to train sθ(u, t) is the denoising score matching (DSM) technique (Song et al., 2020b; Ho et al., 2020; Vincent, 2011) that seeks to minimize the DSM loss Et U[0,T ]Eu(0),u(t)|u(0)[ log p0t(u(t)|u(0)) sθ(u(t), t) 2 Λt], (3) where U[0, T] represents the uniform distribution over the interval [0, T]. The time-dependent weight Λt is chosen to balance the trade-off between sample fidelity and data likelihood of learned generative model (Song et al., 2021). It is discovered in Ho et al. (2020) that reparameterizing the score network by sθ(u, t) = K T t ϵθ(u, t) (4) with Kt KT t = Σt leads to better sampling quality. In this parameterization, the network tries to predict directly the noise added to perturb the original data. Invoking the expression N(µtu(0), Σt) of p0t(u(t)|u(0)), this parameterization results in the new DSM loss L(θ) = Et U[0,T ]Eu(0) p0,ϵ N(0,ID)[ ϵ ϵθ(µtu(0) + Ktϵ, t) 2 K 1 t Λt K T t ]. (5) Sampling: After the score network sθ is trained, one can generate samples via the backward SDE Eq. (2) with a learned score, or the marginal equivalent SDE/ODE (Song et al., 2020b; Zhang & Published as a conference paper at ICLR 2023 Chen, 2021; 2022) du = [Ftu 1 + λ2 2 Gt GT t sθ(u, t)]dt + λGtdw, (6) where λ 0 is a free parameter. Regardless of the value of λ, the exact solutions to Eq. (6) produce unbiased samples from p0(u) if sθ(u, t) = log pt(u) for all t, u. When λ = 1, Eq. (6) reduces to reverse-time diffusion in Eq. (2). When λ = 0, Eq. (6) is known as the probability flow ODE (Song et al., 2020b) du = [Ftu 1 2Gt GT t sθ(u, t)]dt. (7) Isotropic diffusion and DDIM: Most existing DMs are isotropic diffusions. A popular DM is Denoising diffusion probabilistic modeling (DDPM) (Ho et al., 2020). For a given data distribution pdata(x), DDPM has u = x Rd and sets p0(u) = pdata(x). Though originally proposed in the discrete-time setting, it can be viewed as a discretization of a continuous-time SDE with parameters dt Id, Gt := for a decreasing scalar function αt satisfying α0 = 1, αT = 0. Here Id represents the identity matrix of dimension d. For this SDE, Kt is always chosen to be 1 αt Id. The sampling scheme proposed in DDPM is inefficient; it requires hundreds or even thousands of steps, and thus number of score function evaluations (NFEs), to generate realistic samples. A more efficient alternative is the Denoising diffusion implicit modeling (DDIM) proposed in Song et al. (2020a). It proposes a different sampling scheme over a grid {ti} x(ti 1) = rαti 1 αti x(ti) + ( q 1 αti 1 σ2 ti p αti )ϵθ(x(ti), ti) + σtiϵ, (9) where {σti} are hyperparameters and ϵ N(0, Id). DDIM can generate reasonable samples within 50 NFEs. For the special case where σti = 0, it is recently discovered in Zhang & Chen (2022) that Eq. (9) coincides with the numerical solution to Eq. (7) using an advanced discretization scheme known as the exponential integrator (EI) that utilizes the semi-linear structure of Eq. (7). CLD and BDM: Dockhorn et al. (2021) propose critically-dampled Langevin diffusion (CLD), a DM based on an augmented diffusion with an auxiliary velocity term. More specifically, the state of the diffusion in CLD is of the form u(t) = [x(t), v(t)] R2d with velocity variable v(t) Rd. The CLD employs the forward diffusion Eq. (1) with coefficients Ft := 0 βM 1 Id, Gt := 0 0 0 ΓβM 1 Here Γ > 0, β > 0, M > 0 are hyperparameters. Compared with most other DMs such as DDPM that inject noise to the data state x directly, the CLD introduces noise to the data state x through the coupling between v and x as the noise only affects the velocity component v directly. Another interesting DM is Blurring diffusion model (BDM) (Hoogeboom & Salimans, 2022). It can be shown the forward process in BDM can be formulated as a SDE with (Detailed derivation in App. B) Ft := d log[V αt V T ] dσ2 t dt Ftσ2 t σ2 t Ft, (11) where V T denotes a Discrete Cosine Transform (DCT) and V denotes the Inverse DCT. Diagonal matrices αt, σt are determined by frequencies information and dissipation time. Though it is argued that inductive bias in CLD and BDM can benefit diffusion model (Dockhorn et al., 2021; Hoogeboom & Salimans, 2022), non-isotropic DMs are not easy to accelerate. Compared with DDPM, CLD introduces significant oscillation due to x-v coupling while only inefficient ancestral sampling algorithm supports BDM (Hoogeboom & Salimans, 2022). 3 REVISIT DDIM: GAP BETWEEN THE EXACT SOLUTION AND NUMERICAL SOLUTION The complexity of sampling from a DM is proportional to the NFEs used to numerically solve Eq. (6). To establish a sampling algorithm with a small NFEs, we ask the bold question: Published as a conference paper at ICLR 2023 Can we generate samples exactly from a DM with finite steps if the score function is precise? To gain some insights into this question, we start with the simplest scenario where the training dataset consists of only one data point x0. It turns out that accurate sampling from diffusion models on this toy example is not that easy, even if the exact score function is accessible. Most well-known numerical methods for Eq. (6), such as Runge Kutta (RK) for ODE, Euler-Maruyama (EM) for SDE, are accompanied by discretization error and cannot recover the single data point in the training set unless an infinite number of steps are used. Surprisingly, DDIMs can recover the single data point in this toy example in one step. Built on this example, we show how the DDIM can be obtained by solving the SDE/ODE Eq. (6) with proper approximations. The effectiveness of DDIM is then explained by justifying the usage of those approximations for general datasets at the end of this section. ODE sampling We consider the deterministic DDIM, that is, Eq. (9) with σti = 0. In view of Eq. (8), the score network Eq. (4) is sθ(u, t) = ϵθ(u,t) 1 αt . To differentiate between the learned score and the real score, denote the ground truth version of ϵθ by ϵGT. In our toy example, the following property holds for ϵGT. Proposition 1. Assume p0(u) is a Dirac distribution. Let u(t) be an arbitrary solution to the probability flow ODE Eq. (7) with coefficient Eq. (8) and the ground truth score, then ϵGT(u(t), t) = 1 αt log pt(u(t)) remains constant, which is log p T (u(T)), along u(t). We remark that even though ϵGT(u(t), t) remains constant along an exact solution, the score log pt(u(t)) is time-varying. This underscores the advantage of the parameterization ϵθ over sθ. Inspired by Prop 1, we devise a sampling algorithm as follows that can recover the exact data point in one step for our toy example. This algorithm turns out to coincide with the deterministic DDIM. Proposition 2. With the parameterization sθ(u, τ) = ϵθ(u,τ) 1 ατ and the approximation ϵθ(u, τ) ϵθ(u(t), t) for τ [t t, t], the solution to the probability flow ODE Eq. (7) with coefficient Eq. (8) is u(t t) = rαt t αt u(t) + ( p αt )ϵθ(u(t), t), (12) which coincides with deterministic DDIM. When ϵθ = ϵGT as is the case in our toy example, there is no approximation error in Prop 2 and Eq. (12) is precise. This implies that deterministic DDIM can recover the training data in one step in our example. The update Eq. (12) corresponds to a numerical method known as the exponential integrator to the probability flow ODE Eq. (7) with coefficient Eq. (8) and parameterization sθ(u, τ) = ϵθ(u,τ) 1 ατ . This strategy is used and developed recently in Zhang & Chen (2022). Prop 1 and toy experiments in Fig. 2 provide sights on why such a strategy should work. SDE sampling The above discussions however do not hold for stochastic cases where λ > 0 in Eq. (6) and σti > 0 in Eq. (9). Since the solutions to Eq. (6) from t = T to t = 0 are stochastic, neither log pt(u(t)) nor ϵGT(u(t), t) remains constant along sampled trajectories; both are affected by the stochastic noise. The denoising SDE Eq. (6) is more challenging compared with the probability ODE since it injects additional noise to u(t). The score information needs to remove not only noise presented in u(T) but also injected noise along the diffusion. In general, one evaluation of ϵθ(u, t) can only provide the information to remove noise in the current state u; it cannot predict the future injected noise. Can we do better? The answer is affirmative on our toy dataset. Given only one score evaluation, it turns out that score at any point can be recovered. Proposition 3. Assume SDE coefficients Eq. (8) and that p0(u) is a Dirac distribution. Given any evaluation of the score function log ps(u(s)), one can recover log pt(u) for any t, u as log pt(u) = 1 αs αs log ps(u(s)) 1 1 αt (u r αt αs u(s)). (13) The major difference between Prop 3 and Prop 1 is that Eq. (13) retains the dependence of the score over the state u. This dependence is important in canceling the injected noise in the denoising SDE Published as a conference paper at ICLR 2023 Figure 2: Manifold hypothesis and Dirac distribution assumption. We model an image dataset as a mixture of well-separated Dirac distribution and visualize the diffusion process on the left. Curves in red indicate high density area spanned by p0t(u(t)|u(0)) by different mode and region surrounded by them indicates the phase when pt(u) is dominated by one mode while region surrounded by blue one is for the mixing phase, and green region indicates fully mixed phase. On the right, sampling trajectories depict smoothness of ϵGT along ODE solutions, which justifies approximations used in DDIM and partially explains its empirical acceleration. Eq. (6). This approximation Eq. (13) turns out to lead to a numerical scheme for Eq. (6) that coincide with the stochastic DDIM. Theorem 1. Given the parameterization sθ(u, τ) = ϵθ(u,τ) 1 ατ and the approximation sθ(u, τ) 1 αt 1 ατ q αt sθ(u(t), t) 1 1 ατ (u q αt u(t)) for τ [t t, t], the exact solution u(t t) to Eq. (6) with coefficient Eq. (8) is u(t t) N( rαt t αt u(t) + rαt t 1 αt t σ2 t ϵθ(u(t), t), σ2 t Id) with σt = (1 αt t) 1 1 αt t λ2 , which is the same as the DDIM Eq. (9). Note that Thm 1 with λ = 0 agrees with Prop 2; both reproduce the deterministic DDIM but with different derivations. In summary, DDIMs can be derived by utilizing local approximations. Justification of Dirac approximation While Prop 1 and Prop 3 require the strong assumption that the data distribution is a Dirac, DDIMs in Prop 2 and Thm 1 work very effectively on realistic datasets, which may contain millions of datapoints (Nichol et al., 2021). Here we present one possible interpretation based on the manifold hypothesis (Roweis & Saul, 2000). It is believed that real-world data lie on a low-dimensional manifold (Tenenbaum et al., 2000) embedded in a high-dimensional space and the data points are well separated in high-dimensional data space. For example, realistic images are scattered in pixel space and the distance between every two images can be very large if measured in pixel difference even if they are similar perceptually. To model this property, we consider a dataset consisting of M datapoints {u(m)}M m=1. The exact score is log pt(u) = X m wm log p0t(u|u(m)), wm = p0t(u|u(m)) P m p0t(u|u(m)), (15) which can be interpreted as a weighted sum of M score functions associated with Dirac distributions. This is illustrated in Fig. 2. In the red color region where the weights {wm} are dominated by one specific data u(m ) and thus log pt(u) log p0t(u|u(m )). Moreover, in the green region Published as a conference paper at ICLR 2023 different modes have similar log p0t(u|u(m)) as all of them are close to Gaussian and can be approximated by any condition score of any mode. The {ϵGT(u(t), t)} trajectories in Fig. 2 validate our hypothesis as we have very smooth curves at the beginning and ending period. The phenomenon that score of realistic datasets can be locally approximated by the score of one datapoint partially justifies the Dirac distribution assumption in Prop 1 and 3 and the effectiveness of DDIMs. 4 GENERALIZE AND IMPROVE DDIM The DDIM is specifically designed for DDPMs. Can we generalize it to other DMs? With the insights in Prop 1 and 3, it turns out that with a carefully chosen Kτ, we can generalize DDIMs to any DMs with general drift and diffusion. We coin the resulted algorithm the Generalized DDIM (g DDIM). 4.1 DETERMINISTIC GDDIM WITH PROP 1 Toy dataset: Motivated by Prop 1, we ask whether there exists an ϵGT that remains constant along a solution to the probability flow ODE Eq. (7). We start with a special case with initial distribution p0(u) = N(u0, Σ0). It turns out that any solution to Eq. (7) is of the form u(t) = Ψ(t, 0)u0 + Rtϵ (16) with a constant ϵ and a time-varying parameterization coefficients Rt RD D that satisfies R0RT 0 = Σ0 and d Rt dt = (Ft + 1 2Gt GT t Σ 1 t )Rt. (17) Here Ψ(t, s) is the transition matrix associated with Fτ; it is the solution to Ψ(t,s) t = FtΨ(t, s), Ψ(s, s) = ID. Interestingly, Rt satisfies Rt RT t = Σt like Kt in Eq. (4). We remark Kt = 1 αt Id is a solution to Eq. (17) when the DM is specialized to DDPM. Based on Eq. (16) and Eq. (17), we extend Prop 1 to more general DMs. Proposition 4. Assume the data distribution p0(u) is N(u0, Σ0). Let u(t) be an arbitrary solution to the probability flow ODE Eq. (7) with the ground truth score, then ϵGT(u(t), t) := RT t log pt(u(t)) remains constant along u(t). Note that Prop 4 is slightly more general than Prop 1 in the sense that the initial distribution p0 is a Gaussian instead of a Dirac. Diffusion models with augmented states such as CLD use a Gaussian distribution on the velocity channel for each data point. Thus, when there is a single data point, the initial distribution is a Gaussian instead of a Dirac distribution. A direct consequence of Prop 4 is that we can conduct accurate sampling in one step in the toy example since we can recover the score along any simulated trajectory given its value at t = T, if Kt in Eq. (4) is set to be Rt. This choice Kt = Rt will make a huge difference in sampling quality as we will show later. The fact provides guidance to design an efficient sampling scheme for realistic data. Realistic dataset: As the accurate score is not available for realistic datasets, we need to use learned score sθ(u, t) for sampling. With our new parameterization ϵθ(u, t) = RT t sθ(u, t) and the approximation ϵθ(u, τ) = ϵθ(u(t), t) for τ [t t, t], we reach the update step for deterministic g DDIM by solving probability flow with approximator ϵθ(u, τ) exactly as u(t t) = Ψ(t t, t)u(t) + [ Z t t 1 2Ψ(t t, τ)GτGT τ R T τ dτ]ϵθ(u(t), t), (18) Multistep predictor-corrector for ODE: Inspired by Zhang & Chen (2022), we further boost the sampling efficiency of g DDIM by combining Eq. (18) with multistep methods (Hochbruck & Ostermann, 2010; Zhang & Chen, 2022; Liu et al., 2022). We derive multistep predictor-corrector methods to reduce the number of steps while retaining accuracy (Press et al., 2007; Sauer, 2005). Empirically, we found that using more NFEs in predictor leads to better performance when the total NFE is small. Thus, we only present multistep predictor for deterministic g DDIM. We include the proof and multistep corrector in App. B. For time discretization grid {ti}N i=0 where t0 = 0, t N = T, Published as a conference paper at ICLR 2023 the q-th step predictor from ti to ti 1 in term of ϵθ parameterization reads u(ti 1) = Ψ(ti 1, ti)u(ti) + j=0 [Cijϵθ(u(ti+j), ti+j)], (19a) Cij = Z ti 1 1 2Ψ(ti 1, τ)GτGT τ R T τ Y k =j [ τ ti+k ti+j ti+k ]dτ. (19b) We note that coefficients in Eqs. (18) and (19b) for general DMs can be calculated efficiently using standard numerical solvers if closed-form solutions are not available. 4.2 STOCHASTIC GDDIM WITH PROP 3 Following the same spirits, we generalize Prop 3 Proposition 5. Assume the data distribution p0(u) is N(u0, Σ0). Given any evaluation of the score function log ps(u(s)), one can recover log pt(u) for any t, u as log pt(u) = Σ 1 t Ψ(t, s)Σs log ps(u(s)) Σ 1 t [u Ψ(t, s)u(s)]. (20) Prop 5 is not surprising; in our example, the score has a closed form. Eq. (20) not only provides an accurate score estimation for our toy dataset, but also serves as a score approximator for realistic data. Realistic dataset: Based on Eq. (20), with the parameterization sθ(u, τ) = R T τ ϵθ(u, τ), we propose the following g DDIM approximator ϵθ(u, τ) for ϵθ(u, τ) ϵθ(u, τ) = R 1 τ Ψ(τ, s)Rsϵθ(u(s), s) + R 1 τ [u Ψ(τ, s)u(s)]. (21) Proposition 6. With the parameterization ϵθ(u, t) = RT t sθ(u, t) and the approximator ϵθ(u, τ) in Eq. (21), the solution to Eq. (6) satisfies u(t) N(Ψ(t, s)u(s) + [ˆΨ(t, s) Ψ(t, s)]Rsϵθ(u(s), s), Pst), (22) where ˆΨ(t, s) is the transition matrix associated with ˆFτ := Fτ + 1+λ2 2 GτGT τ Σ 1 τ and the covariance matrix Pst solves dτ = ˆFτPsτ + Psτ ˆF T τ + λ2GτGT τ , Pss = 0. (23) Our stochastic g DDIM then uses Eq. (22) for update. Though the stochastic g DDIM and the deterministic g DDIM look quite different from each other, there exists a connection between them. Proposition 7. Eq. (22) in stochastic g DDIM reduces to Eq. (18) in deterministic g DDIM when λ = 0. 5 EXPERIMENTS As g DDIM reduces to DDIM for VPSDE and DDIM proves very successful, we validate the generation and effectiveness of g DDIM on CLD and BDM. We design experiments to answer the following questions. How to verify Prop 4 and 5 empirically? Can g DDIM improve sampling efficiency compared with existing works? What differences do the choice of λ and Kt make? We conduct experiments with different DMs and sampling algorithms on CIFAR10 for quantitative comparison. We include more illustrative experiments on toy datasets, high dimensional image datasets, and more baseline comparison in App. C. Choice of Kt: A key of g DDIM is the special choice Kt = Rt which is obtained via solving Eq. (17). In CLD, Dockhorn et al. (2021) choose Kt = Lt based on Cholesky decomposition of Σt and it does not obey Eq. (17). More details regarding Lt are included in App. C. As it is shown in Fig. 1, on real datasets with a trained score model, we randomly pick pixel locations and check the pixel value and ϵθ output along the solutions to the probability flow ODE produced by the high-resolution ODE solver. With the choice Kt = Lt, ϵ(L) θ (u, t; v) suffers from oscillation like x Published as a conference paper at ICLR 2023 value along time. However, ϵ(R) θ (u, t) is much more flat. We further compare samples generated by Lt and Rt parameterizaiton in Tab. 1, where both use the multistep exponential solver in Eq. (19). Table 1: Lt vs Rt (Our) on CLD FID at different NFE Kt 20 30 40 50 Lt 368 167 4.12 3.31 Rt 3.90 2.64 2.37 2.26 Table 2: λ and integrators choice with NFE=50 FID at different λ Method 0.0 0.1 0.3 0.5 0.7 1.0 g DDIM 5.17 5.51 12.13 33 41 49 EM 346 168 137 89 45 57 Choice of λ: We further conduct a study with different λ values. Note that polynomial extrapolation in Eq. (19) is not used here even when λ = 0. As it is shown in Tab. 2, increasing λ deteriorates the sample quality, demonstrating our claim that deterministic DDIM has better performance than its stochastic counterpart when a small NFE is used. We also find stochastic g DDIM significantly outperforms EM, which indicates the effectiveness of the approximation Eq. (21). Accelerate various DMs: We present a comparison among various DMs and various sampling algorithms. To make a fair comparison, we compare three DMs with similar size networks while retaining other hyperparameters from their original works. We make two modifications to DDPM, including continuous-time training (Song et al., 2020b) and smaller stop sampling time (Karras et al., 2022), which help improve sampling quality empirically. For BDM, we note Hoogeboom & Salimans (2022) only supports the ancestral sampling algorithm, a variant of EM algorithm. With reformulated noising and denoising process as SDE Eq. (11), we can generate samples by solving corresponding SDE/ODEs. The sampling quality of g DDIM with 50 NFE can outperform the original ancestral sampler with 1000 NFE, more than 20 times acceleration. Table 3: Acceleration on various DMs with similar training pipelines and architecture. For RK45, we tune its tolerance hyperparameters so that the real NFE is close but not equal to the given NFE. : pre-trained model from Song et al. (2020b). : Karras et al. (2022) apply Heun method in rescaled DM, which is essentially a variant of DEIS (Zhang & Chen, 2022) FID ( ) under different NFE DM Sampler 10 20 50 100 1000 DDPM EM >100 >100 31.2 12.2 2.64 Prob.Flow, RK45 >100 52.5 6.62 2.63 2.56 2nd Heun 66.25 6.62 2.65 2.57 2.56 g DDIM 4.17 3.03 2.59 2.56 2.56 BDM Ancestral sampling >100 >100 29.8 9.73 2.51 Prob.Flow, RK45 >100 68.2 7.12 2.58 2.46 g DDIM 4.52 2.97 2.49 2.47 2.46 CLD EM >100 >100 57.72 13.21 2.39 Prob.Flow, RK45 >100 >100 31.7 4.56 2.25 g DDIM 13.41 3.39 2.26 2.26 2.25 6 CONCLUSIONS AND LIMITATIONS Contribution: The more structural knowledge we leverage, the more efficient algorithms we obtain. In this work, we provide a clean interpretation of DDIMs based on the manifold hypothesis and the sparsity property on realistic datasets. This new perspective unboxes the numerical discretization used in DDIM and explains the advantage of ODE-based sampler over SDE-based when NFE is small. Based on this interpretation, we extend DDIMs to general diffusion models. The new algorithm, g DDIM, only requires a tiny but elegant modification to the parameterization of the score model and improves sampling efficiency drastically. We conduct extensive experiments to validate the effectiveness of our new sampling algorithm. Limitation: There are several promising future directions. First, though g DDIM is designed for general DMs, we only verify it on three DMs. It is beneficial to explore more efficient diffusion processes for different datasets, in which we believe g DDIM will play an important role in designing sampling algorithms. Second, more investigations are needed to design an efficient sampling algorithm by exploiting more structural knowledge in DMs. The structural knowledge can originate from different sources such as different modalities of datasets, and mathematical structures presented in specific diffusion processes. Published as a conference paper at ICLR 2023 ACKNOWLEDGMENTS The authors would like to thank the anonymous reviewers for their useful comments. This work is partially supported by NSF ECCS-1942523, NSF CCF-2008513, and NSF DMS-1847802. Brian D.O. Anderson. Reverse-time diffusion equation models. Stochastic Process. Appl., 12(3): 313 326, May 1982. ISSN 0304-4149. doi: 10.1016/0304-4149(82)90051-5. URL https: //doi.org/10.1016/0304-4149(82)90051-5. Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-DPM: An Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models. 2022. URL http://arxiv. org/abs/2201.06503. Prafulla Dhariwal and Alex Nichol. Diffusion Models Beat GANs on Image Synthesis. 2021. URL http://arxiv.org/abs/2105.05233. Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Score-Based Generative Modeling with Critically Damped Langevin Diffusion. pp. 1 13, 2021. URL http://arxiv.org/abs/2112. 07068. Will Grathwohl, Ricky TQ Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. Ffjord: Free-form continuous dynamics for scalable reversible generative models. ar Xiv preprint ar Xiv:1810.01367, 2018. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 2020-Decem, 2020. ISBN 2006.11239v2. URL https://github.com/hojonathanho/diffusion. Marlis Hochbruck and Alexander Ostermann. Exponential integrators. Acta Numerica, 19:209 286, 2010. Emiel Hoogeboom and Tim Salimans. Blurring diffusion models. ar Xiv preprint ar Xiv:2209.05557, 2022. Alexia Jolicoeur-Martineau, Ke Li, R emi Pich e-Taillefer, Tal Kachman, and Ioannis Mitliagkas. Gotta go fast when generating data with score-based models. ar Xiv preprint ar Xiv:2105.14080, 2021a. Alexia Jolicoeur-Martineau, Ke Li, R{\ {e}}mi Pich{\ {e}}-Taillefer, Tal Kachman, and Ioannis Mitliagkas. Gotta Go Fast When Generating Data with Score-Based Models. May 2021b. doi: 10.48550/arxiv.2105.14080. URL http://arxiv.org/abs/2105.14080. Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusionbased generative models. ar Xiv preprint ar Xiv:2206.00364, 2022. Bahjat Kawar, Gregory Vaksman, and Michael Elad. Snips: Solving noisy inverse problems stochastically. Advances in Neural Information Processing Systems, 34, 2021. Zhifeng Kong and Wei Ping. On Fast Sampling of Diffusion Probabilistic Models. 2021a. URL http://arxiv.org/abs/2106.00132. Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. ar Xiv preprint ar Xiv:2106.00132, 2021b. Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo Numerical Methods for Diffusion Models on Manifolds. (2021):1 23, 2022. URL http://arxiv.org/abs/2202.09778. Published as a conference paper at ICLR 2023 Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. ar Xiv preprint ar Xiv:2206.00927, 2022. Eric Luhman and Troy Luhman. Knowledge distillation in iterative generative models for improved sampling speed. ar Xiv preprint ar Xiv:2101.02388, 2021. Siwei Lyu. Interpretation and generalization of score matching. ar Xiv preprint ar Xiv:1205.2629, 2012. Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. Ar Xiv, abs/2102.09672, 2021. Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mc Grew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. 2021. URL http://arxiv.org/abs/2112.10741. Bernt Oksendal. Stochastic differential equations: an introduction with applications. Springer Science & Business Media, 2013. William H Press, Saul A Teukolsky, William T Vetterling, and Brian P Flannery. Numerical recipes 3rd edition: The art of scientific computing. Cambridge university press, 2007. Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark {Chen Open AI}. Hierarchical Text-Conditional Image Generation with CLIP Latents. Severi Rissanen, Markus Heinonen, and Arno Solin. Generative modelling with inverse heat dissipation. ar Xiv preprint ar Xiv:2206.13397, 2022. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj{\ {o}}rn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. 2021. URL http://arxiv. org/abs/2112.10752. Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290 5500:2323 6, 2000. Tim Salimans and Jonathan Ho. Progressive Distillation for Fast Sampling of Diffusion Models. 2022. URL http://arxiv.org/abs/2202.00512. Simo S arkk a and Arno Solin. Applied stochastic differential equations, volume 10. Cambridge University Press, 2019. Timothy Sauer. Numerical analysis, 2005. Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256 2265. PMLR, 2015. Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models. 2020a. URL http://arxiv.org/abs/2010.02502. Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-Based Generative Modeling through Stochastic Differential Equations. 2020b. URL http://arxiv.org/abs/2011.13456. Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of scorebased diffusion models. Advances in Neural Information Processing Systems, 34, 2021. Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290 5500:2319 23, 2000. Published as a conference paper at ICLR 2023 Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. In Neural Information Processing Systems (Neur IPS), 2021. Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Comput., 23(7):1661 1674, July 2011. ISSN 0899-7667, 1530-888X. doi: 10.1162/neco\ a\ 00142. URL https://doi.org/10.1162/neco_a_00142. Daniel Watson, Jonathan Ho, Mohammad Norouzi, and William Chan. Learning to efficiently sample from diffusion probabilistic models. ar Xiv preprint ar Xiv:2106.03802, 2021. Daniel Watson, William Chan, Jonathan Ho, and Mohammad Norouzi. Learning Fast Samplers for Diffusion Models by Differentiating Through Sample Quality. 2022. URL http://arxiv. org/abs/2202.05830. P. Whalen, M. Brio, and J.V. Moloney. Exponential time-differencing with embedded Runge Kutta adaptive step control. J. Comput. Phys., 280:579 601, January 2015. ISSN 0021-9991. doi: 10.1016/j.jcp.2014.09.038. URL https://doi.org/10.1016/j.jcp.2014.09.038. Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion gans. ar Xiv preprint ar Xiv:2112.07804, 2021. Qinsheng Zhang and Yongxin Chen. Diffusion normalizing flow. Advances in Neural Information Processing Systems, 34, 2021. Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. ar Xiv preprint ar Xiv:2204.13902, 2022. Published as a conference paper at ICLR 2023 A MORE RELATED WORKS Learning generative models with DMs via score matching has received tremendous attention recently (Sohl-Dickstein et al., 2015; Lyu, 2012; Song & Ermon, 2019; Song et al., 2020b; Ho et al., 2020; Nichol & Dhariwal, 2021). However, the sampling efficiency of DMs is still not satisfying. Jolicoeur-Martineau et al. (2021a) introduced adaptive solvers for SDEs associated with DMs for the task of image generation. Song et al. (2020a) modified the forward noising process into a non-Markov process without changing the training objective function. The authors then proposed a family of samplers, including deterministic DDIM and stochastic DDIM, based on the modifications. Both of the samplers demonstrate significant improvements over previous samplers. There are variants of the DDIM that aim to further improve the sampling quality and efficiency. Deterministic DDIM in fact reduces to probability flow in the infinitesimal step size limit (Song et al., 2020a; Liu et al., 2022). Meanwhile, various approaches have been proposed to accelerate DDIM (Kong & Ping, 2021b; Watson et al., 2021; Liu et al., 2022). Bao et al. (2022) improved the DDIM by optimizing the reverse variance in DMs. Watson et al. (2022) generalized the DDIM in DDPM with learned update coefficients, which are trained by minimizing an external perceptual loss. Nichol & Dhariwal (2021) tuned the variance of the schedule of DDPM. Liu et al. (2022) found that the DDIM is a pseudo numerical method and proposed a pseudo linear multi-step method for it. Zhang & Chen (2022) discovered that DDIMs are numerical integrators for marginal-equivalent SDEs, and the deterministic DDIM is actually an exponential integrator for the probability flow ODE. They further utilized exponential multistep methods to boost sampling performance for VPSDE. Another promising approach to accelerate the diffusion model is distillation for the probability flow ODE. Luhman & Luhman (2021) proposed to learn the map from noise to data in a teacher-student fashion, where supervised signals are provided by simulating the deterministic DDIM. The final student network distilled is able to generate samples with reasonable quality within one step. Salimans & Ho (2022) proposed a new progressive distillation approach to improve training efficiency and stability. This distillation approach relies on solving the probability flow ODE and needs extra training procedures. Since we generalize and improve the DDIM in this work, it will be beneficial to combine this distillation method with our algorithm for better performance in the future. Recently, numerical structures of DMs have received more and more attention; they play important roles in efficient sampling methods. Dockhorn et al. (2021) designed Symmetric Splitting CLD Sampler (SSCS) that takes advantage of Hamiltonian structure of the CLD and demonstrated advantages over the naive Euler-Maruyama method. Zhang & Chen (2022) first utilized the semilinear structure presented in DMs and showed that the exponential integrator gave much better sampling quality than the Euler method. The proposed Diffusion Exponential Integrator Sampler (DEIS) further accelerates sampling by utilizing Multistep and Runge Kutta ODE solvers. Similar to DEIS, Lu et al. (2022); Karras et al. (2022) also proposed ODE solvers that utilize analytical forms of diffuson scheduling coefficients. Karras et al. (2022) further improved the stochastic sampling algorithm by augmenting ODE trajectories with small noise. Further tailoring these integrators to account for the stiff property of ODEs (Hochbruck & Ostermann, 2010; Whalen et al., 2015) is a promising direction in fast sampling for DMs. Since the proposed g DDIM is a generalization of DDIM, the results regarding g DDIM in Sec. 4 are generalizations of those in Sec. 3. In particular, Prop 4 generalizes Prop 1, Eq. (18) generalizes Prop 2, and Prop 5 generalizes Prop 3. Thus, for the sake of simplicity, we mainly present proofs for g DDIM in Sec. 4. B.1 BLURRING DIFFUSION MODELS We first review the formulations regarding BDM proposed in Hoogeboom & Salimans (2022) and show it can be reformulated as SDE in continuous time Eq. (11). Hoogeboom & Salimans (2022) introduce an forwarding noising scheme where noise corrupts data in frequency space with different schedules for the dimensions. Different from existing DMs, the Published as a conference paper at ICLR 2023 diffusion process is defined in frequency space: p(yt|y0) = N(yt|αty0, σt I) yt = V T xt (24) where α, σ are Rd d diagonal matries and control the different diffuse rate for data along different dimension. And yt is the mapping of xt in frequency domain obtained through Discrete Cosine Transform (DCT) V T . We note V is the inverse DCT mapping and V T V = I. Based on Eq. (24), we are able to derive its corresponding noising scheme in original data space p(xt|x0) = N(xt|V αt V T x0, σt I). (25) Eq. (25) indicates BDM is a non-isotropic diffusion process. Therefore, we are able to derive its forward process as a linear SDE. For a general linear SDE Eq. (1), its mean and covariance follow dt = Ftmt (26) dt = FtΣt + Σt Ft + Gt GT t . (27) Plugging Eq. (25) into Eq. (26), we are able to derive the drift and diffusion Ft, Gt for BDM as Eq. (11). As BDM admit a SDE formulation, we can use Eq. (3) to train BDM. For choice of hyperparameters αt, σt and pratical implementation, we include more details in App. C. B.2 DETERMINISTIC GDDIM B.2.1 PROOF OF EQ. (17) Since we assume data distribution p0(u) = N(u0, Σ0), the score has closed form log pt(u) = Σ 1 t (u Ψ(t, 0)u0). (28) To make sure our construction Eq. (16) is a solution to the probability flow ODE, we examine the condition for Rt. The LHS of the probability flow ODE is du = d[Ψ(t, 0)u0 + Rtϵ] = Ψ(t, 0)u0dt + Rtϵdt = [FtΨ(t, 0)u0 + Rtϵ]dt. (29) The RHS of the probability flow ODE is 2Gt GT t log pt(u)]dt = [FtΨ(t, 0)u0 + Ft Rtϵ + 1 2Gt GT t R T t ϵ]dt = [FtΨ(t, 0)u0 + Ft Rtϵ + 1 2Gt GT t R T t R 1 t Rtϵ]dt, (30) where the first equality is due to log pt(u) = R T t ϵ. Since Eqs. (29) and (30) holds for each ϵ, we establish Rt = (Ft + 1 2Gt GT t R T t R 1 t )Rt (31) 2Gt GT t Σ 1 t )Rt. (32) B.2.2 PROOF OF PROP 4 Similar to the proof of Eq. (17), over a solution {u(t), t} to the probability flow ODE, R 1 t (u(t) Ψ(t, 0)u0) is constant. Furthermore, by Eq. (28), log pt(u(t)) = Σ 1 t (u(t) Ψ(t, 0)u0) = R T t R 1 t (u(t) Ψ(t, 0)u0) (33) Eq. (33) implies that RT t log pt(u(t)) is a constant and invariant with respect to t. Published as a conference paper at ICLR 2023 B.2.3 PROOF OF EQS. (12) AND (18) We derive Eq. (18) first. The update step is based on the approximation ϵθ(u, τ) = ϵθ(u(t), t) for τ [t t, t]. The resultant ODE with ϵθ reads u = Fτu + 1 2GτGT τ R 1 τ ϵθ(u(t), t), (34) which is a linear ODE. The closed-form solution reads u(ti 1) = Ψ(ti 1, ti)u(t) + [ Z ti 1 1 2Ψ(ti 1, τ)GτGT τ R T τ dτ]ϵθ(u(t), t), (35) where Ψ(t, s) is the transition matrix associated Ft, that is, Ψ satisfies dt = FtΨ(t, s) Ψ(s, s) = ID. (36) When the DM is specified to be DDPM, we derive Eq. (12) based on Eq. (18) by expanding the coefficients in Eq. (18) explicitly as Ψ(t, s) = r αt Ψ(t t, t) = rαt t 1 2Ψ(t t, τ)GτGT τ R 1 τ dτ = Z t t dτ 1 1 ατ dτ B.2.4 PROOF OF MULTISTEP PREDICTOR-CORRECTOR Our Multistep Predictor-Corrector method slightly extends the traditional linear multistep Predictor Corrector method to incorporate the semilinear structure in the probability flow ODE with an exponential integrator (Press et al., 2007; Hochbruck & Ostermann, 2010). For Eq. (7), the key insight of the multistep predictor is to use existing function evaluations ϵθ(u(ti), ti), ϵθ(u(ti+1), ti+1), , ϵθ(u(ti+q 1), ti+q 1) and their timestamps ti, ti+1, , ti+q 1 to fit a q 1 order polynomial ϵp(t) to approximate ϵθ(u(τ), τ). With this approximator ϵθ(u, τ) = ϵp(τ) for τ [ti 1, ti], the multistep predictor step is obtained by solving dt = Ftu + 1 2GτGT τ R T τ ϵθ(u, τ) 2GτGT τ R T τ ϵp(τ), (37) which is a linear ODE. The solution to Eq. (37) satisfies u(ti 1) = Ψ(ti 1, ti)u(ti) + Z ti 1 1 2GτGT τ R T τ ϵp(τ)dτ. (38) Based on Lagrange formula, we can write ϵp(τ) as τ ti+k ti+j ti+k ]ϵθ(uti+j, ti+j). (39) Published as a conference paper at ICLR 2023 Plugging Eq. (39) into Eq. (38), we obtain u(ti 1) = Ψ(ti 1, ti)u(ti) + j=0 [p C(q) ij ϵθ(u(ti+j), ti+j)], (40) p C(q) ij = Z ti 1 1 2Ψ(ti 1, τ)GτGT τ R T τ k =j,k=0 [ τ ti+k ti+j ti+k ]dτ, (41) which are Eqs. (19a) and (19b). Here we use p C(q) ij to emphasize these are constants used in the q-step predictor. The 1-step predictor reduces to Eq. (18). Compared with the explicit scheme for the multistep predictor, the multistep corrector behaves like an implicit method (Press et al., 2007). Instead of constructing ϵp(τ) to extrapolate model output for τ [ti 1, ti] as in the predictor, the q step corrector aims to find ϵc(τ) to interpolate ϵθ(u(ti 1), ti 1), ϵθ(u(ti), ti), ϵθ(u(ti+1), ti+1), , ϵθ(u(ti+q 2), ti+q 2) and their timestamps ti 1, ti, ti+1, , ti+q 2. Thus, u(ti 1) is obtained by solving u(ti 1) = Ψ(ti 1, ti)u(ti) + Z ti 1 1 2GτGT τ R T τ ϵc(τ)dτ. (42) Since ϵc(τ) is defined implicitly, it is not easy to find ϵc(τ), u(ti 1). Instead, practitioners bypass the difficulties by interpolating ϵθ( u(ti 1), ti 1), ϵθ( u(ti), ti), ϵθ( u(ti+1), ti+1), , ϵθ( u(ti+q 2), ti+q 2) where u(ti 1) is obtained by the predictor in Eq. (38) and u(ti) = u(ti), u(ti+1) = u(ti+1), , u(ti+q 2) = u(ti+q 2). Hence, we derive the update step for corrector based on u(ti 1) = Ψ(ti 1, ti)u(ti) + Z ti 1 1 2GτGT τ R T τ ϵc(τ)dτ, (43) where ϵc(τ) is defined as τ ti+k ti+j ti+k ]ϵθ( uti+j, ti+j). (44) Plugging Eq. (44) into Eq. (43), we reach the update step for the corrector u(ti 1) = Ψ(ti 1, ti)u(ti) + j= 1 [c C(q) ij ϵθ( u(ti+j), ti+j)], (45) c C(q) ij = Z ti 1 1 2Ψ(ti 1, τ)GτGT τ R T τ k =j,k= 1 [ τ ti+k ti+j ti+k ]dτ. (46) We use c C(q) ij to emphasis constants used in the q-step corrector. Exponential multistep Predictor-Corrector: Here we present the Exponential multistep Predictor-Corrector algorithm. Specifically, we employ one q-step corrector update step after an update step of the q-step predictor. The interested reader can easily extend the idea to employ multiple update steps of corrector or different number of steps for the predictor and the corrector. We note coefficients p C, c C can be calculated using high resolution ODE solver once and used everywhere. Published as a conference paper at ICLR 2023 Algorithm 1 Exponential multistep Predictor-Corrector Input: Timestamps {ti}N i=0, step order q, coefficients for predictor update p C, coefficients for corrector update c C Instantiate: u(t N) p T (u) for i in N, N 1, , 1 do # predictor update step qcur = min(q, N i + 1) # handle warming start, use lower order multistep method uti 1 Simulate Eq. (40) with qcur-step predictor # corrector update step qcur = min(q, N i + 2) # handle warming start, use lower order multistep method u(ti 1), u(ti), , u(ti+qcur 1) u(ti 1), u(ti), , u(ti+qcur 1) uti 1 Simulate Eq. (45) with qcur-step corrector end for B.3 STOCHASTIC GDDIM B.3.1 PROOF OF PROP 5 Assuming that the data distribution p0(u) is N(u0, Σ0) with a given Σ0, we can derive the mean and covariance of pt(u) as µt = Ψ(t, 0) (47) dΣt dt = FtΣt + Σt F T t + Gt GT t . (48) Therefore, the ground truth score reads log pt(u) = Σ 1 t (u Ψ(t, 0)u0). (49) We assume Σ0 is given but u0 is unknown. Fortunately, u0 can be inferred via one score evaluation as follows. Given evaluation log ps(u(s)), we can recover u0 as u0 = Ψ(0, s)[Σs log ps(u(s)) + u(s)]. (50) Plugging Eq. (50) and Ψ(t, s) = Ψ(t, 0)Ψ(0, s) into Eq. (49), we recover Eq. (20). B.3.2 PROOF OF PROP 6 With the approximator ϵθ(u, t) defined in Eq. (21) for τ [s, t], Eq. (6) can be reformulated as du = Fτudτ + 1 + λ2 2 GτGT τ R T τ ϵθ(u, τ)dt + λGτdw = (Fτ + 1 + λ2 2 GτGT τ R T τ R 1 τ )udt 2 GτGT τ R T τ R 1 τ Ψ(τ, s)(Rsϵθ(u(s), s) u(s))dt + λGτdw. (51) Define ˆFτ := Fτ + 1+λ2 2 GτGT τ R T τ R 1 τ = Fτ + 1+λ2 2 GτGT τ Σ 1 τ , and denote by ˆΨ(t, s) the transition matrix associated with it. Clearly, Eq. (51) is a linear differential equation on u, and the conditional probability ˆpst(u(t)|u(s)) associated with it is a Gaussian distribution. Applying S arkk a & Solin (2019, Eq (6.6,6.7)), we obtain the exact expressions Mean = ˆΨ(t, s)u(s) [ Z t s ˆΨ(t, τ)1 + λ2 2 GτGT τ Σ 1 τ Ψ(τ, s)]dτ u(s) s ˆΨ(t, τ)1 + λ2 2 GτGT τ Σ 1 τ Ψ(τ, s)]dτ Rsϵθ(u(s), s) (52) for the mean of ˆpst(u(t)|u(s)). Its covariance Psτ satisfies d Psτ dτ = ˆ FτPsτ + Psτ ˆ Fτ T + λ2GτGT τ , Pss = 0. (53) Eq. (52) has a closed form expression with the help of the following lemma. Published as a conference paper at ICLR 2023 s ˆΨ(t, τ)1 + λ2 2 GτGT τ Σ 1 τ Ψ(τ, s) = ˆΨ(t, s) Ψ(t, s) Proof. For a fixed s, we define N(t) = R t s ˆΨ(t, τ) 1+λ2 2 GτGT τ Σ 1 τ Ψ(τ, s) and M(t) = ˆΨ(t, s) Ψ(t, s). It follows that dτ = ˆFτN(τ) + 1 + λ2 2 GτGT τ Σ 1 τ Ψ(τ, s) = FτN(τ) + 1 + λ2 2 GτGT τ Σ 1 τ [N(τ) + Ψ(τ, s)] (54) dτ = ˆFτ ˆΨ(τ, s) FτΨ(τ, s) = FτM(τ) + 1 + λ2 2 GτGT τ Σ 1 τ ˆΨ(τ, s). (55) Define E(t) = N(t) M(t), then dt = (Ft + 1 + λ2 2 Gt GT t Σ 1 t )E(t). (56) On the other hand, N(s) = M(s) = 0 which implies E(s) = 0. We thus conclude E(t) = 0 and N(t) = M(t). Using lemma 1, we simplify Eq. (52) to Ψ(t, s)u(s) + [ˆΨ(t, s) Ψ(t, s)]Rsϵθ(u(s), s), (57) which is the mean in Eq. (22). B.3.3 PROOF OF THM 1 We restate the conclusion presented in Thm 1. The exact solution u(t t) to Eq. (6) with coefficient Eq. (8) is u(t t) N( rαt t αt u(t) + rαt t 1 αt t σ2 t ϵθ(u(t), t), σ2 t Id) with σ2 t = (1 αt t) 1 1 αt t λ2 , which is the same as the DDIM Eq. (9). Thm 1 is a concrete application of Eq. (22) when the DM is a DDPM and Fτ, Gτ are set to Eq. (8). Thanks to the special form of Fτ, Ψ has the expression Ψ(t, s) = r αt αs Id, (59) and ˆΨ satisfies log ˆΨ(t, s) = Z t dτ 1 1 ατ ]dτ (60) ˆΨ(t, s) = 1 αt Published as a conference paper at ICLR 2023 Mean: Based on Eq. (22), we obtain the mean of ˆpst as 1 αs + 1 αt ϵθ(u(s), s) (62) (1 αt) 1 αt t ϵθ(u(s), s) αs u(s) + r αt ϵθ(u(s), s), (64) σ2 s = (1 αt) Setting (s, t) (t, t t), we arrive at the mean update in Eq. (14). Covariance: It follows from dτ = 2[d log ατ dτ 1 1 ατ ]Psτ λ2 d log ατ dτ Id, Pss = 0 Pst = (1 αt) Setting (s, t) (t, t t), we recover the covariance in Eq. (14). B.3.4 PROOF OF PROP 7 When λ = 0, the update step in Eq. (22) from s to t reads u(t) = Ψ(t, s)u(s) + [ˆΨ(t, s) Ψ(t, s)]Rsϵθ(u(s), s). (66) Meanwhile, the update step in Eq. (18) from s to t is u(t) = Ψ(t, s)u(s) + [ Z t 1 2Ψ(t, τ)GτGT τ R T τ dτ]ϵθ(u(s), s). (67) Eqs. (66) and (67) are equivalent once we have the following lemma. Lemma 2. When λ = 0, Z t 1 2Ψ(t, τ)GτGT τ R T τ dτ = [ˆΨ(t, s) Ψ(t, s)]Rs. Proof. We introduce two new functions N(t) := Z t 1 2Ψ(t, τ)GτGT τ R T τ dτ (68) M(t) := [ˆΨ(t, s) Ψ(t, s)]Rs. (69) First, N(s) = M(s) = 0. Second, they satisfy dt = Ft N(t) + 1 2Gt GT t R T t (70) dt = [ ˆFt ˆΨ(t, s) FtΨ(t, s)]Rs (71) = Ft M(t) + 1 2Gt GT t Σ 1 t ˆΨ(t, s)Rs. (72) Published as a conference paper at ICLR 2023 Note ˆΨ and R satisfy the same linear differential equation as dt = [Ft + 1 2Gt GT t Σ 1 t ]ˆΨ(t, s), d Rt dt = [Ft + 1 2Gt GT t Σ 1 t ]Rt. (73) It is a standard result in linear system theory (see S arkk a & Solin (2019, Eq(2.34))) that ˆΨ(t, s) = Rt R 1 s . Plugging it and Rt RT t = Σt into Eq. (72) yields dt = Ft M(t) + 1 2Gt GT t R T t . (74) Define E(t) = N(t) M(t), then it satisfies E(s) = 0 d E(t) dt = Ft E(t), (75) which clearly implies that E(t) = 0. Thus, N(t) = M(t). C MORE EXPERIMENT DETAILS We present the practical implementation of g DDIM and its application to BMD and CLD. We include training details and discuss the necessary calculation overhead for executing g DDIM. More experiments are conducted to verify the effectiveness of g DDIM compared with other sampling algorithms. We report image sampling performance over an average of 3 runs with different random seeds. C.1 BDM: TRAINING AND SAMPLING Unfortunately, the pre-trained models for BDM are not available. We reproduce the training pipeline in BDM (Hoogeboom & Salimans, 2022) to validate the acceleration of g DDIM. The official pipeline is quite similar to the popular DDPM (Ho et al., 2020). We highlight the main difference and changes in our implementation. Compared DDPM, BDM use a different forward noising scheme Eqs. (11) and (25). The two key hyperparameters {αt}, {σt} follow the exact same setup in Hoogeboom & Salimans (2022), whose details and python implementation can be found in Appendix A (Hoogeboom & Salimans, 2022). In our implementation, we use Unet network architectures (Song et al., 2020b). We find our larger Unet improves samples quality. As a comparison, our SDE sampler can achieve FID as low as 2.51 while Hoogeboom & Salimans (2022) only has 3.17 on CIFAR10. C.2 CLD: TRAINING AND SAMPLING For CLD, Our training pipeline, model architectures and hyperparameters are similar to those in Dockhorn et al. (2021). The main differences are in the choice of Kt and loss weights K 1 t Λt K T t . Denote by ϵθ(u, t) = [ϵθ(u, t; x), ϵθ(u, t; v)] for corresponding model parameterization. The authors of Dockhorn et al. (2021) originally propose the parameterization sθ(u, t) = L T t ϵθ(u, t) where Σt = Lt LT t is the Cholesky decomposition of the covariance matrix of p0t(u(t)|x(0)). Built on DSM Eq. (3), they propose hybrid score matching (HSM) that is claimed to be advantageous (Dockhorn et al., 2021). It uses the loss Et U[0,T ]Ex(0),u(t)|x(0)[ ϵ ϵθ(µt(x0) + Ltϵ, t) 2 L 1 t Λt L T t ]. (76) With a similar derivation (Dockhorn et al., 2021), we obtain the HSM loss with our new score parameterization sθ(u, t) = L T t ϵθ(u, t) as Et U[0,T ]Ex(0),u(t)|x(0)[ ϵ ϵθ(µt(x0) + Rtϵ, t) 2 R 1 t Λt R T t ]. (77) Though Eqs. (76) and (77) look similar, we cannot directly use pretrained model provided in Dockhorn et al. (2021) for g DDIM. Due to the lower triangular structure of Lt and the special Gt, the solution to Eq. (6) only relies on ϵθ(u, t; v) and thus only ϵθ(u, t; v) is learned in Dockhorn Published as a conference paper at ICLR 2023 et al. (2021) via a special choice of Λt. In contrast, in our new parametrization, both ϵθ(u, t; x) and ϵθ(u, t; v) are needed to solve Eq. (6). To train the score model for g DDIM, we set R 1 t Λt R T t = I for simplicity, similar to the choice made in Ho et al. (2020). Our weight choice has reasonable performance and we leave improvement possibilities, such as mixed score (Dockhorn et al., 2021), better Λt weights (Song et al., 2021), for future work. Though we require a different training scheme of score model compared with Dockhorn et al. (2021), the modifications to the training pipeline and extra costs are almost ignorable. We change from Kt = Lt to Kt = Rt. Unlike Lt which has a triangular structure and closed form expression as (Dockhorn et al., 2021) Σxx t Σxv t (Σxv t )2 , with Σt = Σxx t Σxv t Σxv t Σvv t we rely on high accurate numerical solver to solve Rt. The triangular structure of Lt and sparse pattern of Gt for CLD in Eq. (10) also have an impact on the training loss function of the score model. Due to the special structure of Gt, we only need to learn sθ(u, t; v) signals present in the velocity channel. When Kt = Lt, sθ(u, t) = L T t ϵ(Lt) θ (u, t) with L T t being upper triangular. Thus we only need to train ϵ(Lt) θ (u, t; v) to recover sθ(u, t; v). In contrast, Rt does not share the triangular structure as Lt; both ϵ(Rt) θ (u, t; x), ϵ(Rt) θ (u, t; v) are needed to recover sθ(u, t; v). Therefore, Dockhorn et al. (2021) sets loss weights L 1 t Λt L T t = 0 0 0 1 while we choose R 1 t Λt R T t = 1 0 0 1 As a result, we need to double channels in the output layer in our new parameterization associated with Rt, though the increased number of parameters in last layer is negligible compared with other parts of diffusion models. We include the model architectures and hyperparameters in Tab. 4. In additional to the standard size model on CIFAR10, we also train a smaller model for CELEBA to show the efficacy and advantages of g DDIM. Table 4: Model architectures and hyperparameters Hyperparameter CIFAR10 CELEBA Model EMA rate 0.9999 0.999 # of Res Block per resolution 8 2 Normalization Group Normalization Group Normalization Progressive input Residual None Progressive combine Sum N/A Finite Impluse Response Enabled Disabled Embedding type Fourier Positional # of parameters 108M 62M Training # of iterations 1m 150k Optimizer Adam Adam Learning rate 2 10 4 2 10 4 Gradient norm clipping 1.0 1.0 Dropout 0.1 0.1 Batch size per GPU 32 32 GPUs 4 A6000 4 A6000 Training time 79h 16h Published as a conference paper at ICLR 2023 C.3 CALCULATION OF CONSTANT COEFFICIENTS In g DDIM, many coefficients cannot be obtained in closed-form. Here we present our approach to obtain them numerically. Those constant coefficients can be divided into two categories, solutions to ODEs and definite integrals. We remark that these coefficients only need to be calculated once and then can be used everywhere. For CLD, each of these coefficient corresponds to a 2 2 matrix. The calculation of all these coefficients can be done within 1 min. Type I: Solving ODEs The problem appears when we need to evaluate Rt in Eq. (17) and ˆΨ(t, s) in dt = ˆFt ˆΨ(t, s), ˆΨ(s, s) = I. (81) Across our experiments, we use RK4 with a step size 10 6 to calculate the ODE solutions. For ˆΨ(t, s), we only need to calculate ˆΨ(t, 0) because ˆΨ(t, s) = ˆΨ(t, 0)[ˆΨ(s, 0)] 1. In CLD, Ft, Gt, Rt, Σt can be simplified to a 2 2 matrix; solving the ODE with a small step size is extremely fast. We note Ψ(t, s) and Σt admit close-form formula (Dockhorn et al., 2021). Since the output of numerical solvers are discrete in time, we employ a linear interpolation to handle query in continuous time. Since Rt, ˆΨ are determined by the forward SDE in DMs, the numerical results can be shared. In stochastic g DDIM Eq. (22), we apply the same techniques to solve Pst. In BDM, Ft can be simplified into matrix whose shape align with spatial shape of given images. We note that the drift coefficient of BDM is a diagonal matrix, we can decompose matrix ODE into multiple one dimensional ODE. Thanks to parallel computation in GPU, solving multiple one dimensional ODE is efficient. Type II: Definite integrals The problem appears in the derivation of update step in Eqs. (18), (19b) and (46), which require coefficients such as p C(q) ij , c C(q) ij . We use step size 10 5 for the integration from t to t t. The integrand can be efficiently evaluated in parallel using GPUs. Again, the coefficients, such as p C(q) ij , c C(q) ij , are calculated once and used afterwards if we need sample another batch with the same time discretization. C.4 GDDIM WITH OTHER DIFFUSION MODELS Though we only test g DDIM in several existing diffusion models, including DDPM, BDM, and CLD, g DDIM can be applied to any other pre-trained diffusion models as long as the full score function is available. In the following, we list key procedures to integrate g DDIM sampler into general diffusion models. The integration consists of two stages, offline preparation of g DDIM (Stage I), and online execution of g DDIM (Stage II). Stage I: Offline preparation of g DDIM Preparation of g DDIM includes scheduling timestamps and the calculation of coefficients for the execution of g DDIM. Step 1: Determine an increasing time sequence T = {ti} Step 2: Obtain Ψ(t, s) by solving ODE Eq. (81). Step 3: Calculate Rt by solving ODE Eq. (17) Step 4: Obtain p C(q) ij , c C(q) ij via applying definite integrator solvers on Eqs. (18), (19b) and (46) How to solve ODEs Eqs. (17) and (81) and definite integrals Eqs. (18), (19b) and (46) has been discussed in App. C.3. Stage II: Online execution of g DDIM Stage II employs high order EI-based ODE solvers for Eq. (82) with Kt = Rt. We include pseudo-code for simulating EI-multistep solvers in Algo 1. It mainly uses updates Eq. (40) and Eq. (45). Published as a conference paper at ICLR 2023 C.5 MORE EXPERIMENTS ON THE CHOICE OF SCORE PARAMETERIZATION Here we present more experiments details and more experiments regarding Prop 1 and differences between the two parameterizations involving Rt and Lt. Toy experiments: Figure 3: Trajectory and ϵ of Probability Flow solution in CLD Euler EI Kt = Σt EI Kt = Lt EI Kt = Rt NFE=15 NFE=25 NFE=35 NFE=60 NFE=200 Figure 4: Sampling on a challenging 2D example with the exact score log pt(u), where data distribution is a mixture of Gaussian with small variance. Compared with Euler, algorithms based on Exponential Integrator (EI) Eq. (82) have much better sampling quality. Among EI-based samplers, different Kt for score parameterization ϵ(u, t) = K T t log pt(u) have different sampling performances when NFE is small. Clearly, Rt proposed by g DDIM enjoys better sampling quality given the same NFE budget. Here we present more empirical results to demonstrate the advantage of proper Kt. In VPSDE, the advantage of DDIM has been verified in various models and datasets (Song et al., 2020a). To empirically verify Prop 1, we present one toy example where the data distribution is a mixture of two one dimension Gaussian distributions. The ground truth log pt(u) is known in this toy example. As shown in Fig. 2, along the solution to probability flow ODE, the score parameterization ϵ(u, t) = log pt(u) 1 αt enjoys smoothing property. We remark that Rt = Lt = Σt = 1 αt Id in VPSDE, and thus g DDIM is the same as DDIM; the differences among Rt, Lt, Σt only appear when Σt is non-diagonal. Published as a conference paper at ICLR 2023 FID / IS at different NFE q Kt 20 30 40 50 0 Lt 461.72 / 1.18 441.05 / 1.21 244.26 / 2.39 120.07 / 5.25 Rt 16.74 / 8.67 9.73 / 8.98 6.32 / 9.24 5.17 / 9.39 1 Lt 368.38 / 1.32 232.39 / 2.43 99.45 / 4.92 57.06 / 6.70 Rt 8.03 / 9.12 4.26 / 9.49 3.27 / 9.61 2.67 / 9.65 2 Lt 463.33 / 1.17 166.90 / 3.56 4.12 / 9.25 3.31 / 9.38 Rt 3.90 / 9.56 2.66 / 9.64 2.39 / 9.74 2.32 / 9.75 3 Lt 464.36 / 1.17 463.45 / 1.17 463.32 / 1.17 240.45 / 2.29 Rt 332.70 / 1.47 292.31 / 1.70 13.27 / 10.15 2.26 / 9.77 Table 5: More experiments on CIFAR10 FID at different NFE q Kt 20 30 40 50 0 Lt 446.56 430.59 434.79 379.73 Rt 37.72 13.65 12.51 8.96 1 Lt 261.90 123.49 105.75 88.31 Rt 12.68 7.78 5.93 5.11 2 Lt 446.74 277.28 6.18 5.48 Rt 6.86 5.67 4.62 4.19 3 Lt 449.21 440.84 443.91 286.22 Rt 386.14 349.48 20.14 3.85 Table 6: More experiments on CELEBA In CLD, we present a similar study. As the covariance Σt in CLD is no longer diagonal, we find the difference of the Lt parameterization suggested by (Dockhorn et al., 2021) and the Rt parameterization is large in Fig. 3. The oscillation of {ϵ(Lt)} prevents numerical solvers from taking large step sizes and slows down sampling. We also include a more challenging 2D example to illustrate the difference further in Fig. 4. We compare sampling algorithms based on Exponential Integrator without multistep methods for a fair comparison, which reads u(t t) = Ψ(t t, t)u(t) + [ Z t t 1 2Ψ(t t, τ)GτGT τ K T τ dτ]ϵ(Kt)(u, t), (82) where ϵ(Kt)(u, t) = KT t log pt(u(t)). Though we have the exact score, sampling with Eq. (82) will not give us satisfying results if we use Kt = Lt other than Rt when NFE is small. Image experiments: We present more empirical results regarding the comparison between Lt and Rt. Note that we use exponential integrator for the parametrization Lt as well, similar to DEIS (Zhang & Chen, 2022). We vary the polynomial order q in multistep methods (Zhang & Chen, 2022) and test sampling performance on CIFAR10 and CELEBA. In both datasets, we generate 50k images and calcualte their FID. As shown in Tabs. 5 and 6, Rt has significant advantages, especially when NFE is small. We also find that multistep method with a large q can harm sampling performance when NFE is small. This is reasonable; the method with larger q assumes the nonlinear function is smooth in a large domain and may rely on outdated information for approximation, which may worsen the accuracy. C.6 MORE EXPERIMENTS ON THE CHOICE OF λ To study the effects of λ, we visualize the trajectories generated with various λ but the same random seeds in Fig. 5 on our toy example. Clearly, trajectories with smaller λ have better smoothing property while trajectories with large λ contain much more randomness. From the fast sampling perspective, trajectories with more stochasticity are much harder to predict with small NFE compared with smooth trajectories. Published as a conference paper at ICLR 2023 Figure 5: Sampling with various λ and accurate score on a Toy example. Trajectories are shown from T to 0. We include more qualitative results on the choice of λ and comparison between the Euler-Maruyama (EM) method and the g DDIM in Figs. 8 and 9. Clearly, when NFE is small, increasing λ has a negative effect on the sampling quality of g DDIM. We hypothesize that λ = 0 already generates high-fidelity samples and additional noise may harm the sampling performance. With a fixed number of function evaluations, information derived from score network fails to remove the injected noise as we increase λ. On the other hand, we find that the EM method shows slightly better quality as we increase λ. We hypothesize that the ODE or SDEs with small λ has more oscillations than SDEs with large λ. It is known that the EM method has a very bad performance for oscillations systems and suffers from large discretization error (Press et al., 2007). From previous experiments, we find that ODE in CLD is highly oscillated. We also find both methods perform worse than Symmetric Splitting CLD Sampler (SSCS) (Dockhorn et al., 2021) when λ = 1. The improvement by utilizing Hamiltonian structure and SDEs structure is significant. This encourages further exploration that incorporates Hamiltonian structure into g DDIM in the future. Nevertheless, we also remark that SSCS with λ = 1.0 performs much worse than g DDIM with λ = 0. C.7 MORE COMPARISONS We also compare the performance of the CLD model we trained with that claimed in Dockhorn et al. (2021) in Tab. 7. We find that our trained model performs worse than Dockhorn et al. (2021) when a blackbox ODE solver or EM sampling scheme with large NFE are used. There may be two reasons. First, with similar size model, our training scheme not only needs to fit v log pt(u), but also x log pt(u), while Dockhorn et al. (2021) can allocate all representation resources of neural network to v log pt(u). Another factor is the mixed score trick on parameterization, which is shown empirically have a boost in model performance (Dockhorn et al., 2021) but we do not include it in our training. We also compare our algorithm with more accelerating sampling methods in Tab. 7. g DDIM has achieved the best sampling acceleration results among training-free methods, but it still cannot compete with some distillation-based acceleration methods. In Tab. 8, we compare Predictor-only method with Predictor-Corrector (PC) method. With the same number of steps N, PC can improve the quality of Predictor-only at the cost of additional N 1 score evaluations, which is almost two times slower compared with the Predictor-only method. We also find large q may harm the sampling performance in the exponential multistep method when NFE is small. We note high order polynomial requires more datapoints to fit polynomial. The used datapoints may be out-of-date and harmful to sampling quality when we have large stepsizes. C.8 NEGATIVE LOG LIKELIHOOD EVALUATION Because our method only modifies the score parameterization compared with the original CLD (Dockhorn et al., 2021), we follow a similar procedure to evaluate the bound of negative log-likelihood (NLL). Specifically, we can simulate probability ODE Eq. (7) to estimate the log- Published as a conference paper at ICLR 2023 Table 7: More comparison on CIFAR10. FIDs may be reported based on different training techniques and data augmentation. It should not be regarded as the only evidence to compare different algorithms. Class Model NFE ( ) FID ( ) our CLD-SGM (g DDIM) 50 2.26 our CLD-SGM (SDE, EM) 2000 2.39 our CLD-SGM (Prob.Flow, RK45) 155 2.86 our CLD-SGM (Prob.Flow, RK45) 312 2.26 CLD-SGM (Prob.Flow, RK45) by Dockhorn et al. (2021) 147 2.71 CLD-SGM (SDE, EM) by Dockhorn et al. (2021) 2000 2.23 Training-free Accelerated Score DDIM by Song et al. (2020a) 100 4.16 Gotta Go Fast by Jolicoeur-Martineau et al. (2021b) 151 2.73 Analytic-DPM by Bao et al. (2022) 100 3.55 Fast DPM by Kong & Ping (2021a) 100 2.86 PNDM by Liu et al. (2022) 100 3.53 DEIS by Zhang & Chen (2022) 50 2.56 DPM-Solver by Lu et al. (2022) 50 2.65 Rescaled 2nd Order Heun by Karras et al. (2022) 35 1.97 Training-needed Accelerated Score DDSS by Watson et al. (2022) 25 4.25 Progressive Distillation by Salimans & Ho (2022) 4 3.0 Knowledge distillation by Luhman & Luhman (2021) 1 9.36 Score+Others LSGM by Vahdat et al. (2021) 138 2.10 LSGM-100M by Vahdat et al. (2021) 131 4.60 Diffusion GAN by Xiao et al. (2021) 4 3.75 FID at different steps N q Method 20 30 40 50 0 Predictor 16.74 9.73 6.32 5.17 1 Predictor 8.03 4.26 3.27 2.67 PC 6.24 2.36 2.26 2.25 2 Predictor 3.90 2.66 2.39 2.32 PC 3.01 2.29 2.26 2.26 3 Predictor 332.70 292.31 13.27 2.26 PC 337.20 313.21 2.67 2.25 Table 8: Predictor-only vs Predictor-Corrector (PC). Compared with Predictor-only, PC adds one more correcting step after each predicting step except the last step. When sampling with N step, the Predictor-only approach requires n score evaluation, while PC consumes 2N 1 NFE. Published as a conference paper at ICLR 2023 likelihood of given data (Grathwohl et al., 2018). However, our diffusion model models the joint distribution p(u0) = p(x0, v0) on test data x0 and augmented velocity data v0. Getting marginal distribution p(x0) from p(u0) is challenging, as we need integrate v0 for each x0. To circumvent this issue, Dockhorn et al. (2021) derives a lower bound on the log-likelihood, log p(x0) = log( Z p(v0)p(x0, v0) Ev0 p(v0)[log p(x0, v0)] + H(p(v0)), where H(p(v0)) denotes the entropy of p(v0). We can then estimate the lower bound with the Monte Carlo approach. Empirically, our trained model achieves a NLL upper bound 3.33 bits/dim, which is comparable with 3.31 bits/dim reported in the original CLD (Dockhorn et al., 2021). Possible approaches to further reduce the NLL bound include maximal likelihood weights (Song et al., 2021), and improved training techniques such as mixed score. For more discussions of log-likelihood and how to tighten the bound, we refer the reader to Dockhorn et al. (2021). C.9 CODE LICENSES We implemented g DDIM and related algorithms in Jax. We have used code from a number of sources in Tab. 9. URL Citation License https://github.com/yang-song/score_sde Song et al. (2020b) Apache License 2.0 https://github.com/nv-tlabs/CLD-SGM Dockhorn et al. (2021) NVIDIA License https://github.com/qsh-zh/deis Zhang & Chen (2022) Unknown Table 9: Code License Published as a conference paper at ICLR 2023 Figure 6: Comparison between L (Upper) and R (Lower) with exponential integrator on CIFAR10. Published as a conference paper at ICLR 2023 Figure 7: Comparison between L (Upper) and R (Lower) with exponential integrator on CELEBA. Published as a conference paper at ICLR 2023 Figure 8: Comparison between EM (Upper) and g DDIM (Lower) on CIFAR10. Published as a conference paper at ICLR 2023 Figure 9: Comparison between EM (Upper) and g DDIM (Lower) on CELEBA.