# On Calibrating Diffusion Probabilistic Models

Tianyu Pang¹, Cheng Lu², Chao Du¹, Min Lin¹, Shuicheng Yan¹, Zhijie Deng³
¹Sea AI Lab, Singapore  ²Department of Computer Science, Tsinghua University  ³Qing Yuan Research Institute, Shanghai Jiao Tong University
{tianyupang, duchao, linmin, yansc}@sea.com; lucheng.lc15@gmail.com; zhijied@sjtu.edu.cn

Abstract: Recently, diffusion probabilistic models (DPMs) have achieved promising results in diverse generative tasks. A typical DPM framework includes a forward process that gradually diffuses the data distribution and a reverse process that recovers the data distribution from time-dependent data scores. In this work, we observe that the stochastic reverse process of data scores is a martingale, from which concentration bounds and the optional stopping theorem for data scores can be derived. We then discover a simple way to calibrate an arbitrary pretrained DPM, with which the score matching loss can be reduced and the lower bounds of model likelihood can consequently be increased. We provide general calibration guidelines under various model parametrizations. Our calibration method is performed only once, and the resulting models can be used repeatedly for sampling. We conduct experiments on multiple datasets to empirically validate our proposal. Our code is available at https://github.com/thudzj/Calibrated-DPMs.

1 Introduction

In the past few years, denoising diffusion probabilistic modeling [17, 40] and score-based Langevin dynamics [42, 43] have demonstrated appealing results on image generation. Later, Song et al. [46] unified these two generative learning mechanisms through stochastic/ordinary differential equations (SDEs/ODEs). In the following we refer to this unified model family as diffusion probabilistic models (DPMs).
The emerging success of DPMs has attracted broad interest in downstream applications, including image generation [10, 22, 48], shape generation [4], video generation [18, 19], super-resolution [35], speech synthesis [5], graph generation [51], textual inversion [13, 34], improving adversarial robustness [50], and large text-to-image models [32, 33], just to name a few. A typical DPM framework involves a forward process that gradually diffuses the data distribution q_0(x_0) towards a noise distribution q_T(x_T). The transition probability for t ∈ [0, T] is a conditional Gaussian distribution q_{0t}(x_t|x_0) = N(x_t | α_t x_0, σ_t² I), where α_t, σ_t ∈ ℝ₊. Song et al. [46] show that there exist reverse SDE/ODE processes starting from q_T(x_T) and sharing the same marginal distributions q_t(x_t) as the forward process. The only unknown term in the reverse processes is the data score ∇_{x_t} log q_t(x_t), which can be approximated by a time-dependent score model s_θ^t(x_t) (or other model parametrizations). s_θ^t(x_t) is typically learned via score matching (SM) [20].

In this work, we observe that the stochastic process of the scaled data score α_t ∇_{x_t} log q_t(x_t) is a martingale w.r.t. the reverse-time process of x_t from T to 0, where the timestep t can be either continuous or discrete. Along the reverse-time sampling path, this martingale property leads to concentration bounds for scaled data scores. Moreover, a martingale satisfies the optional stopping theorem: its expected value at a stopping time equals its initial expected value. Based on the martingale property of data scores, for any t ∈ [0, T] and any pretrained score model s_θ^t(x_t) (or other model parametrizations), we can calibrate the model by subtracting its expectation over q_t(x_t), i.e., E_{q_t(x_t)}[s_θ^t(x_t)].

(Corresponding authors. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).)
We formally demonstrate that the calibrated score model s_θ^t(x_t) − E_{q_t(x_t)}[s_θ^t(x_t)] achieves lower values of SM objectives. By the connections between SM objectives and the model likelihood of the SDE process [23, 45] or the ODE process [28], the calibrated score model has higher evidence lower bounds. Similar conclusions also hold in the conditional case, where we calibrate a conditional score model s_θ^t(x_t, y) by subtracting its conditional expectation E_{q_t(x_t|y)}[s_θ^t(x_t, y)]. In practice, E_{q_t(x_t)}[s_θ^t(x_t)] or E_{q_t(x_t|y)}[s_θ^t(x_t, y)] can be approximated using noisy training data once the score model has been pretrained. We can also utilize an auxiliary shallow model to estimate these expectations dynamically during pretraining. When we do not have access to training data, we can compute the expectations using data generated from s_θ^t(x_t) or s_θ^t(x_t, y). In experiments, we evaluate our calibration trick on the CIFAR-10 [25] and CelebA 64×64 [27] datasets, reporting FID scores [16]. We also provide insightful visualization results on AFHQv2 [7], FFHQ [21], and ImageNet [9] at 64×64 resolution.

2 Diffusion probabilistic models

In this section, we briefly review the notations and training paradigms used in diffusion probabilistic models (DPMs). While recent works develop DPMs based on general corruptions [2, 8], we mainly focus on conventional Gaussian-based DPMs.

2.1 Forward and reverse processes

We consider a k-dimensional random variable x ∈ ℝ^k and define a forward diffusion process on x as {x_t}_{t∈[0,T]} with T > 0, which satisfies

  ∀t ∈ [0, T], x_0 ∼ q_0(x_0), q_{0t}(x_t|x_0) = N(x_t | α_t x_0, σ_t² I). (1)

Here q_0(x_0) is the data distribution; α_t and σ_t are two positive real-valued functions that are differentiable w.r.t. t with bounded derivatives. Let q_t(x_t) = ∫ q_{0t}(x_t|x_0) q_0(x_0) dx_0 be the marginal distribution of x_t. The schedules of α_t and σ_t² need to ensure that q_T(x_T) ≈ N(x_T | 0, σ̃² I) for some σ̃. Kingma et al.
[23] prove that there exists a stochastic differential equation (SDE) satisfying the forward transition distribution in Eq. (1), and this SDE can be written as

  dx_t = f(t) x_t dt + g(t) dω_t, (2)

where ω_t ∈ ℝ^k is the standard Wiener process, f(t) = d log α_t / dt, and g(t)² = dσ_t²/dt − 2 (d log α_t / dt) σ_t². Song et al. [46] demonstrate that the forward SDE in Eq. (2) corresponds to a reverse SDE constructed as

  dx_t = [f(t) x_t − g(t)² ∇_{x_t} log q_t(x_t)] dt + g(t) dω̄_t, (3)

where ω̄_t ∈ ℝ^k is the standard Wiener process in reverse time. Starting from q_T(x_T), the marginal distribution of the reverse SDE process is also q_t(x_t) for t ∈ [0, T]. There also exists a deterministic process described by an ordinary differential equation (ODE) as

  dx_t/dt = f(t) x_t − ½ g(t)² ∇_{x_t} log q_t(x_t), (4)

which starts from q_T(x_T) and shares the same marginal distribution q_t(x_t) as the reverse SDE in Eq. (3). Moreover, let q_{0t}(x_0|x_t) = q_{0t}(x_t|x_0) q_0(x_0) / q_t(x_t); by Tweedie's formula [12], we know that α_t E_{q_{0t}(x_0|x_t)}[x_0] = x_t + σ_t² ∇_{x_t} log q_t(x_t).

2.2 Training paradigm of DPMs

To estimate the data score ∇_{x_t} log q_t(x_t) at timestep t, a score-based model s_θ^t(x_t) [46] with shared parameters θ is trained to minimize the score matching (SM) objective [20]:

  J_SM^t(θ) ≜ ½ E_{q_t(x_t)} ‖s_θ^t(x_t) − ∇_{x_t} log q_t(x_t)‖₂². (5)

To eliminate the intractable computation of ∇_{x_t} log q_t(x_t), denoising score matching (DSM) [49] transforms J_SM^t(θ) into

  J_DSM^t(θ) ≜ ½ E_{q_0(x_0), q(ϵ)} ‖s_θ^t(x_t) + ϵ/σ_t‖₂²,

where x_t = α_t x_0 + σ_t ϵ and q(ϵ) = N(ϵ | 0, I) is a standard Gaussian distribution. Under mild boundary conditions, J_SM^t(θ) and J_DSM^t(θ) are equivalent up to a constant, i.e., J_SM^t(θ) = J_DSM^t(θ) + C_t, where C_t is a constant independent of the model parameters θ. Other SM variants [31, 44] are also applicable here. The total SM objective for training is a weighted sum of J_SM^t(θ) across t ∈ [0, T], defined as J_SM(θ; λ(t)) ≜ ∫₀ᵀ λ(t) J_SM^t(θ) dt, where λ(t) is a positive weighting function.
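Tweedie's formula above can be checked numerically in a setting where everything is available in closed form. The following is our own 1-D Gaussian toy sketch (all parameter values are illustrative, not from the paper):

```python
import numpy as np

# Toy check of Tweedie's formula (Sec. 2.1) on a 1-D Gaussian q_0 = N(mu0, s0^2).
mu0, s0 = 1.5, 0.7           # data distribution parameters (hypothetical)
alpha_t, sigma_t = 0.8, 0.6  # forward schedule values at some timestep t

# Marginal q_t = N(alpha_t * mu0, alpha_t^2 * s0^2 + sigma_t^2)
v_t = alpha_t**2 * s0**2 + sigma_t**2

def score_qt(x):
    """Exact data score d/dx log q_t(x) for the Gaussian marginal."""
    return -(x - alpha_t * mu0) / v_t

def posterior_mean_x0(x):
    """Closed-form E[x0 | x_t = x] for jointly Gaussian (x0, x_t)."""
    return mu0 + (alpha_t * s0**2 / v_t) * (x - alpha_t * mu0)

# Tweedie: alpha_t * E[x0 | x_t] = x_t + sigma_t^2 * score_qt(x_t)
xs = np.linspace(-3.0, 3.0, 101)
lhs = alpha_t * posterior_mean_x0(xs)
rhs = xs + sigma_t**2 * score_qt(xs)
assert np.allclose(lhs, rhs)
```

Both sides agree pointwise, since for jointly Gaussian variables the posterior mean and the marginal score are linear in x_t with matching coefficients.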
Similarly, the total DSM objective is J_DSM(θ; λ(t)) ≜ ∫₀ᵀ λ(t) J_DSM^t(θ) dt. The training objectives under other model parametrizations, such as noise prediction ϵ_θ^t(x_t) [17, 33], data prediction x_θ^t(x_t) [23, 32], and velocity prediction v_θ^t(x_t) [18, 38], are recapped in Appendix B.1.

2.3 Likelihood of DPMs

Suppose that the reverse processes start from a tractable prior p_T(x_T) = N(x_T | 0, σ̃² I). We can approximate the reverse-time SDE process by substituting ∇_{x_t} log q_t(x_t) with s_θ^t(x_t) in Eq. (3) as dx_t = [f(t) x_t − g(t)² s_θ^t(x_t)] dt + g(t) dω̄_t, which induces the marginal distribution p_t^SDE(x_t; θ) for t ∈ [0, T]. In particular, at t = 0, the KL divergence between q_0(x_0) and p_0^SDE(x_0; θ) can be bounded by the total SM objective J_SM(θ; g(t)²) with the weighting function g(t)², as stated below:

Lemma 1. (Proof in Song et al. [45]) Let q_t(x_t) be constructed from the forward process in Eq. (2). Then under regularity conditions, we have D_KL(q_0 ‖ p_0^SDE(θ)) ≤ J_SM(θ; g(t)²) + D_KL(q_T ‖ p_T).

Here D_KL(q_T ‖ p_T) is the prior loss, independent of θ. Similarly, we approximate the reverse-time ODE process by substituting ∇_{x_t} log q_t(x_t) with s_θ^t(x_t) in Eq. (4) as dx_t/dt = f(t) x_t − ½ g(t)² s_θ^t(x_t), which induces the marginal distribution p_t^ODE(x_t; θ) for t ∈ [0, T]. By the instantaneous change of variables formula [6], we have d log p_t^ODE(x_t; θ)/dt = −tr(∇_{x_t}(f(t) x_t − ½ g(t)² s_θ^t(x_t))), where tr(·) denotes the trace of a matrix. Integrating the change in log p_t^ODE(x_t; θ) from t = 0 to T gives the value of log p_T(x_T) − log p_0^ODE(x_0; θ), but requires tracking the path from x_0 to x_T. On the other hand, at t = 0, the KL divergence between q_0(x_0) and p_0^ODE(x_0; θ) can be decomposed:

Lemma 2. (Proof in Lu et al. [28]) Let q_t(x_t) be constructed from the forward process in Eq. (2). Then under regularity conditions, we have D_KL(q_0 ‖ p_0^ODE(θ)) = J_SM(θ; g(t)²) + D_KL(q_T ‖ p_T) + J_Diff(θ), where the term J_Diff(θ) measures the difference between s_θ^t(x_t) and ∇_{x_t} log p_t^ODE(x_t; θ).
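The DSM objective of Section 2.2 is straightforward to estimate by Monte Carlo. The sketch below (our own toy, not the paper's code) evaluates J_DSM^t for the exact score of a 1-D Gaussian and for a constantly offset copy; since cross terms vanish in expectation, a constant offset c raises J_DSM by about c²/2, mirroring the relation J_SM = J_DSM − C_t:

```python
import numpy as np

# Monte Carlo sketch of the DSM objective J_DSM^t (Sec. 2.2) on a 1-D Gaussian
# toy problem where the exact score is known. Names here are illustrative.
rng = np.random.default_rng(0)
mu0, s0 = 0.0, 1.0
alpha_t, sigma_t = 0.8, 0.6
v_t = alpha_t**2 * s0**2 + sigma_t**2  # variance of q_t

def dsm_loss(score_fn, n=400_000):
    """J_DSM^t(score_fn) = 0.5 * E ||score_fn(x_t) + eps/sigma_t||^2."""
    x0 = rng.normal(mu0, s0, size=n)
    eps = rng.normal(size=n)
    xt = alpha_t * x0 + sigma_t * eps
    return 0.5 * np.mean((score_fn(xt) + eps / sigma_t) ** 2)

exact_score = lambda x: -(x - alpha_t * mu0) / v_t
biased_score = lambda x: exact_score(x) + 0.5  # constant mis-calibration

# The offset model has a strictly larger DSM loss, by roughly 0.5^2/2 = 0.125.
gap = dsm_loss(biased_score) - dsm_loss(exact_score)
assert 0.08 < gap < 0.17
```

Note that even the exact score has a nonzero DSM loss (the constant C_t), which is why DSM values are only comparable between models, not interpretable on an absolute scale.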
Directly computing J_Diff(θ) is intractable due to the term ∇_{x_t} log p_t^ODE(x_t; θ); nevertheless, we can bound J_Diff(θ) via bounding high-order SM objectives [28].

3 Calibrating pretrained DPMs

In this section we begin by deriving the relationship between data scores at different timesteps, which leads us to a straightforward method for calibrating any pretrained DPM. We further investigate how the dataset bias of finite samples prevents empirical learning from achieving calibration.

3.1 The stochastic process of data scores

According to Kingma et al. [23], the form of the forward process in Eq. (1) can be generalized to any two timesteps 0 ≤ s < t ≤ T. The transition probability from x_s to x_t is written as q_{st}(x_t|x_s) = N(x_t | α_{t|s} x_s, σ_{t|s}² I), where α_{t|s} = α_t/α_s and σ_{t|s}² = σ_t² − α_{t|s}² σ_s². Here the marginal distribution satisfies q_t(x_t) = ∫ q_{st}(x_t|x_s) q_s(x_s) dx_s. We can generally derive the connection between the data scores ∇_{x_t} log q_t(x_t) and ∇_{x_s} log q_s(x_s) as stated below:

Theorem 1. (Proof in Appendix A.1) Let q_t(x_t) be constructed from the forward process in Eq. (2). Then under some regularity conditions, we have, for all 0 ≤ s < t ≤ T,

  α_t ∇_{x_t} log q_t(x_t) = E_{q_{st}(x_s|x_t)}[α_s ∇_{x_s} log q_s(x_s)], (6)

where q_{st}(x_s|x_t) = q_{st}(x_t|x_s) q_s(x_s) / q_t(x_t) is the transition probability from x_t to x_s.

Theorem 1 indicates that the stochastic process of α_t ∇_{x_t} log q_t(x_t) is a martingale w.r.t. the reverse-time process of x_t from timestep T to 0. By the optional stopping theorem [14], the expected value of a martingale at a stopping time equals its initial expected value E_{q_0(x_0)}[∇_{x_0} log q_0(x_0)]. It is known that, under a mild boundary condition on q_0(x_0), E_{q_0(x_0)}[∇_{x_0} log q_0(x_0)] = 0 (the proof is recapped in Appendix A.2). Consequently, the martingale property results in E_{q_t(x_t)}[∇_{x_t} log q_t(x_t)] = 0 for all t ∈ [0, T].
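The conclusion E_{q_t(x_t)}[∇_{x_t} log q_t(x_t)] = 0 can be verified numerically when the score is available in closed form. Below is our own sketch on a diffused 1-D two-component Gaussian mixture (all parameter choices are illustrative):

```python
import numpy as np

# Numerical check (Sec. 3.1) that E_{q_t}[score of q_t] = 0 for a 1-D
# Gaussian-mixture q_0 with component std s0; values are illustrative.
rng = np.random.default_rng(1)
means, weights = np.array([-2.0, 1.5]), np.array([0.3, 0.7])
s0 = 0.5
alpha_t, sigma_t = 0.7, 0.8
# Diffusing each component: q_t is a mixture of N(alpha_t*m, alpha_t^2*s0^2+sigma_t^2).
m_t = alpha_t * means
v_t = alpha_t**2 * s0**2 + sigma_t**2

def score_qt(x):
    """Exact score of the diffused mixture via responsibility weighting."""
    x = np.atleast_1d(x)[:, None]
    log_comp = -0.5 * (x - m_t) ** 2 / v_t  # shared constants cancel
    resp = weights * np.exp(log_comp)
    resp /= resp.sum(axis=1, keepdims=True)
    return (resp * (-(x - m_t) / v_t)).sum(axis=1)

# Sample x_t ~ q_t and average the exact score: the mean should be ~0.
comp = rng.choice(2, size=500_000, p=weights)
x0 = rng.normal(means[comp], s0)
xt = alpha_t * x0 + sigma_t * rng.normal(size=x0.size)
assert abs(np.mean(score_qt(xt))) < 0.02
```

The same check fails for a mis-calibrated model whose output is the exact score plus a constant, which is precisely the gap the calibration trick of the next subsection removes.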
Moreover, the martingale property of the (scaled) data score α_t ∇_{x_t} log q_t(x_t) leads to concentration bounds via Azuma's inequality and Doob's martingale inequality, as derived in Appendix A.3. Although we do not use these concentration bounds further in this paper, concurrent works employ roughly similar bounds in diffusion models, e.g., for proving consistency [47] or justifying trajectory retrieval [52].

3.2 A simple calibration trick

Given a pretrained model s_θ^t(x_t), in practice there is usually E_{q_t(x_t)}[s_θ^t(x_t)] ≠ 0, despite the fact that the expected data score is zero, i.e., E_{q_t(x_t)}[∇_{x_t} log q_t(x_t)] = 0. This motivates us to calibrate s_θ^t(x_t) to s_θ^t(x_t) − η_t, where η_t is a time-dependent calibration term independent of any particular input x_t. The calibrated SM objective is written as

  J_SM^t(θ, η_t) ≜ ½ E_{q_t(x_t)} ‖s_θ^t(x_t) − η_t − ∇_{x_t} log q_t(x_t)‖₂² = J_SM^t(θ) − E_{q_t(x_t)}[s_θ^t(x_t)]ᵀ η_t + ½ ‖η_t‖₂², (7)

where the second equality uses E_{q_t(x_t)}[∇_{x_t} log q_t(x_t)] = 0, and J_SM^t(θ, 0) = J_SM^t(θ) when η_t = 0. Note that Eq. (7) is a quadratic function w.r.t. η_t. We look for the optimal η_t* = argmin_{η_t} J_SM^t(θ, η_t) that minimizes the calibrated SM objective, from which we can derive

  η_t* = E_{q_t(x_t)}[s_θ^t(x_t)]. (8)

Plugging η_t* into J_SM^t(θ, η_t), we have

  J_SM^t(θ, η_t*) = J_SM^t(θ) − ½ ‖E_{q_t(x_t)}[s_θ^t(x_t)]‖₂². (9)

Since J_SM^t(θ) = J_DSM^t(θ) + C_t, we also have J_DSM^t(θ, η_t*) = J_DSM^t(θ) − ½ ‖E_{q_t(x_t)}[s_θ^t(x_t)]‖₂² for the DSM objective. Similar calibration tricks are valid under other model parametrizations and SM variants, as formally described in Appendix B.2.

Remark. For any pretrained score model s_θ^t(x_t), we can calibrate it into s_θ^t(x_t) − E_{q_t(x_t)}[s_θ^t(x_t)], which reduces the SM/DSM objectives at timestep t by ½ ‖E_{q_t(x_t)}[s_θ^t(x_t)]‖₂².
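The whole calibration trick fits in a few lines. The sketch below (our toy setup; the "pretrained" model is a deliberately biased exact Gaussian score) estimates η_t* = E_{q_t(x_t)}[s_θ^t(x_t)] by Monte Carlo and confirms the reduction of Eq. (9):

```python
import numpy as np

# Sketch of the calibration trick (Sec. 3.2): subtracting a Monte Carlo
# estimate of E_{q_t}[s_theta(x_t)] lowers the SM objective by ~0.5*||eta||^2.
rng = np.random.default_rng(2)
alpha_t, sigma_t = 0.8, 0.6
v_t = alpha_t**2 + sigma_t**2            # q_0 = N(0, 1), so q_t = N(0, v_t)
bias = 0.3                               # hypothetical mis-calibration

true_score = lambda x: -x / v_t
model_score = lambda x: true_score(x) + bias

xt = rng.normal(0.0, np.sqrt(v_t), size=200_000)
eta = np.mean(model_score(xt))           # Eq. (8): eta_t* = E[s_theta(x_t)]

sm = lambda s: 0.5 * np.mean((s - true_score(xt)) ** 2)
loss_before = sm(model_score(xt))
loss_after = sm(model_score(xt) - eta)   # calibrated model
assert loss_after < loss_before
assert abs((loss_before - loss_after) - 0.5 * bias**2) < 0.01
```

On finite samples the reduction matches ½‖η_t*‖₂² only up to Monte Carlo error, which is why the last assertion uses a tolerance.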
The expectation of the calibrated score model is always zero, i.e., E_{q_t(x_t)}[s_θ^t(x_t) − E_{q_t(x_t)}[s_θ^t(x_t)]] = 0 holds for any θ, which is consistent with the property E_{q_t(x_t)}[∇_{x_t} log q_t(x_t)] = 0 satisfied by data scores.

Calibration preserves conservativeness. A theoretical flaw of score-based modeling is that s_θ^t(x_t) may not correspond to a probability distribution. To address this issue, Salimans and Ho [37] develop an energy-based model design, which retains the power of score-based modeling while ensuring that s_θ^t(x_t) is conservative, i.e., there exists a probability distribution p_θ^t(x_t) such that for all x_t ∈ ℝ^k, we have s_θ^t(x_t) = ∇_{x_t} log p_θ^t(x_t). In this case, after we calibrate s_θ^t(x_t) by subtracting η_t, there is s_θ^t(x_t) − η_t = ∇_{x_t} log [p_θ^t(x_t) exp(−x_tᵀ η_t) / Z_t(θ)], where Z_t(θ) = ∫ p_θ^t(x_t) exp(−x_tᵀ η_t) dx_t is the normalization factor. Intuitively, subtracting η_t corresponds to a shift in the vector space, so if s_θ^t(x_t) is conservative, its calibrated version s_θ^t(x_t) − η_t is also conservative.

Conditional cases. For conditional DPMs, we usually employ a conditional model s_θ^t(x_t, y), where y ∈ Y is the conditional context (e.g., a class label or text prompt). To learn the conditional data score ∇_{x_t} log q_t(x_t|y) = ∇_{x_t} log q_t(x_t, y), we minimize the SM objective J_SM^t(θ) ≜ ½ E_{q_t(x_t,y)} ‖s_θ^t(x_t, y) − ∇_{x_t} log q_t(x_t, y)‖₂². Similar to the conclusion E_{q_t(x_t)}[∇_{x_t} log q_t(x_t)] = 0, there is E_{q_t(x_t|y)}[∇_{x_t} log q_t(x_t|y)] = 0. To calibrate s_θ^t(x_t, y), we use a conditional term η_t(y), and the calibrated SM objective is formulated as

  J_SM^t(θ, η_t(y)) ≜ ½ E_{q_t(x_t,y)} ‖s_θ^t(x_t, y) − η_t(y) − ∇_{x_t} log q_t(x_t, y)‖₂² = J_SM^t(θ) − E_{q_t(x_t,y)}[s_θ^t(x_t, y)ᵀ η_t(y)] + ½ E_{q_t(y)}‖η_t(y)‖₂²,

and for any y ∈ Y, the optimal η_t*(y) is given by η_t*(y) = E_{q_t(x_t|y)}[s_θ^t(x_t, y)]. We highlight the conditional context y in contrast to the unconditional form in Eq. (7). Plugging η_t*(y) into J_SM^t(θ, η_t(y)), we have

  J_SM^t(θ, η_t*(y)) = J_SM^t(θ) − ½ E_{q_t(y)}[‖E_{q_t(x_t|y)}[s_θ^t(x_t, y)]‖₂²].
This conditional calibration naturally generalizes to other model parametrizations and SM variants.

3.3 Likelihood of calibrated DPMs

We now discuss the effects of calibration on model likelihood. Following the notation of Section 2.3, we use p_0^SDE(θ, η_t) and p_0^ODE(θ, η_t) to denote the distributions induced by the reverse-time SDE and ODE processes, respectively, where ∇_{x_t} log q_t(x_t) is substituted with s_θ^t(x_t) − η_t.

Likelihood of p_0^SDE(θ, η_t). Let J_SM(θ, η_t; g(t)²) ≜ ∫₀ᵀ g(t)² J_SM^t(θ, η_t) dt be the total SM objective after the score model is calibrated by η_t; then according to Lemma 1, we have D_KL(q_0 ‖ p_0^SDE(θ, η_t)) ≤ J_SM(θ, η_t; g(t)²) + D_KL(q_T ‖ p_T). From the result in Eq. (9), there is

  J_SM(θ, η_t*; g(t)²) = J_SM(θ; g(t)²) − ½ ∫₀ᵀ g(t)² ‖E_{q_t(x_t)}[s_θ^t(x_t)]‖₂² dt. (11)

Therefore, after calibration, D_KL(q_0 ‖ p_0^SDE(θ, η_t*)) has a lower upper bound of J_SM(θ, η_t*; g(t)²) + D_KL(q_T ‖ p_T), compared to the bound J_SM(θ; g(t)²) + D_KL(q_T ‖ p_T) for the original D_KL(q_0 ‖ p_0^SDE(θ)). However, we must clarify that D_KL(q_0 ‖ p_0^SDE(θ, η_t*)) is not necessarily smaller than D_KL(q_0 ‖ p_0^SDE(θ)), since we can only compare their upper bounds.

Likelihood of p_0^ODE(θ, η_t). Note that in Lemma 2 there is a term J_Diff(θ), which is usually small in practice since s_θ^t(x_t) and ∇_{x_t} log p_t^ODE(x_t; θ) are close. Thus, we have D_KL(q_0 ‖ p_0^ODE(θ, η_t)) ≈ J_SM(θ, η_t; g(t)²) + D_KL(q_T ‖ p_T), and D_KL(q_0 ‖ p_0^ODE(θ, η_t*)) approximately achieves its lowest value. Lu et al. [28] show that D_KL(q_0 ‖ p_0^ODE(θ)) can be further bounded by high-order SM objectives (as detailed in Appendix A.4), which depend on ∇_{x_t} s_θ^t(x_t) and ∇_{x_t} tr(∇_{x_t} s_θ^t(x_t)). Since the calibration term η_t is independent of x_t, i.e., ∇_{x_t} η_t = 0, it does not affect the values of the high-order SM objectives, and it achieves a lower upper bound due to the lower value of the first-order SM objective.
3.4 Empirical learning fails to achieve E_{q_t(x_t)}[s_θ^t(x_t)] = 0

A natural question is whether better architectures or learning algorithms for DPMs (e.g., EDMs [22]) could empirically achieve E_{q_t(x_t)}[s_θ^t(x_t)] = 0 without calibration. The answer may be negative, since in practice we only have access to a finite dataset sampled from q_0(x_0). More specifically, assume that we have a training dataset D ≜ {x_0^n}_{n=1}^N with x_0^n ∼ q_0(x_0), and define the kernel density distribution induced by D as q_t(x_t; D) ≜ (1/N) Σ_{n=1}^N N(x_t | α_t x_0^n, σ_t² I). As the quantity of training data approaches infinity, lim_{N→∞} q_t(x_t; D) = q_t(x_t) holds for all t ∈ [0, T]. The empirical DSM objective trained on D is then written as

  J_DSM^t(θ; D) ≜ (1/2N) Σ_{n=1}^N E_{q(ϵ)} ‖s_θ^t(α_t x_0^n + σ_t ϵ) + ϵ/σ_t‖₂²,

and it is easy to show that the optimal solution minimizing J_DSM^t(θ; D) satisfies (assuming s_θ^t has universal model capacity) s_θ^t(x_t) = ∇_{x_t} log q_t(x_t; D). Given a finite dataset D, there is E_{q_t(x_t;D)}[∇_{x_t} log q_t(x_t; D)] = 0, but typically

  E_{q_t(x_t)}[∇_{x_t} log q_t(x_t; D)] ≠ 0, (13)

indicating that even if the score model is learned to be optimal, there is still E_{q_t(x_t)}[s_θ^t(x_t)] ≠ 0. Thus, the mis-calibration of DPMs is partially due to dataset bias, i.e., during training we can only access a finite dataset D sampled from q_0(x_0). Furthermore, when trained on a finite dataset in practice, the learned model will not converge to the optimal solution [15], so there is typically s_θ^t(x_t) ≠ ∇_{x_t} log q_t(x_t; D) and E_{q_t(x_t;D)}[s_θ^t(x_t)] ≠ 0. After calibration, we can at least guarantee that E_{q_t(x_t;D)}[s_θ^t(x_t) − E_{q_t(x_t;D)}[s_θ^t(x_t)]] = 0 always holds on any finite dataset D. In Figure 3, we demonstrate that even state-of-the-art EDMs still have non-zero and semantically structured E_{q_t(x_t)}[s_θ^t(x_t)], which emphasizes the significance of calibrating DPMs.
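The dataset-bias argument can be made concrete in one dimension. The sketch below (our own toy) builds the empirical kernel density score ∇ log q_t(x; D) for a tiny dataset D and checks that its mean is zero under q_t(·; D) itself, while its mean under the true q_t is generally biased:

```python
import numpy as np

# Illustration of the dataset bias in Sec. 3.4, with q_0 = N(0, 1) and a
# tiny finite dataset D; all parameter values are illustrative.
rng = np.random.default_rng(3)
alpha_t, sigma_t = 0.9, 0.4
D = rng.normal(size=8)                      # a tiny finite dataset

def kde_score(x):
    """Score of q_t(x; D) = (1/N) sum_n N(x | alpha_t*x0_n, sigma_t^2)."""
    x = np.atleast_1d(x)[:, None]
    diff = x - alpha_t * D                  # shape (m, N)
    logw = -0.5 * diff**2 / sigma_t**2
    logw -= logw.max(axis=1, keepdims=True)  # numerical stabilization
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)
    return (w * (-diff / sigma_t**2)).sum(axis=1)

# Under q_t(.; D): resample from the kernel density itself -> mean score ~ 0.
x_kde = alpha_t * rng.choice(D, 200_000) + sigma_t * rng.normal(size=200_000)
assert abs(np.mean(kde_score(x_kde))) < 0.05
# Under the true q_t = N(0, alpha_t^2 + sigma_t^2): the mean score is biased.
x_true = rng.normal(0.0, np.sqrt(alpha_t**2 + sigma_t**2), size=200_000)
print("bias under true q_t:", np.mean(kde_score(x_true)))
```

The printed bias depends on the particular draw of D and shrinks as |D| grows, matching the claim that calibration only guarantees zero mean w.r.t. the empirical distribution.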
Figure 1: Time-dependent values of ½‖E_{q_t(x_t)}[ϵ_θ^t(x_t)]‖₂² (first row) and (g(t)²/2σ_t²)‖E_{q_t(x_t)}[ϵ_θ^t(x_t)]‖₂² (second row), calculated on different datasets (columns: ImageNet 64×64, FFHQ 64×64, AFHQv2 64×64, CelebA 64×64, CIFAR-10). The models on CIFAR-10 and CelebA are trained on discrete timesteps (t = 0, 1, ..., 1000), while those on AFHQv2, FFHQ, and ImageNet are trained on continuous timesteps (t ∈ [0, 1]). We convert data prediction x_θ^t(x_t) into noise prediction ϵ_θ^t(x_t) via ϵ_θ^t(x_t) = (x_t − α_t x_θ^t(x_t))/σ_t. The y-axis is clamped to [0, 500].

3.5 Amortized computation of E_{q_t(x_t)}[s_θ^t(x_t)]

By default, we can calculate and store the value of E_{q_t(x_t)}[s_θ^t(x_t)] for a pretrained model s_θ^t(x_t), where the selection of timestep t is determined by the inference algorithm, and the expectation over q_t(x_t) can be approximated by Monte Carlo sampling from a noisy training set. When we do not have access to training data, we can approximate the expectation using data generated from p_t^ODE(x_t; θ) or p_t^SDE(x_t; θ). Since we only need to calculate E_{q_t(x_t)}[s_θ^t(x_t)] once, the added computational overhead is amortized as the number of generated samples increases.

Dynamic recording. The preceding discussion focuses primarily on computing E_{q_t(x_t)}[s_θ^t(x_t)] post-training. An alternative strategy is to dynamically record E_{q_t(x_t)}[s_θ^t(x_t)] during the pretraining phase of s_θ^t(x_t).
Specifically, we can construct an auxiliary shallow network h_ϕ(t) parameterized by ϕ, whose input is the timestep t. We define the expected mean squared error as

  J_Cal^t(ϕ) ≜ E_{q_t(x_t)} ‖h_ϕ(t) − sg[s_θ^t(x_t)]‖₂², (14)

where sg[·] denotes the stop-gradient operation, and the optimal solution ϕ* of minimizing J_Cal^t(ϕ) w.r.t. ϕ satisfies h_ϕ*(t) = η_t* = E_{q_t(x_t)}[s_θ^t(x_t)] (assuming sufficient model capacity). The total training objective can therefore be expressed as J_SM(θ; λ(t)) + ∫₀ᵀ β_t J_Cal^t(ϕ) dt, where β_t is a time-dependent trade-off coefficient for t ∈ [0, T].

4 Experiments

In this section, we demonstrate that sample quality and model likelihood can both be improved by calibrating DPMs. Rather than establishing a new state of the art, the purpose of our empirical studies is to verify the efficacy of our calibration technique as a simple way to repair DPMs.

4.1 Sample quality

Setup. We apply post-training calibration to discrete-time models trained on CIFAR-10 [25] and CelebA [27], which use the noise-prediction parametrization ϵ_θ^t(x_t). In the sampling phase, we employ DPM-Solver [29], an ODE-based sampler that achieves a promising balance between sample efficiency and image quality. Because our calibration acts directly on model scores, it is also compatible with other ODE/SDE-based samplers [3, 26], but we focus on DPM-Solver in this paper. Following its recommendations, we set the end time of DPM-Solver to 10⁻³ when the number of sampling steps is less than 15, and to 10⁻⁴ otherwise. Additional details can be found in Lu et al. [29].

Table 1: Comparison of sample quality measured by FID with varying NFE on CIFAR-10. Experiments are conducted using a linear noise schedule on the discrete-time model from [17]. We consider three variants of DPM-Solver with different orders. For results where the given NFE is not divisible by the solver order, the actual NFE is order·⌊NFE/order⌋, which is smaller than the given NFE, following the setting in [29].
| Noise prediction | DPM-Solver | NFE=10 | 15 | 20 | 25 | 30 | 35 | 40 |
|---|---|---|---|---|---|---|---|---|
| ϵ_θ^t(x_t) | 1-order | 20.49 | 12.47 | 9.72 | 7.89 | 6.84 | 6.22 | 5.75 |
|  | 2-order | 7.35 | 4.52 | 4.14 | 3.92 | 3.74 | 3.71 | 3.68 |
|  | 3-order | 23.96 | 4.61 | 3.89 | 3.73 | 3.65 | 3.65 | 3.60 |
| ϵ_θ^t(x_t) − E_{q_t(x_t)}[ϵ_θ^t(x_t)] | 1-order | 19.31 | 11.77 | 8.86 | 7.35 | 6.28 | 5.76 | 5.36 |
|  | 2-order | 6.76 | 4.36 | 4.03 | 3.66 | 3.54 | 3.44 | 3.48 |
|  | 3-order | 53.50 | 4.22 | 3.32 | 3.33 | 3.35 | 3.32 | 3.31 |

Table 2: Comparison of sample quality measured by FID with varying NFE on CelebA 64×64. Experiments are conducted using a linear noise schedule on the discrete-time model from [41]. The settings of DPM-Solver are the same as on CIFAR-10.

| Noise prediction | DPM-Solver | NFE=10 | 15 | 20 | 25 | 30 | 35 | 40 |
|---|---|---|---|---|---|---|---|---|
| ϵ_θ^t(x_t) | 1-order | 16.74 | 11.85 | 7.93 | 6.67 | 5.90 | 5.38 | 5.01 |
|  | 2-order | 4.32 | 3.98 | 2.94 | 2.88 | 2.88 | 2.88 | 2.84 |
|  | 3-order | 11.92 | 3.91 | 2.84 | 2.76 | 2.82 | 2.81 | 2.85 |
| ϵ_θ^t(x_t) − E_{q_t(x_t)}[ϵ_θ^t(x_t)] | 1-order | 16.13 | 11.29 | 7.09 | 6.06 | 5.28 | 4.87 | 4.39 |
|  | 2-order | 4.42 | 3.94 | 2.61 | 2.66 | 2.54 | 2.52 | 2.49 |
|  | 3-order | 35.47 | 3.62 | 2.33 | 2.43 | 2.40 | 2.43 | 2.49 |

By default, we employ the FID score [16] to quantify sample quality using 50,000 samples; a lower FID indicates higher sample quality. In addition, in Table 3, we evaluate other metrics such as sFID [30], IS [39], and Precision/Recall [36].

Computing E_{q_t(x_t)}[ϵ_θ^t(x_t)]. To estimate the expectation over q_t(x_t), we construct x_t = α_t x_0 + σ_t ϵ, where x_0 ∼ q_0(x_0) is sampled from the training set and ϵ ∼ N(ϵ | 0, I) is sampled from a standard Gaussian distribution. The selection of timestep t depends on the sampling schedule of DPM-Solver. The computed values of E_{q_t(x_t)}[ϵ_θ^t(x_t)] are stored in a dictionary and wrapped into the output layers of DPMs, allowing existing inference pipelines to be reused. We first calibrate the model trained by Ho et al. [17] on CIFAR-10 and compare it to the original one for sampling with DPM-Solvers, conducting a systematic study with varying NFE (i.e., number of function evaluations) and solver order.
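The dictionary-plus-wrapper deployment described above can be sketched as follows. This is our own minimal stand-in (the dummy `base_model` is hypothetical, not the authors' network), showing how precomputed per-timestep terms slot into an unchanged inference pipeline:

```python
import numpy as np

# Sketch of the deployment in Sec. 4.1: per-timestep calibration terms are
# precomputed into a dictionary, and the model is wrapped so that existing
# inference pipelines can be reused unchanged.
rng = np.random.default_rng(4)

def base_model(x_t, t):
    """Dummy eps-prediction model with a t-dependent constant bias (ours)."""
    return -x_t + 0.01 * t  # real models are neural networks

def estimate_calibration(model, noisy_batches):
    """eta_t = Monte Carlo mean of model outputs over q_t, per timestep."""
    return {t: np.mean(model(x_t, t)) for t, x_t in noisy_batches.items()}

class CalibratedModel:
    """Wraps a model so calling it returns model(x_t, t) - eta[t]."""
    def __init__(self, model, eta):
        self.model, self.eta = model, eta
    def __call__(self, x_t, t):
        return self.model(x_t, t) - self.eta[t]

timesteps = [100, 500, 999]
noisy = {t: rng.normal(size=10_000) for t in timesteps}  # proxy x_t samples
eta = estimate_calibration(base_model, noisy)
calibrated = CalibratedModel(base_model, eta)

# The calibrated model has (empirical) zero mean on the same batches.
for t in timesteps:
    assert abs(np.mean(calibrated(noisy[t], t))) < 1e-9
```

Because the wrapper only subtracts a cached constant per timestep, the extra cost at sampling time is negligible, which is what makes the one-time calibration amortizable.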
The results are presented in Tables 1 and 3. After calibrating the model, sample quality is consistently enhanced, which demonstrates both the significance of calibration and the efficacy of our method. We highlight the significant improvements in sample quality (4.61 → 4.22 with 15 NFE and a 3-order DPM-Solver; 3.89 → 3.32 with 20 NFE and a 3-order DPM-Solver). After model calibration, the number of steps required for a 3-order DPM-Solver to converge is reduced from 30 to 20, making our method a new option for expediting the sampling of DPMs. In addition, as a point of comparison, the 3-order DPM-Solver with 1,000 NFE only yields an FID of 3.45 when using the original model, which, together with the results in Table 1, indicates that model calibration helps improve the convergence of sampling.

Then, we conduct experiments with the discrete-time model trained on CelebA 64×64 by Song et al. [41]. The corresponding sample quality comparison is shown in Table 2. Clearly, model calibration brings significant gains (3.91 → 3.62 with 15 NFE and a 3-order DPM-Solver; 2.84 → 2.33 with 20 NFE and a 3-order DPM-Solver) that are consistent with those on CIFAR-10. This demonstrates the prevalence of the mis-calibration issue in existing DPMs and the efficacy of our correction. We again observe that model calibration improves the convergence of sampling, and as shown in Figure 2, our calibration can help reduce ambiguous generations. More generated images are displayed in Appendix C.

Table 3: Comparison of sample quality measured by different metrics, including FID (↓), sFID (↓), inception score (IS, ↑), precision (↑), and recall (↑), with varying NFE on CIFAR-10. We use "Base" to denote the baseline ϵ_θ^t(x_t) and "Ours" to denote the calibrated score ϵ_θ^t(x_t) − E_{q_t(x_t)}[ϵ_θ^t(x_t)]. The sampler is DPM-Solver with different orders.
Note that FID is computed with the PyTorch checkpoint of Inception-v3, while sFID/IS/Precision/Recall are computed with the TensorFlow checkpoint of Inception-v3, following github.com/kynkaat/improved-precision-and-recall-metric.

|  |  | NFE=20 |  |  |  |  | NFE=25 |  |  |  |  | NFE=30 |  |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  |  | FID | sFID | IS | Pre. | Rec. | FID | sFID | IS | Pre. | Rec. | FID | sFID | IS | Pre. | Rec. |
| Base | 1-ord. | 9.72 | 6.03 | 8.49 | 0.641 | 0.542 | 7.89 | 5.45 | 8.68 | 0.644 | 0.556 | 6.84 | 5.12 | 8.76 | 0.650 | 0.565 |
|  | 2-ord. | 4.14 | 4.36 | 9.15 | 0.654 | 0.590 | 3.92 | 4.22 | 9.17 | 0.657 | 0.591 | 3.74 | 4.18 | 9.20 | 0.658 | 0.591 |
|  | 3-ord. | 3.89 | 4.18 | 9.29 | 0.652 | 0.597 | 3.73 | 4.15 | 9.21 | 0.657 | 0.595 | 3.65 | 4.12 | 9.22 | 0.658 | 0.593 |
| Ours | 1-ord. | 8.86 | 6.01 | 8.56 | 0.649 | 0.544 | 7.35 | 5.42 | 8.76 | 0.653 | 0.560 | 6.28 | 5.09 | 8.84 | 0.653 | 0.568 |
|  | 2-ord. | 4.03 | 4.31 | 9.17 | 0.661 | 0.592 | 3.66 | 4.20 | 9.20 | 0.664 | 0.594 | 3.54 | 4.14 | 9.23 | 0.662 | 0.599 |
|  | 3-ord. | 3.32 | 4.14 | 9.38 | 0.657 | 0.603 | 3.33 | 4.11 | 9.28 | 0.665 | 0.597 | 3.35 | 4.08 | 9.27 | 0.662 | 0.600 |

Figure 2: Selected images on CIFAR-10 (generated with NFE = 20 using a 3-order DPM-Solver; left: without calibration, baseline; right: with calibration, ours), demonstrating that our calibration can reduce ambiguous generations, such as generations that resemble both a horse and a dog. However, we must emphasize that not all generated images show a visually discernible difference before and after calibration.

4.2 Model likelihood

As described in Section 3.3, calibration reduces the SM objective, thereby decreasing the upper bound of the KL divergence between the model distribution at timestep t = 0 (either p_0^SDE(θ, η_t*) or p_0^ODE(θ, η_t*)) and the data distribution q_0. Consequently, it helps raise the lower bound of the model likelihood. In this subsection, we examine such effects by evaluating the aforementioned DPMs on CIFAR-10 and CelebA. We also conduct experiments with continuous-time models trained by Karras et al. [22] on the AFHQv2 64×64 [7], FFHQ 64×64 [21], and ImageNet 64×64 [9] datasets, considering their top performance.
These models use the data-prediction parametrization x_θ^t(x_t); for consistency, we convert it to noise prediction via the relationship ϵ_θ^t(x_t) = (x_t − α_t x_θ^t(x_t))/σ_t, as detailed in Kingma et al. [23] and Appendix B.2.

Given that we employ noise-prediction models in practice, we first estimate ½‖E_{q_t(x_t)}[ϵ_θ^t(x_t)]‖₂² at timestep t ∈ [0, T], which reflects the decrement of the SM objective at t according to Eq. (9) (up to a scaling factor of 1/σ_t²). We approximate the expectation using Monte Carlo (MC) estimation with training data points. The results are displayed in the first row of Figure 1. Notably, the value of ½‖E_{q_t(x_t)}[ϵ_θ^t(x_t)]‖₂² varies significantly with timestep t: it decreases w.r.t. t on CelebA but increases in all other cases (except for t ∈ [0.4, 1.0] on ImageNet 64×64). Ideally, there should be ½‖E_{q_t(x_t)}[∇_{x_t} log q_t(x_t)]‖₂² = 0 at any t. Such inconsistency reveals that mis-calibration issues exist in general, although the phenomenon may vary across datasets and training mechanisms.

Figure 3: Visualization of the expected predicted noises with increasing timestep t (from 0 to T) on CelebA 64×64 and FFHQ 64×64. For each dataset, the first row displays E_{q_t(x_t)}[ϵ_θ^t(x_t)] (after normalization) and the second row highlights the top-10% pixels where E_{q_t(x_t)}[ϵ_θ^t(x_t)] has large values. The DPM on CelebA is a discrete-time model with 1000 timesteps [41] and that on FFHQ is a continuous-time one [22].

Then, we quantify the gain of model calibration in increasing the lower bound of the model likelihood, which is ½ ∫₀ᵀ g(t)² ‖E_{q_t(x_t)}[s_θ^t(x_t)]‖₂² dt according to Eq. (11). Rewriting it with the noise-prediction parametrization ϵ_θ^t(x_t), it can be straightforwardly shown to equal ∫₀ᵀ (g(t)²/2σ_t²) ‖E_{q_t(x_t)}[ϵ_θ^t(x_t)]‖₂² dt. Therefore, we calculate the value of (g(t)²/2σ_t²)‖E_{q_t(x_t)}[ϵ_θ^t(x_t)]‖₂² using MC estimation and report the results in the second row of Figure 1.
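The likelihood-gain integral above is easy to approximate numerically once per-timestep calibration norms are available. The sketch below is ours: it assumes a simple VP-type schedule (α_t = e^{−t/2}, σ_t² = 1 − α_t², for which g(t)² = 1) and a hypothetical stand-in for the Monte Carlo curve of Figure 1:

```python
import numpy as np

# Sketch of the likelihood-gain computation in Sec. 4.2:
#   gain = ∫ (g(t)^2 / (2*sigma_t^2)) * ||E[eps_theta]||^2 dt.
t = np.linspace(1e-3, 1.0, 1000)
alpha = np.exp(-0.5 * t)
sigma2 = 1.0 - alpha**2
# g(t)^2 = d(sigma_t^2)/dt - 2*(d log alpha_t/dt)*sigma_t^2; here
# d log alpha_t/dt = -0.5 and d(sigma_t^2)/dt = alpha_t^2, so g^2 = 1.
g2 = alpha**2 + sigma2

# Hypothetical per-timestep calibration norms ||E[eps_theta]||^2 (a stand-in
# for the Monte Carlo estimates plotted in Figure 1).
cal_norm2 = 0.1 * t**2

# Trapezoidal rule for the integral over t.
integrand = g2 / (2.0 * sigma2) * cal_norm2
gain = float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t)))
assert 0.02 < gain < 0.05
```

The 1/σ_t² factor amplifies small-t contributions, which is why the second row of Figure 1 can look quite different from the first even though both come from the same calibration norms.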
The integral is represented by the area under the curve (i.e., the gain of model calibration on the lower bound of the model likelihood). As observed, various datasets and model architectures exhibit non-trivial gains. In addition, we notice that the DPMs trained by Karras et al. [22] show patterns distinct from those of DDPM [17] and DDIM [41], indicating that different DPM training mechanisms may result in different mis-calibration effects.

Visualizing E_{q_t(x_t)}[ϵ_θ^t(x_t)]. To better understand the inductive bias learned by DPMs, we visualize the expected predicted noises E_{q_t(x_t)}[ϵ_θ^t(x_t)] for timesteps from 0 to T, as seen in Figure 3. For each dataset, the first row normalizes the values of E_{q_t(x_t)}[ϵ_θ^t(x_t)] into [0, 255]; the second row calculates the pixel-wise norm (across RGB channels) and highlights the top-10% locations with the highest norm. As we can observe, on facial datasets like CelebA and FFHQ, there are obvious facial patterns inside E_{q_t(x_t)}[ϵ_θ^t(x_t)], while on other datasets like CIFAR-10 and ImageNet, as well as the animal face dataset AFHQv2, the patterns inside E_{q_t(x_t)}[ϵ_θ^t(x_t)] look more like random noise. Besides, the facial patterns in Figure 3 are more significant when t is smaller and become blurry as t approaches T. This phenomenon may be attributed to the bias of finite training data, which is detrimental to generalization during sampling and justifies the importance of calibration as described in Section 3.4.

4.3 Ablation studies

We conduct ablation studies focusing on the estimation methods of E_{q_t(x_t)}[ϵ_θ^t(x_t)].

Estimating E_{q_t(x_t)}[ϵ_θ^t(x_t)] with partial training data. In the post-training calibration setting, our primary algorithmic change is to subtract the calibration term E_{q_t(x_t)}[ϵ_θ^t(x_t)] from the pretrained DPM's output. In the aforementioned studies, the expectation in E_{q_t(x_t)}[ϵ_θ^t(x_t)] (or its variant under other model parametrizations) is approximated with MC estimation using all training images.
However, there may be situations where training data are (partially) inaccessible. To evaluate the effectiveness of our method in these cases, we vary the number of training images used to estimate the calibration term on CIFAR-10. To assess the quality of the estimated calibration term, we sample from the calibrated models using a 3rd-order DPM-Solver running for 20 steps and evaluate the corresponding FID score. The results are listed in the left part of Table 4. As observed, we need the majority of training images (at least 20,000) to estimate the calibration term well. We deduce that this is because CIFAR-10 images are rich in diversity, so a non-trivial number of training images is necessary to cover the various modes and produce a nearly unbiased calibration term.

Estimating $\mathbb{E}_{q_t(x_t)}[\epsilon^t_\theta(x_t)]$ with generated data. In the most extreme case where we do not have access to any training data (e.g., due to privacy concerns), we can still estimate the expectation over $q_t(x_t)$ with data generated from $p_0^{\mathrm{ODE}}(x_0;\theta)$ or $p_0^{\mathrm{SDE}}(x_0;\theta)$. Specifically, under the hypothesis that $p_0^{\mathrm{ODE}}(x_0;\theta) \approx q_0(x_0)$ (DPM-Solver is an ODE-based sampler), we first generate $\widetilde{x}_0 \sim p_0^{\mathrm{ODE}}(x_0;\theta)$

Table 4: Sample quality w.r.t. the number of training images (left part) and generated images (right part) used to estimate the calibration term on CIFAR-10. In the generated-data case, the images used to estimate the calibration term $\mathbb{E}_{q_t(x_t)}[\epsilon^t_\theta(x_t)]$ are crafted with 50 sampling steps by a 3rd-order DPM-Solver.

  Training data        |  Generated data
  # of samples   FID   |  # of samples   FID
  500           55.38  |    2,000       8.80
  1,000         18.72  |    5,000       4.53
  2,000          8.05  |   10,000       3.78
  5,000          4.31  |   20,000       3.31
  10,000         3.47  |   50,000       3.46
  20,000         3.25  |  100,000       3.47
  50,000         3.32  |  200,000       3.46

[Figure 4: Dynamically recording $\mathbb{E}_{q_t(x_t)}[\epsilon^t_\theta(x_t)]$. During training, the mean square error between the ground truth and the outputs of a shallow network recording the calibration terms rapidly decreases, across different timesteps $t$.]
and construct $\widetilde{x}_t = \alpha_t \widetilde{x}_0 + \sigma_t \epsilon$, so that $\widetilde{x}_t \sim p_t^{\mathrm{ODE}}(x_t;\theta)$. The expectation over $q_t(x_t)$ can then be approximated by the expectation over $p_t^{\mathrm{ODE}}(x_t;\theta)$. Empirically, on the CIFAR-10 dataset, we adopt a 3rd-order DPM-Solver to generate a set of samples from the pretrained model of Ho et al. [17], using a relatively large number of sampling steps (e.g., 50). This set of generated data is used to calculate the calibration term $\mathbb{E}_{q_t(x_t)}[\epsilon^t_\theta(x_t)]$. We then obtain the calibrated model $\epsilon^t_\theta(x_t) - \mathbb{E}_{q_t(x_t)}[\epsilon^t_\theta(x_t)]$ and craft new images with a 3rd-order, 20-step DPM-Solver. In the right part of Table 4, we report how the number of generated images influences the quality of model calibration. Under the same sampling setting, we also provide two reference points: 1) the originally mis-calibrated model reaches an FID score of 3.89, and 2) the model calibrated with training data reaches an FID score of 3.32. Comparing these results reveals that a DPM calibrated with a large number of high-quality generations can achieve FID scores comparable to those calibrated with training samples (see the result with 20,000 generated images). Additionally, using more generations does not appear to be advantageous. This may be because the generations from DPMs, despite being known to cover diverse modes, still exhibit semantic redundancy and deviate slightly from the data distribution.

Dynamical recording. We simulate the proposed dynamical recording technique. Specifically, we use a 3-layer MLP of width 512 to parameterize the aforementioned network $h_\phi(t)$ and train it with an Adam optimizer [24] to approximate the expected predicted noises $\mathbb{E}_{q_t(x_t)}[\epsilon^t_\theta(x_t)]$, where $\epsilon^t_\theta(x_t)$ comes from the pretrained noise-prediction model on CIFAR-10 [17]. The training of $h_\phi(t)$ runs for 1,000 epochs.
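As a toy illustration of the recording idea, the sketch below fits a small function of $t$ to a grid of target vectors, using sinusoidal timestep features and a least-squares fit as a simplified stand-in for the paper's Adam-trained 3-layer MLP $h_\phi(t)$; the target curves here are synthetic, not real calibration terms:

```python
import numpy as np

# Toy target curves over t in [0, T]; stand-ins for per-timestep calibration
# terms E_{q_t}[eps_theta(x_t)] that would be MC-estimated from training data.
T = 1.0
ts = np.linspace(0.0, T, 200)
targets = np.stack([np.sin(2 * ts), 0.5 * np.cos(4 * ts)], axis=1)  # (200, 2)

def time_features(t, n_freq=8):
    """Sinusoidal timestep embedding, a common input featurization for h_phi(t)."""
    freqs = 2.0 ** np.arange(n_freq)                 # 1, 2, 4, ..., 128
    ang = np.outer(t, freqs)
    return np.concatenate([np.sin(ang), np.cos(ang), np.ones((len(t), 1))], axis=1)

# Fit h_phi(t) = features(t) @ W by least squares -- a simplified stand-in for
# the Adam-trained MLP; the point is only that a tiny function of t suffices to
# record the calibration curve accurately.
Phi = time_features(ts)
W, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
mse = np.mean((Phi @ W - targets) ** 2)              # analogue of Figure 4's MSE
```

In the paper's setting, the network is trained alongside the DPM and queried at sampling time instead of recomputing the MC averages.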
Meanwhile, using the training data, we compute the expected predicted noises with MC estimation and treat them as the ground truth. In Figure 4, we compare them to the outputs of $h_\phi(t)$ and visualize the disparity measured by mean square error. As demonstrated, as the number of training epochs increases, the network $h_\phi(t)$ quickly converges to a relatively reliable approximation of the ground truth. A distinct advantage of dynamical recording is that it can be performed during the training of DPMs, enabling immediate generation. We note that better timestep-embedding techniques and network architectures could further improve the approximation quality.

5 Discussion

We propose a straightforward method for calibrating any pretrained DPM that provably reduces the values of SM objectives and consequently induces higher lower bounds on model likelihood. We demonstrate that the mis-calibration of DPMs may be inherent, stemming from dataset bias and/or sub-optimally learned model scores. Our findings also suggest a potentially new metric for assessing a diffusion model by its degree of mis-calibration, namely, how far the learned scores deviate from their essential properties (e.g., the expected data scores should be zero).

Limitations. While our calibration method provably improves the model's likelihood, it does not necessarily yield a lower FID score, as previously discussed [45]. Besides, for text-to-image generation, post-training computation of $\mathbb{E}_{q_t(x_t|y)}[s^t_\theta(x_t, y)]$ becomes infeasible due to the exponentially large number of conditions $y$, necessitating dynamical recording with multimodal modules.

Acknowledgements

Zhijie Deng was supported by Natural Science Foundation of Shanghai (No. 23ZR1428700) and the Key Research and Development Program of Shandong Province, China (No. 2023CXGC010112).

[1] Kazuoki Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19(3):357–367, 1967.
[2] Arpit Bansal, Eitan Borgnia, Hong-Min Chu, Jie S Li, Hamid Kazemi, Furong Huang, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Cold diffusion: Inverting arbitrary image transforms without noise. arXiv preprint arXiv:2208.09392, 2022.
[3] Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-DPM: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. In International Conference on Learning Representations (ICLR), 2022.
[4] Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, Zekun Hao, Serge Belongie, Noah Snavely, and Bharath Hariharan. Learning gradient fields for shape generation. In European Conference on Computer Vision (ECCV), 2020.
[5] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating gradients for waveform generation. In International Conference on Learning Representations (ICLR), 2021.
[6] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
[7] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image synthesis for multiple domains. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[8] Giannis Daras, Mauricio Delbracio, Hossein Talebi, Alexandros G Dimakis, and Peyman Milanfar. Soft diffusion: Score matching for general corruptions. arXiv preprint arXiv:2209.05442, 2022.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[10] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[11] Joseph L Doob. Stochastic processes, volume 7. Wiley, New York, 1953.
[12] Bradley Efron. Tweedie's formula and selection bias.
Journal of the American Statistical Association, 106(496):1602–1614, 2011.
[13] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[14] Geoffrey Grimmett and David Stirzaker. Probability and random processes. Oxford University Press, 2001.
[15] Xiangming Gu, Chao Du, Tianyu Pang, Chongxuan Li, Min Lin, and Ye Wang. On memorization in diffusion models. arXiv preprint arXiv:2310.02664, 2023.
[16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), pages 6626–6637, 2017.
[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[18] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[19] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022.
[20] Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research (JMLR), 6(Apr):695–709, 2005.
[21] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[22] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[23] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[25] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[26] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. In International Conference on Learning Representations (ICLR), 2022.
[27] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In International Conference on Computer Vision (ICCV), 2015.
[28] Cheng Lu, Kaiwen Zheng, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Maximum likelihood training for score-based diffusion ODEs by high order denoising score matching. In International Conference on Machine Learning (ICML), 2022.
[29] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[30] Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. Generating images with sparse representations. arXiv preprint arXiv:2103.03841, 2021.
[31] Tianyu Pang, Kun Xu, Chongxuan Li, Yang Song, Stefano Ermon, and Jun Zhu. Efficient learning of generative models via finite-difference score matching. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[32] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[33] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models.
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[34] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
[35] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022.
[36] Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
[37] Tim Salimans and Jonathan Ho. Should EBMs model the energy or the score? In Energy Based Models Workshop, ICLR, 2021.
[38] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations (ICLR), 2022.
[39] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
[40] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning (ICML), pages 2256–2265. PMLR, 2015.
[41] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021.
[42] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems (NeurIPS), pages 11895–11907, 2019.
[43] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[44] Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Conference on Uncertainty in Artificial Intelligence (UAI), 2019.
[45] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[46] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021.
[47] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
[48] Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[49] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
[50] Zekai Wang, Tianyu Pang, Chao Du, Min Lin, Weiwei Liu, and Shuicheng Yan. Better diffusion models further improve adversarial training. In International Conference on Machine Learning (ICML), 2023.
[51] Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. GeoDiff: A geometric diffusion model for molecular conformation generation. In International Conference on Learning Representations (ICLR), 2022.
[52] Kexun Zhang, Xianjun Yang, William Yang Wang, and Lei Li. ReDi: Efficient learning-free diffusion inference via trajectory retrieval. arXiv preprint arXiv:2302.02285, 2023.

A Detailed derivations

In this section, we provide detailed derivations for the theorem and equations shown in the main text. We follow the regularity assumptions listed in Song et al. [45].

A.1 Proof of Theorem 1

Proof.
For any two timesteps $0 \le s < t \le T$, the transition probability from $x_s$ to $x_t$ is written as $q_{st}(x_t|x_s) = \mathcal{N}\big(x_t \mid \alpha_{t|s} x_s,\, \sigma_{t|s}^2 I\big)$, where $\alpha_{t|s} = \frac{\alpha_t}{\alpha_s}$ and $\sigma_{t|s}^2 = \sigma_t^2 - \alpha_{t|s}^2 \sigma_s^2$. The marginal distribution satisfies $q_t(x_t) = \int q_{st}(x_t|x_s)\, q_s(x_s)\,\mathrm{d}x_s$, and we have
$$
\begin{aligned}
\nabla_{x_t} \log q_t(x_t)
&= \frac{1}{\alpha_{t|s}} \nabla_{\alpha_{t|s}^{-1} x_t} \log \Big( \frac{1}{\alpha_{t|s}^{k}}\, \mathbb{E}_{\mathcal{N}(x_s \mid \alpha_{t|s}^{-1} x_t,\, \alpha_{t|s}^{-2}\sigma_{t|s}^2 I)}\big[q_s(x_s)\big] \Big) \\
&= \frac{1}{\alpha_{t|s}} \nabla_{\alpha_{t|s}^{-1} x_t} \log \mathbb{E}_{\mathcal{N}(\eta \mid 0,\, \alpha_{t|s}^{-2}\sigma_{t|s}^2 I)}\big[q_s(\alpha_{t|s}^{-1} x_t + \eta)\big] \\
&= \frac{\mathbb{E}_{\mathcal{N}(\eta \mid 0,\, \alpha_{t|s}^{-2}\sigma_{t|s}^2 I)}\big[\nabla_{\alpha_{t|s}^{-1} x_t} q_s(\alpha_{t|s}^{-1} x_t + \eta)\big]}{\alpha_{t|s}\, \mathbb{E}_{\mathcal{N}(\eta \mid 0,\, \alpha_{t|s}^{-2}\sigma_{t|s}^2 I)}\big[q_s(\alpha_{t|s}^{-1} x_t + \eta)\big]} \\
&= \frac{\mathbb{E}_{\mathcal{N}(\eta \mid 0,\, \alpha_{t|s}^{-2}\sigma_{t|s}^2 I)}\big[q_s(\alpha_{t|s}^{-1} x_t + \eta)\, \nabla_{\alpha_{t|s}^{-1} x_t + \eta} \log q_s(\alpha_{t|s}^{-1} x_t + \eta)\big]}{\alpha_{t|s}\, \mathbb{E}_{\mathcal{N}(\eta \mid 0,\, \alpha_{t|s}^{-2}\sigma_{t|s}^2 I)}\big[q_s(\alpha_{t|s}^{-1} x_t + \eta)\big]} \\
&= \frac{\mathbb{E}_{\mathcal{N}(x_s \mid \alpha_{t|s}^{-1} x_t,\, \alpha_{t|s}^{-2}\sigma_{t|s}^2 I)}\big[q_s(x_s)\, \nabla_{x_s} \log q_s(x_s)\big]}{\alpha_{t|s}\, \mathbb{E}_{\mathcal{N}(x_s \mid \alpha_{t|s}^{-1} x_t,\, \alpha_{t|s}^{-2}\sigma_{t|s}^2 I)}\big[q_s(x_s)\big]} \\
&= \frac{\int \mathcal{N}\big(x_t \mid \alpha_{t|s} x_s,\, \sigma_{t|s}^2 I\big)\, q_s(x_s)\, \nabla_{x_s} \log q_s(x_s)\,\mathrm{d}x_s}{\alpha_{t|s} \int \mathcal{N}\big(x_t \mid \alpha_{t|s} x_s,\, \sigma_{t|s}^2 I\big)\, q_s(x_s)\,\mathrm{d}x_s} \\
&= \frac{1}{\alpha_{t|s}}\, \mathbb{E}_{q_{st}(x_s|x_t)}\big[\nabla_{x_s} \log q_s(x_s)\big].
\end{aligned}
$$
Note that when the transition probability $q_{st}(x_t|x_s)$ corresponds to a well-defined forward process, we have $\alpha_t > 0$ for $t \in [0, T]$, and thus we achieve $\alpha_t \nabla_{x_t} \log q_t(x_t) = \mathbb{E}_{q_{st}(x_s|x_t)}\big[\alpha_s \nabla_{x_s} \log q_s(x_s)\big]$.

A.2 Proof of $\mathbb{E}_{q_0(x_0)}[\nabla_{x_0} \log q_0(x_0)] = 0$

Proof. The input variable $x_0 \in \mathbb{R}^k$ and $q_0(x_0) \in \mathcal{C}^2$, where $\mathcal{C}^2$ denotes the family of functions with continuous second-order derivatives.¹ Using $x_0^i$ to denote the $i$-th element of $x_0$, we can derive the expectation
$$
\begin{aligned}
\mathbb{E}_{q_0(x_0)}\big[\partial_{x_0^i} \log q_0(x_0)\big]
&= \int \!\cdots\! \int q_0(x_0)\, \partial_{x_0^i} \log q_0(x_0)\,\mathrm{d}x_0^1 \mathrm{d}x_0^2 \cdots \mathrm{d}x_0^k
= \int \!\cdots\! \int \partial_{x_0^i} q_0(x_0)\,\mathrm{d}x_0^1 \mathrm{d}x_0^2 \cdots \mathrm{d}x_0^k \\
&= \int \frac{\mathrm{d}}{\mathrm{d}x_0^i} \Big( \int q_0(x_0^i, x_0^{\setminus i})\,\mathrm{d}x_0^{\setminus i} \Big)\,\mathrm{d}x_0^i
= \int \frac{\mathrm{d}}{\mathrm{d}x_0^i}\, q_0(x_0^i)\,\mathrm{d}x_0^i = 0,
\end{aligned}
$$
where $x_0^{\setminus i}$ denotes all the $k-1$ elements in $x_0$ except for the $i$-th one. The last equality holds under the boundary condition that $\lim_{x_0^i \to \pm\infty} q_0(x_0^i) = 0$ for any $i \in [k]$. Thus, we achieve the conclusion that $\mathbb{E}_{q_0(x_0)}[\nabla_{x_0} \log q_0(x_0)] = 0$.

¹This continuously differentiable assumption can be satisfied by adding a small Gaussian noise (e.g., with variance of 0.0001) to the original data distribution, as done in Song and Ermon [42].
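The zero-mean property of the data score is easy to sanity-check numerically. For a 1-D Gaussian $\mathcal{N}(\mu, \sigma^2)$, the score is $(\mu - x)/\sigma^2$, and its expectation under the same Gaussian is exactly zero; a quick MC check (a toy distribution, not real image data):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7                        # toy 1-D Gaussian q_0

x = rng.normal(mu, sigma, size=2_000_000)   # samples x_0 ~ q_0
score = (mu - x) / sigma ** 2               # d/dx log q_0(x) for the Gaussian
est = score.mean()                          # MC estimate of E_{q_0}[score], ~ 0
```

The same check applies coordinate-wise to any density with vanishing tails, which is exactly the boundary condition used in the proof above.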
A.3 Concentration bounds

We describe concentration bounds [11, 1] for the martingale $\alpha_t \nabla_{x_t} \log q_t(x_t)$.

Azuma's inequality. For discrete reverse timesteps $t = T, T-1, \cdots, 0$, assume that there exist constants $0 < c_1, c_2, \cdots, c_T < \infty$ such that, for the $i$-th element of $x$,
$$
A_t \le \alpha_{t-1}\, \partial_{x_{t-1}^i} \log q_{t-1}(x_{t-1}) - \alpha_t\, \partial_{x_t^i} \log q_t(x_t) \le B_t \quad \text{and} \quad B_t - A_t \le c_t \tag{17}
$$
almost surely. Then for any $\epsilon > 0$, the probability (note that $\alpha_0 = 1$)
$$
P\Big( \big| \partial_{x_0^i} \log q_0(x_0) - \alpha_T\, \partial_{x_T^i} \log q_T(x_T) \big| \ge \epsilon \Big) \le 2 \exp\!\Big( \frac{-2\epsilon^2}{\sum_{t=1}^{T} c_t^2} \Big).
$$
In particular, considering that $q_T(x_T) \approx \mathcal{N}(x_T \mid 0, \widetilde{\sigma}^2 I)$, there is $\partial_{x_T^i} \log q_T(x_T) \approx -\frac{x_T^i}{\widetilde{\sigma}^2}$. Thus, we can approximately obtain
$$
P\Big( \Big| \partial_{x_0^i} \log q_0(x_0) + \alpha_T \frac{x_T^i}{\widetilde{\sigma}^2} \Big| \ge \epsilon \Big) \le 2 \exp\!\Big( \frac{-2\epsilon^2}{\sum_{t=1}^{T} c_t^2} \Big).
$$

Doob's inequality. For continuous reverse timestep $t$ from $T$ to $0$, if the sample paths of the martingale are almost surely right-continuous, then for the $i$-th element of $x$ we have (note that $\alpha_0 = 1$)
$$
P\Big( \sup_{0 \le t \le T} \alpha_t\, \partial_{x_t^i} \log q_t(x_t) \ge C \Big) \le \frac{1}{C}\, \mathbb{E}_{q_0(x_0)}\Big[ \max\big( \partial_{x_0^i} \log q_0(x_0),\, 0 \big) \Big].
$$

A.4 High-order SM objectives

Lu et al. [28] show that the KL divergence $D_{\mathrm{KL}}\big(q_0 \,\|\, p_0^{\mathrm{ODE}}(\theta)\big)$ can be bounded as
$$
D_{\mathrm{KL}}\big(q_0 \,\|\, p_0^{\mathrm{ODE}}(\theta)\big) \le D_{\mathrm{KL}}(q_T \,\|\, p_T) + \sqrt{\mathcal{J}_{\mathrm{SM}}(\theta; g(t)^2)}\, \sqrt{\mathcal{J}_{\mathrm{Fisher}}(\theta)}, \tag{21}
$$
where $\mathcal{J}_{\mathrm{Fisher}}(\theta)$ is a weighted sum of the Fisher divergence between $q_t(x_t)$ and $p_t^{\mathrm{ODE}}(\theta)$:
$$
\mathcal{J}_{\mathrm{Fisher}}(\theta) = \frac{1}{2} \int_0^T g(t)^2\, D_{\mathrm{F}}\big(q_t \,\|\, p_t^{\mathrm{ODE}}(\theta)\big)\,\mathrm{d}t. \tag{22}
$$
Moreover, Lu et al. [28] prove that if for all $t \in [0, T]$ and $x_t \in \mathbb{R}^k$ there exists a constant $C_F$ such that the spectral norm of the Hessian satisfies $\big\| \nabla_{x_t}^2 \log p_t^{\mathrm{ODE}}(x_t; \theta) \big\|_2 \le C_F$, and there exist $\delta_1, \delta_2, \delta_3 > 0$ such that
$$
\begin{aligned}
\big\| s_\theta^t(x_t) - \nabla_{x_t} \log q_t(x_t) \big\|_2 &\le \delta_1, \\
\big\| \nabla_{x_t} s_\theta^t(x_t) - \nabla_{x_t}^2 \log q_t(x_t) \big\|_F &\le \delta_2, \\
\big\| \nabla_{x_t} \mathrm{tr}\big(\nabla_{x_t} s_\theta^t(x_t)\big) - \nabla_{x_t} \mathrm{tr}\big(\nabla_{x_t}^2 \log q_t(x_t)\big) \big\|_2 &\le \delta_3,
\end{aligned}
$$
where $\|\cdot\|_F$ is the Frobenius norm of a matrix, then there exists a function $U(t; \delta_1, \delta_2, \delta_3, q)$, independent of $\theta$ and strictly increasing (if $g(t) \neq 0$) w.r.t. $\delta_1$, $\delta_2$, and $\delta_3$, respectively, such that the Fisher divergence can be bounded as $D_{\mathrm{F}}\big(q_t \,\|\, p_t^{\mathrm{ODE}}(\theta)\big) \le U(t; \delta_1, \delta_2, \delta_3, q)$.

The case after calibration. When we impose the calibration term $\eta_t^* = \mathbb{E}_{q_t(x_t)}[s_\theta^t(x_t)]$ to get the score model $s_\theta^t(x_t) - \eta_t^*$, there is $\nabla_{x_t} \eta_t^* = 0$ and thus $\nabla_{x_t}\big(s_\theta^t(x_t) - \eta_t^*\big) = \nabla_{x_t} s_\theta^t(x_t)$.
Then we have
$$
\begin{aligned}
\big\| s_\theta^t(x_t) - \eta_t^* - \nabla_{x_t} \log q_t(x_t) \big\|_2 &\le \delta_1' \le \delta_1, \\
\big\| \nabla_{x_t}\big(s_\theta^t(x_t) - \eta_t^*\big) - \nabla_{x_t}^2 \log q_t(x_t) \big\|_F &\le \delta_2, \\
\big\| \nabla_{x_t} \mathrm{tr}\big(\nabla_{x_t}\big(s_\theta^t(x_t) - \eta_t^*\big)\big) - \nabla_{x_t} \mathrm{tr}\big(\nabla_{x_t}^2 \log q_t(x_t)\big) \big\|_2 &\le \delta_3.
\end{aligned}
$$
From these, we know that the Fisher divergence satisfies $D_{\mathrm{F}}\big(q_t \,\|\, p_t^{\mathrm{ODE}}(\theta, \eta_t^*)\big) \le U(t; \delta_1', \delta_2, \delta_3, q) \le U(t; \delta_1, \delta_2, \delta_3, q)$; namely, $D_{\mathrm{F}}\big(q_t \,\|\, p_t^{\mathrm{ODE}}(\theta, \eta_t^*)\big)$ has a lower upper bound compared to $D_{\mathrm{F}}\big(q_t \,\|\, p_t^{\mathrm{ODE}}(\theta)\big)$. Consequently, we obtain lower upper bounds for both $\mathcal{J}_{\mathrm{Fisher}}(\theta, \eta_t^*)$ and $D_{\mathrm{KL}}\big(q_0 \,\|\, p_0^{\mathrm{ODE}}(\theta, \eta_t^*)\big)$, compared to $\mathcal{J}_{\mathrm{Fisher}}(\theta)$ and $D_{\mathrm{KL}}\big(q_0 \,\|\, p_0^{\mathrm{ODE}}(\theta)\big)$, respectively.

B Model parametrization

This section introduces different parametrizations used in diffusion models and provides their calibrated instantiations.

B.1 Preliminary

Along the research routine of diffusion models, different model parametrizations have been used, including score prediction $s_\theta^t(x_t)$ [42, 46], noise prediction $\epsilon_\theta^t(x_t)$ [17, 33], data prediction $x_\theta^t(x_t)$ [23, 32], and velocity prediction $v_\theta^t(x_t)$ [38, 18]. Taking the DSM objective as the training loss, its instantiation at timestep $t \in [0, T]$ is written as
$$
\mathcal{J}_{\mathrm{DSM}}^t(\theta) =
\begin{cases}
\frac{1}{2}\, \mathbb{E}_{q_0(x_0), q(\epsilon)}\Big[ \big\| s_\theta^t(x_t) + \frac{\epsilon}{\sigma_t} \big\|_2^2 \Big], & \text{score prediction;} \\[4pt]
\frac{\alpha_t^2}{2\sigma_t^4}\, \mathbb{E}_{q_0(x_0), q(\epsilon)}\Big[ \big\| x_\theta^t(x_t) - x_0 \big\|_2^2 \Big], & \text{data prediction;} \\[4pt]
\frac{1}{2\sigma_t^2}\, \mathbb{E}_{q_0(x_0), q(\epsilon)}\Big[ \big\| \epsilon_\theta^t(x_t) - \epsilon \big\|_2^2 \Big], & \text{noise prediction;} \\[4pt]
\frac{\alpha_t^2}{2\sigma_t^2}\, \mathbb{E}_{q_0(x_0), q(\epsilon)}\Big[ \big\| v_\theta^t(x_t) - (\alpha_t \epsilon - \sigma_t x_0) \big\|_2^2 \Big], & \text{velocity prediction.}
\end{cases}
$$

B.2 Calibrated instantiation

Under different model parametrizations, we can derive the optimal calibration terms $\eta_t^*$ that minimize $\mathcal{J}_{\mathrm{DSM}}^t(\theta, \eta_t)$ as
$$
\eta_t^* =
\begin{cases}
\mathbb{E}_{q_t(x_t)}\big[ s_\theta^t(x_t) \big], & \text{score prediction;} \\[2pt]
\mathbb{E}_{q_t(x_t)}\big[ x_\theta^t(x_t) \big] - \mathbb{E}_{q_0(x_0)}[x_0], & \text{data prediction;} \\[2pt]
\mathbb{E}_{q_t(x_t)}\big[ \epsilon_\theta^t(x_t) \big], & \text{noise prediction;} \\[2pt]
\mathbb{E}_{q_t(x_t)}\big[ v_\theta^t(x_t) \big] + \sigma_t\, \mathbb{E}_{q_0(x_0)}[x_0], & \text{velocity prediction.}
\end{cases}
$$
Substituting $\eta_t^*$ into $\mathcal{J}_{\mathrm{DSM}}^t(\theta, \eta_t)$, we obtain the gap
$$
\mathcal{J}_{\mathrm{DSM}}^t(\theta) - \mathcal{J}_{\mathrm{DSM}}^t(\theta, \eta_t^*) =
\begin{cases}
\frac{1}{2} \big\| \mathbb{E}_{q_t(x_t)}[ s_\theta^t(x_t) ] \big\|_2^2, & \text{score prediction;} \\[4pt]
\frac{\alpha_t^2}{2\sigma_t^4} \big\| \mathbb{E}_{q_t(x_t)}[ x_\theta^t(x_t) ] - \mathbb{E}_{q_0(x_0)}[x_0] \big\|_2^2, & \text{data prediction;} \\[4pt]
\frac{1}{2\sigma_t^2} \big\| \mathbb{E}_{q_t(x_t)}[ \epsilon_\theta^t(x_t) ] \big\|_2^2, & \text{noise prediction;} \\[4pt]
\frac{\alpha_t^2}{2\sigma_t^2} \big\| \mathbb{E}_{q_t(x_t)}[ v_\theta^t(x_t) ] + \sigma_t\, \mathbb{E}_{q_0(x_0)}[x_0] \big\|_2^2, & \text{velocity prediction.}
\end{cases}
$$
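The noise-prediction row of the gap formula can be checked numerically: subtracting the mean prediction $\eta_t^* = \mathbb{E}_{q_t(x_t)}[\epsilon_\theta^t(x_t)]$ lowers the DSM loss by $\frac{1}{2\sigma_t^2}\|\eta_t^*\|_2^2$. A sketch with synthetic joint samples of (model output, true noise) standing in for a real DPM and $q_t$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_t = 0.6  # hypothetical noise level at timestep t

# Synthetic joint samples: a deliberately biased "model output" paired with
# the true noise eps it should predict (stand-ins for a real DPM and q_t).
n = 400_000
eps = rng.standard_normal((n, 3))
model_out = 0.9 * eps + 0.2 + 0.05 * rng.standard_normal((n, 3))

def dsm(eta):
    # DSM loss of the calibrated model eps_theta - eta (noise-prediction case).
    return np.mean(np.sum((model_out - eta - eps) ** 2, axis=1)) / (2 * sigma_t ** 2)

eta_star = model_out.mean(axis=0)            # MC estimate of E_{q_t}[eps_theta(x_t)]
gap = dsm(np.zeros(3)) - dsm(eta_star)       # actual decrease in DSM loss
predicted_gap = np.sum(eta_star ** 2) / (2 * sigma_t ** 2)   # (1/2 sigma_t^2) ||eta*||^2
```

Up to MC error, `gap` matches `predicted_gap`, mirroring the noise-prediction case above; the other parametrizations follow the same completing-the-square argument.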
C Visualization of the generations

We further show generated images in Figure 5 to confirm the efficacy of our calibration method. Our calibration helps reduce ambiguous generations on both CIFAR-10 and CelebA.

[Figure 5: Unconditional generation results on CIFAR-10 and CelebA using models from [17] and [41], respectively: (a) CIFAR-10, w/ calibration; (b) CIFAR-10, w/o calibration; (c) CelebA, w/ calibration; (d) CelebA, w/o calibration. The number of sampling steps is 20, based on the results in Tables 1 and 2.]
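Operationally, sampling from a calibrated DPM only requires wrapping the pretrained model so that the precomputed calibration term is subtracted at every timestep; in the sketch below, the base model and per-timestep terms are hypothetical stand-ins:

```python
import numpy as np

class CalibratedModel:
    """Subtract a precomputed calibration term eta[t] ~ E_{q_t}[eps_theta(x_t)]
    from a pretrained noise-prediction model at each timestep."""
    def __init__(self, base_model, eta):
        self.base_model = base_model   # callable (x_t, t) -> predicted noise
        self.eta = eta                 # dict: timestep -> calibration vector

    def __call__(self, x_t, t):
        return self.base_model(x_t, t) - self.eta[t]

# Toy check: a model whose output carries a constant bias has it removed.
base = lambda x_t, t: np.zeros_like(x_t) + 0.3   # hypothetical biased model
calibrated = CalibratedModel(base, {10: np.full(4, 0.3)})
out = calibrated(np.zeros(4), 10)                # bias cancelled
```

Any off-the-shelf sampler (e.g., DPM-Solver) can then call the wrapper in place of the original model; the calibration terms are computed once and reused across all sampling runs.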