Published as a conference paper at ICLR 2021

DENOISING DIFFUSION IMPLICIT MODELS

Jiaming Song, Chenlin Meng & Stefano Ermon
Stanford University
{tsong,chenlin,ermon}@cs.stanford.edu

ABSTRACT

Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps in order to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a particular Markovian diffusion process. We generalize DDPMs via a class of non-Markovian diffusion processes that lead to the same training objective. These non-Markovian processes can correspond to generative processes that are deterministic, giving rise to implicit models that produce high quality samples much faster. We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, perform semantically meaningful image interpolation directly in the latent space, and reconstruct observations with very low error.

1 INTRODUCTION

Deep generative models have demonstrated the ability to produce high quality samples in many domains (Karras et al., 2020; van den Oord et al., 2016a). In terms of image generation, generative adversarial networks (GANs, Goodfellow et al. (2014)) currently exhibit higher sample quality than likelihood-based methods such as variational autoencoders (Kingma & Welling, 2013), autoregressive models (van den Oord et al., 2016b) and normalizing flows (Rezende & Mohamed, 2015; Dinh et al., 2016). However, GANs require very specific choices in optimization and architectures in order to stabilize training (Arjovsky et al., 2017; Gulrajani et al., 2017; Karras et al., 2018; Brock et al., 2018), and could fail to cover modes of the data distribution (Zhao et al., 2018).

Recent works on iterative generative models (Bengio et al., 2014), such as denoising diffusion probabilistic models (DDPM, Ho et al. (2020)) and noise conditional score networks (NCSN, Song & Ermon (2019)), have demonstrated the ability to produce samples comparable to those of GANs, without having to perform adversarial training. To achieve this, many denoising autoencoding models are trained to denoise samples corrupted by various levels of Gaussian noise. Samples are then produced by a Markov chain which, starting from white noise, progressively denoises it into an image. This generative Markov chain is either based on Langevin dynamics (Song & Ermon, 2019) or obtained by reversing a forward diffusion process that progressively turns an image into noise (Sohl-Dickstein et al., 2015).

A critical drawback of these models is that they require many iterations to produce a high quality sample. For DDPMs, this is because the generative process (from noise to data) approximates the reverse of the forward diffusion process (from data to noise), which could have thousands of steps; iterating over all the steps is required to produce a single sample, which is much slower than GANs, which need only one pass through a network. For example, it takes around 20 hours to sample 50k images of size 32×32 from a DDPM, but less than a minute to do so from a GAN on a Nvidia 2080 Ti GPU.
This becomes more problematic for larger images, as sampling 50k images of size 256×256 could take nearly 1000 hours on the same GPU. To close this efficiency gap between DDPMs and GANs, we present denoising diffusion implicit models (DDIMs). DDIMs are implicit probabilistic models (Mohamed & Lakshminarayanan, 2016) and are closely related to DDPMs, in the sense that they are trained with the same objective function.

Figure 1: Graphical models for diffusion (left) and non-Markovian (right) inference models.

In Section 3, we generalize the forward diffusion process used by DDPMs, which is Markovian, to non-Markovian ones, for which we are still able to design suitable reverse generative Markov chains. We show that the resulting variational training objectives have a shared surrogate objective, which is exactly the objective used to train DDPMs. Therefore, we can freely choose from a large family of generative models using the same neural network, simply by choosing a different, non-Markovian diffusion process (Section 4.1) and the corresponding reverse generative Markov chain. In particular, we are able to use non-Markovian diffusion processes that lead to short generative Markov chains (Section 4.2), which can be simulated in a small number of steps. This can massively increase sample efficiency at only a minor cost in sample quality.

In Section 5, we demonstrate several empirical benefits of DDIMs over DDPMs. First, DDIMs have superior sample generation quality compared to DDPMs when we accelerate sampling by 10× to 100× using our proposed method. Second, DDIM samples have the following consistency property, which does not hold for DDPMs: if we start with the same initial latent variable and generate several samples with Markov chains of various lengths, these samples have similar high-level features. Third, because of this consistency, DDIMs can perform semantically meaningful image interpolation by manipulating the initial latent variable, unlike DDPMs, which interpolate near the image space due to the stochastic generative process.

2 BACKGROUND

Given samples from a data distribution $q(x_0)$, we are interested in learning a model distribution $p_\theta(x_0)$ that approximates $q(x_0)$ and is easy to sample from. Denoising diffusion probabilistic models (DDPMs, Sohl-Dickstein et al. (2015); Ho et al. (2020)) are latent variable models of the form

$$p_\theta(x_0) = \int p_\theta(x_{0:T}) \, dx_{1:T}, \quad \text{where} \quad p_\theta(x_{0:T}) := p_\theta(x_T) \prod_{t=1}^{T} p_\theta^{(t)}(x_{t-1} \mid x_t) \tag{1}$$

where $x_1, \ldots, x_T$ are latent variables in the same sample space as $x_0$ (denoted as $\mathcal{X}$). The parameters $\theta$ are learned to fit the data distribution $q(x_0)$ by maximizing a variational lower bound:

$$\max_\theta \mathbb{E}_{q(x_0)}[\log p_\theta(x_0)] \geq \max_\theta \mathbb{E}_{q(x_0, x_1, \ldots, x_T)}\left[\log p_\theta(x_{0:T}) - \log q(x_{1:T} \mid x_0)\right] \tag{2}$$

where $q(x_{1:T} \mid x_0)$ is some inference distribution over the latent variables. Unlike typical latent variable models (such as the variational autoencoder (Rezende et al., 2014)), DDPMs are learned with a fixed (rather than trainable) inference procedure $q(x_{1:T} \mid x_0)$, and the latent variables are relatively high dimensional. For example, Ho et al. (2020) considered the following Markov chain with Gaussian transitions parameterized by a decreasing sequence $\alpha_{1:T} \in (0, 1]^T$:

$$q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \quad \text{where} \quad q(x_t \mid x_{t-1}) := \mathcal{N}\!\left(\sqrt{\frac{\alpha_t}{\alpha_{t-1}}}\, x_{t-1},\ \left(1 - \frac{\alpha_t}{\alpha_{t-1}}\right) I\right) \tag{3}$$

where the covariance matrix is ensured to have positive terms on its diagonal. This is called the forward process due to the autoregressive nature of the sampling procedure (from $x_0$ to $x_T$).
We call the latent variable model $p_\theta(x_{0:T})$, which is a Markov chain that samples from $x_T$ to $x_0$, the generative process, since it approximates the intractable reverse process $q(x_{t-1} \mid x_t)$. Intuitively, the forward process progressively adds noise to the observation $x_0$, whereas the generative process progressively denoises a noisy observation (Figure 1, left).

A special property of the forward process is that

$$q(x_t \mid x_0) := \int q(x_{1:t} \mid x_0) \, dx_{1:(t-1)} = \mathcal{N}(x_t;\ \sqrt{\alpha_t}\, x_0,\ (1 - \alpha_t) I);$$

so we can express $x_t$ as a linear combination of $x_0$ and a noise variable $\epsilon$:

$$x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t}\, \epsilon, \quad \text{where} \quad \epsilon \sim \mathcal{N}(0, I). \tag{4}$$

When we set $\alpha_T$ sufficiently close to 0, $q(x_T \mid x_0)$ converges to a standard Gaussian for all $x_0$, so it is natural to set $p_\theta(x_T) := \mathcal{N}(0, I)$. If all the conditionals are modeled as Gaussians with trainable mean functions and fixed variances, the objective in Eq. (2) can be simplified to[1]:

$$L_\gamma(\epsilon_\theta) := \sum_{t=1}^{T} \gamma_t\, \mathbb{E}_{x_0 \sim q(x_0),\ \epsilon_t \sim \mathcal{N}(0, I)}\left[\left\|\epsilon_\theta^{(t)}\!\left(\sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t}\, \epsilon_t\right) - \epsilon_t\right\|_2^2\right] \tag{5}$$

where $\epsilon_\theta := \{\epsilon_\theta^{(t)}\}_{t=1}^{T}$ is a set of $T$ functions, each $\epsilon_\theta^{(t)} : \mathcal{X} \to \mathcal{X}$ (indexed by $t$) is a function with trainable parameters $\theta^{(t)}$, and $\gamma := [\gamma_1, \ldots, \gamma_T]$ is a vector of positive coefficients in the objective that depends on $\alpha_{1:T}$. In Ho et al. (2020), the objective with $\gamma = \mathbf{1}$ is optimized instead to maximize generation performance of the trained model; this is also the same objective used in noise conditional score networks (Song & Ermon, 2019) based on score matching (Hyvärinen, 2005; Vincent, 2011). From a trained model, $x_0$ is sampled by first sampling $x_T$ from the prior $p_\theta(x_T)$, and then sampling $x_{t-1}$ from the generative processes iteratively.

[1] Please refer to Appendix C.2 for details.

The length $T$ of the forward process is an important hyperparameter in DDPMs. From a variational perspective, a large $T$ allows the reverse process to be close to a Gaussian (Sohl-Dickstein et al., 2015), so that the generative process modeled with Gaussian conditional distributions becomes a good approximation; this motivates the choice of large $T$ values, such as $T = 1000$ in Ho et al. (2020). However, as all $T$ iterations have to be performed sequentially, instead of in parallel, to obtain a sample $x_0$, sampling from DDPMs is much slower than sampling from other deep generative models, which makes them impractical for tasks where compute is limited and latency is critical.
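For concreteness, the simplified objective with $\gamma = \mathbf{1}$ can be implemented in a few lines. The sketch below is a minimal illustration under this paper's notation (our $\alpha_t$ is Ho et al.'s $\bar{\alpha}_t$; see Appendix C.2); `eps_model`, a noise-prediction network taking $(x_t, t)$, is a hypothetical stand-in and not part of the paper.

```python
# Minimal sketch of one DDPM training step with the L_1 objective (Eqs. (4)-(5),
# gamma = 1). `eps_model` is a hypothetical noise-prediction network; `alphas` is
# the decreasing sequence alpha_1..alpha_T in (0, 1], stored 0-indexed.
import torch

def ddpm_training_loss(eps_model, x0, alphas):
    b = x0.shape[0]
    t = torch.randint(0, alphas.shape[0], (b,), device=x0.device)  # uniform timestep
    a_t = alphas[t].view(b, *([1] * (x0.dim() - 1)))               # broadcast over pixels
    eps = torch.randn_like(x0)                                     # epsilon ~ N(0, I)
    x_t = a_t ** 0.5 * x0 + (1 - a_t) ** 0.5 * eps                 # Eq. (4)
    return ((eps_model(x_t, t) - eps) ** 2).mean()                 # ||eps_theta - eps||^2
```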
3 VARIATIONAL INFERENCE FOR NON-MARKOVIAN FORWARD PROCESSES

Because the generative model approximates the reverse of the inference process, we need to rethink the inference process in order to reduce the number of iterations required by the generative model. Our key observation is that the DDPM objective in the form of $L_\gamma$ only depends on the marginals[2] $q(x_t \mid x_0)$, but not directly on the joint $q(x_{1:T} \mid x_0)$. Since there are many inference distributions (joints) with the same marginals, we explore alternative inference processes that are non-Markovian, which leads to new generative processes (Figure 1, right). These non-Markovian inference processes lead to the same surrogate objective function as DDPM, as we show below. In Appendix A, we show that the non-Markovian perspective also applies beyond the Gaussian case.

[2] We slightly abuse this term (as well as "joints") when only conditioned on $x_0$.

3.1 NON-MARKOVIAN FORWARD PROCESSES

Let us consider a family $\mathcal{Q}$ of inference distributions, indexed by a real vector $\sigma \in \mathbb{R}_{\geq 0}^{T}$:

$$q_\sigma(x_{1:T} \mid x_0) := q_\sigma(x_T \mid x_0) \prod_{t=2}^{T} q_\sigma(x_{t-1} \mid x_t, x_0) \tag{6}$$

where $q_\sigma(x_T \mid x_0) = \mathcal{N}(\sqrt{\alpha_T}\, x_0, (1 - \alpha_T) I)$ and for all $t > 1$,

$$q_\sigma(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(\sqrt{\alpha_{t-1}}\, x_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{x_t - \sqrt{\alpha_t}\, x_0}{\sqrt{1 - \alpha_t}},\ \sigma_t^2 I\right). \tag{7}$$

The mean function is chosen in order to ensure that $q_\sigma(x_t \mid x_0) = \mathcal{N}(\sqrt{\alpha_t}\, x_0, (1 - \alpha_t) I)$ for all $t$ (see Lemma 1 of Appendix B), so that it defines a joint inference distribution that matches the marginals as desired. The forward process[3] can be derived from Bayes' rule:

$$q_\sigma(x_t \mid x_{t-1}, x_0) = \frac{q_\sigma(x_{t-1} \mid x_t, x_0)\, q_\sigma(x_t \mid x_0)}{q_\sigma(x_{t-1} \mid x_0)}, \tag{8}$$

which is also Gaussian (although we do not use this fact for the remainder of this paper). Unlike the diffusion process in Eq. (3), the forward process here is no longer Markovian, since each $x_t$ could depend on both $x_{t-1}$ and $x_0$. The magnitude of $\sigma$ controls how stochastic the forward process is; when $\sigma \to 0$, we reach an extreme case where, as long as we observe $x_0$ and $x_t$ for some $t$, $x_{t-1}$ becomes known and fixed.

[3] We overload the term "forward process" for cases where the inference model is not a diffusion.

3.2 GENERATIVE PROCESS AND UNIFIED VARIATIONAL INFERENCE OBJECTIVE

Next, we define a trainable generative process $p_\theta(x_{0:T})$ where each $p_\theta^{(t)}(x_{t-1} \mid x_t)$ leverages knowledge of $q_\sigma(x_{t-1} \mid x_t, x_0)$. Intuitively, given a noisy observation $x_t$, we first make a prediction[4] of the corresponding $x_0$, and then use it to obtain a sample $x_{t-1}$ through the reverse conditional distribution $q_\sigma(x_{t-1} \mid x_t, x_0)$, which we have defined.

[4] Learning a distribution over the predictions is also possible, but empirically we found little benefit from it.

For some $x_0 \sim q(x_0)$ and $\epsilon_t \sim \mathcal{N}(0, I)$, $x_t$ can be obtained using Eq. (4). The model $\epsilon_\theta^{(t)}(x_t)$ then attempts to predict $\epsilon_t$ from $x_t$, without knowledge of $x_0$. By rewriting Eq. (4), one can then predict the denoised observation, which is a prediction of $x_0$ given $x_t$:

$$f_\theta^{(t)}(x_t) := \left(x_t - \sqrt{1 - \alpha_t} \cdot \epsilon_\theta^{(t)}(x_t)\right) / \sqrt{\alpha_t}. \tag{9}$$

We can then define the generative process with a fixed prior $p_\theta(x_T) = \mathcal{N}(0, I)$ and

$$p_\theta^{(t)}(x_{t-1} \mid x_t) = \begin{cases} \mathcal{N}(f_\theta^{(1)}(x_1), \sigma_1^2 I) & \text{if } t = 1 \\ q_\sigma(x_{t-1} \mid x_t, f_\theta^{(t)}(x_t)) & \text{otherwise,} \end{cases} \tag{10}$$

where $q_\sigma(x_{t-1} \mid x_t, f_\theta^{(t)}(x_t))$ is defined as in Eq. (7) with $x_0$ replaced by $f_\theta^{(t)}(x_t)$. We add some Gaussian noise (with covariance $\sigma_1^2 I$) for the case of $t = 1$ to ensure that the generative process is supported everywhere.

We optimize $\theta$ via the following variational inference objective (which is a functional over $\epsilon_\theta$):

$$J_\sigma(\epsilon_\theta) := \mathbb{E}_{x_{0:T} \sim q_\sigma(x_{0:T})}\left[\log q_\sigma(x_{1:T} \mid x_0) - \log p_\theta(x_{0:T})\right] \tag{11}$$
$$= \mathbb{E}_{x_{0:T} \sim q_\sigma(x_{0:T})}\left[\log q_\sigma(x_T \mid x_0) + \sum_{t=2}^{T} \log q_\sigma(x_{t-1} \mid x_t, x_0) - \sum_{t=1}^{T} \log p_\theta^{(t)}(x_{t-1} \mid x_t) - \log p_\theta(x_T)\right]$$

where we factorize $q_\sigma(x_{1:T} \mid x_0)$ according to Eq. (6) and $p_\theta(x_{0:T})$ according to Eq. (1).

From the definition of $J_\sigma$, it would appear that a different model has to be trained for every choice of $\sigma$, since it corresponds to a different variational objective (and a different generative process). However, $J_\sigma$ is equivalent to $L_\gamma$ for certain weights $\gamma$, as we show below.

Theorem 1. For all $\sigma > 0$, there exist $\gamma \in \mathbb{R}_{>0}^{T}$ and $C \in \mathbb{R}$, such that $J_\sigma = L_\gamma + C$.

The variational objective $L_\gamma$ is special in the sense that if the parameters $\theta$ of the models $\epsilon_\theta^{(t)}$ are not shared across different $t$, then the optimal solution for $\epsilon_\theta$ will not depend on the weights $\gamma$ (as the global optimum is achieved by separately maximizing each term in the sum). This property of $L_\gamma$ has two implications. On the one hand, it justifies the use of $L_{\mathbf{1}}$ as a surrogate objective function for the variational lower bound in DDPMs; on the other hand, since $J_\sigma$ is equivalent to some $L_\gamma$ by Theorem 1, the optimal solution of $J_\sigma$ is also the same as that of $L_{\mathbf{1}}$. Therefore, if parameters are not shared across $t$ in the model $\epsilon_\theta$, then the $L_{\mathbf{1}}$ objective used by Ho et al. (2020) can be used as a surrogate objective for the variational objective $J_\sigma$ as well.
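The marginal-matching property behind Eq. (7) is easy to verify numerically. The following small sanity check (our addition, not from the paper) draws $x_t \sim q(x_t \mid x_0)$, then $x_{t-1}$ from Eq. (7), and confirms that the result matches $\mathcal{N}(\sqrt{\alpha_{t-1}}\, x_0, (1 - \alpha_{t-1}) I)$, as Lemma 1 states; the constants are arbitrary.

```python
# Monte-Carlo check of Lemma 1: composing q(x_t|x_0) with Eq. (7) should give
# x_{t-1} ~ N(sqrt(alpha_{t-1}) x_0, (1 - alpha_{t-1}) I) for any valid sigma_t.
import torch

torch.manual_seed(0)
a_t, a_prev, sigma = 0.5, 0.7, 0.1      # alpha_t < alpha_{t-1}; sigma_t^2 <= 1 - alpha_{t-1}
x0, n = 2.0, 1_000_000
x_t = a_t ** 0.5 * x0 + (1 - a_t) ** 0.5 * torch.randn(n)             # x_t ~ q(x_t | x_0)
mean = (a_prev ** 0.5 * x0
        + (1 - a_prev - sigma ** 2) ** 0.5 * (x_t - a_t ** 0.5 * x0) / (1 - a_t) ** 0.5)
x_prev = mean + sigma * torch.randn(n)                                 # x_{t-1} ~ Eq. (7)
print(x_prev.mean().item())   # ~ sqrt(0.7) * 2 = 1.6733
print(x_prev.var().item())    # ~ 1 - 0.7       = 0.3
```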
4 SAMPLING FROM GENERALIZED GENERATIVE PROCESSES

With $L_{\mathbf{1}}$ as the objective, we are not only learning a generative process for the Markovian inference process considered in Sohl-Dickstein et al. (2015) and Ho et al. (2020), but also generative processes for the many non-Markovian forward processes parametrized by $\sigma$ that we have described. Therefore, we can essentially use pretrained DDPM models as the solutions to the new objectives, and focus on finding a generative process that is better at producing samples subject to our needs by changing $\sigma$.

Figure 2: Graphical model for accelerated generation, where $\tau = [1, 3]$.

4.1 DENOISING DIFFUSION IMPLICIT MODELS

From $p_\theta(x_{1:T})$ in Eq. (10), one can generate a sample $x_{t-1}$ from a sample $x_t$ via:

$$x_{t-1} = \sqrt{\alpha_{t-1}} \underbrace{\left(\frac{x_t - \sqrt{1 - \alpha_t}\, \epsilon_\theta^{(t)}(x_t)}{\sqrt{\alpha_t}}\right)}_{\text{predicted } x_0} + \underbrace{\sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \epsilon_\theta^{(t)}(x_t)}_{\text{direction pointing to } x_t} + \underbrace{\sigma_t \epsilon_t}_{\text{random noise}} \tag{12}$$

where $\epsilon_t \sim \mathcal{N}(0, I)$ is standard Gaussian noise independent of $x_t$, and we define $\alpha_0 := 1$. Different choices of $\sigma$ values result in different generative processes, all while using the same model $\epsilon_\theta$, so re-training the model is unnecessary. When $\sigma_t = \sqrt{(1 - \alpha_{t-1})/(1 - \alpha_t)} \sqrt{1 - \alpha_t/\alpha_{t-1}}$ for all $t$, the forward process becomes Markovian, and the generative process becomes a DDPM.

We note another special case, when $\sigma_t = 0$ for all $t$[5]: the forward process becomes deterministic given $x_{t-1}$ and $x_0$, except for $t = 1$; in the generative process, the coefficient before the random noise $\epsilon_t$ becomes zero. The resulting model becomes an implicit probabilistic model (Mohamed & Lakshminarayanan, 2016), where samples are generated from latent variables with a fixed procedure (from $x_T$ to $x_0$). We name this the denoising diffusion implicit model (DDIM, pronounced /d:Im/), because it is an implicit probabilistic model trained with the DDPM objective (despite the forward process no longer being a diffusion).

[5] Although this case is not covered in Theorem 1, we can always approximate it by making $\sigma_t$ very small.
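Eq. (12) translates directly into code. Below is a minimal sketch of one generalized sampling step, again with the hypothetical `eps_model` standing in for $\epsilon_\theta^{(t)}$; setting `sigma_t = 0` gives the deterministic DDIM update, while the Markovian choice of $\sigma_t$ above recovers a DDPM step.

```python
# One generalized reverse step per Eq. (12). a_t and a_prev are alpha_t and
# alpha_{t-1} (with alpha_0 := 1); sigma_t interpolates between DDIM (0) and DDPM.
import torch

@torch.no_grad()
def generalized_step(eps_model, x_t, t, a_t, a_prev, sigma_t):
    eps = eps_model(x_t, t)
    x0_pred = (x_t - (1 - a_t) ** 0.5 * eps) / a_t ** 0.5     # "predicted x_0"
    dir_xt = (1 - a_prev - sigma_t ** 2) ** 0.5 * eps         # "direction pointing to x_t"
    noise = sigma_t * torch.randn_like(x_t)                   # vanishes when sigma_t = 0
    return a_prev ** 0.5 * x0_pred + dir_xt + noise
```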
4.2 ACCELERATED GENERATION PROCESSES

In the previous sections, the generative process is considered as the approximation to the reverse of the forward process; since the forward process has $T$ steps, the generative process is also forced to sample $T$ steps. However, as the denoising objective $L_{\mathbf{1}}$ does not depend on the specific forward procedure as long as $q_\sigma(x_t \mid x_0)$ is fixed, we may also consider forward processes with lengths smaller than $T$, which accelerates the corresponding generative processes without having to train a different model.

Let us consider the forward process as defined not on all the latent variables $x_{1:T}$, but on a subset $\{x_{\tau_1}, \ldots, x_{\tau_S}\}$, where $\tau$ is an increasing sub-sequence of $[1, \ldots, T]$ of length $S$. In particular, we define the sequential forward process over $x_{\tau_1}, \ldots, x_{\tau_S}$ such that $q(x_{\tau_i} \mid x_0) = \mathcal{N}(\sqrt{\alpha_{\tau_i}}\, x_0, (1 - \alpha_{\tau_i}) I)$ matches the marginals (see Figure 2 for an illustration). The generative process now samples latent variables according to $\text{reversed}(\tau)$, which we term the (sampling) trajectory. When the length of the sampling trajectory is much smaller than $T$, we may achieve significant increases in computational efficiency due to the iterative nature of the sampling process.

Using a similar argument as in Section 3, we can justify using the model trained with the $L_{\mathbf{1}}$ objective, so no changes are needed in training. We show that only slight changes to the updates in Eq. (12) are needed to obtain the new, faster generative processes, which apply to DDPM, DDIM, and all the generative processes considered in Eq. (10). We include these details in Appendix C.1.

In principle, this means that we can train a model with an arbitrary number of forward steps but only sample from some of them in the generative process. Therefore, the trained model could consider many more steps than what is considered in Ho et al. (2020), or even a continuous time variable $t$ (Chen et al., 2020). We leave empirical investigations of this aspect as future work.

4.3 RELEVANCE TO NEURAL ODES

We can rewrite the DDIM iterate according to Eq. (12), and its similarity to Euler integration for solving ODEs becomes more apparent:

$$\frac{x_{t-1}}{\sqrt{\alpha_{t-1}}} = \frac{x_t}{\sqrt{\alpha_t}} + \left(\sqrt{\frac{1 - \alpha_{t-1}}{\alpha_{t-1}}} - \sqrt{\frac{1 - \alpha_t}{\alpha_t}}\right) \epsilon_\theta^{(t)}(x_t) \tag{13}$$

We can reparameterize $\sqrt{(1 - \alpha)/\alpha}$ with $\lambda$ and $x/\sqrt{\alpha}$ with $H(\lambda)$; then sampling $x_0$ with Equation (13) can be treated as integration over the following ODE:

$$x_0 = \int_{M}^{0} \epsilon_\theta^{(\lambda)}\!\left(\frac{H(\lambda)}{\sqrt{\lambda^2 + 1}}\right) d\lambda + H(M), \quad H(M) \sim \mathcal{N}(0, M^2 I) \tag{14}$$

for some very large $M$ (which corresponds to the case of $\alpha \approx 0$). This suggests that with enough discretization steps $T$, we can also reverse the generation process (going from $t = 0$ to $T$), which encodes $x_0$ to $x_T$ and simulates the reverse of the ODE in Eq. (14). Consequently, unlike DDPM, we can use DDIM to obtain encodings of the observations (in the form of $x_T$), which might be useful for downstream applications that require latent representations of a model.

5 EXPERIMENTS

In this section, we show that DDIMs outperform DDPMs in terms of image generation when fewer iterations are considered, giving speed-ups of 10× to 100× over the original DDPM generation process. Moreover, unlike DDPMs, once the initial latent variables $x_T$ are fixed, DDIMs retain high-level image features regardless of the generation trajectory, so they are able to perform interpolation directly in the latent space. DDIMs can also encode samples into latent codes from which they can be reconstructed, which DDPMs cannot do due to the stochastic sampling process.

For each dataset, we use the same trained model with $T = 1000$ and the objective $L_\gamma$ from Eq. (5) with $\gamma = \mathbf{1}$; as we argued in Section 3, no changes are needed with regards to the training procedure. The only change we make is in how we produce samples from the model; we achieve this by controlling $\tau$ (which controls how fast the samples are obtained) and $\sigma$ (which interpolates between the deterministic DDIM and the stochastic DDPM).

We consider different sub-sequences $\tau$ of $[1, \ldots, T]$ and different variance hyperparameters $\sigma$ indexed by elements of $\tau$. To simplify comparisons, we consider $\sigma$ of the form:

$$\sigma_{\tau_i}(\eta) = \eta \sqrt{(1 - \alpha_{\tau_{i-1}})/(1 - \alpha_{\tau_i})} \sqrt{1 - \alpha_{\tau_i}/\alpha_{\tau_{i-1}}}, \tag{15}$$

where $\eta \in \mathbb{R}_{\geq 0}$ is a hyperparameter that we can directly control. This recovers the original DDPM generative process when $\eta = 1$ and DDIM when $\eta = 0$. We also consider a DDPM where the random noise has a larger standard deviation than $\sigma(1)$, which we denote as $\hat{\sigma}$: $\hat{\sigma}_{\tau_i} = \sqrt{1 - \alpha_{\tau_i}/\alpha_{\tau_{i-1}}}$. This is used by the implementation in Ho et al. (2020) only to obtain the CIFAR10 samples, but not samples of the other datasets. We include more details in Appendix D.
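Putting Sections 4.1–4.2 and Eq. (15) together, a complete accelerated sampler is only a short loop. The sketch below is illustrative rather than the authors' implementation; `eps_model` and the `alphas` lookup (with `alphas[t]` returning $\alpha_t$) are assumptions, and `taus` is any increasing sub-sequence of timesteps.

```python
# Accelerated sampling along a sub-sequence tau (Section 4.2) with the sigma
# parameterization of Eq. (15): eta = 0 is DDIM, eta = 1 is a DDPM-like process.
import torch

@torch.no_grad()
def sample(eps_model, x_T, alphas, taus, eta=0.0):
    """Generate x_0 from x_T; `taus` holds increasing timesteps, alphas[t] = alpha_t."""
    x = x_T
    for i in reversed(range(len(taus))):
        t = taus[i]
        a_t = alphas[t]
        a_prev = alphas[taus[i - 1]] if i > 0 else torch.tensor(1.0)  # alpha_0 := 1
        sigma = (eta * ((1 - a_prev) / (1 - a_t)) ** 0.5
                     * (1 - a_t / a_prev) ** 0.5)                     # Eq. (15)
        eps = eps_model(x, t)
        x0_pred = (x - (1 - a_t) ** 0.5 * eps) / a_t ** 0.5
        x = (a_prev ** 0.5 * x0_pred
             + (1 - a_prev - sigma ** 2) ** 0.5 * eps
             + sigma * torch.randn_like(x))
    return x
```

At the final step ($\alpha_0 = 1$, so $\sigma = 0$), the update returns the predicted $x_0$ directly.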
5.1 SAMPLE QUALITY AND EFFICIENCY

In Table 1, we report the quality of the samples generated by models trained on CIFAR10 and CelebA, as measured by Fréchet Inception Distance (FID (Heusel et al., 2017)), where we vary the number of timesteps used to generate a sample ($\dim(\tau)$) and the stochasticity of the process ($\eta$). As expected, the sample quality becomes higher as we increase $\dim(\tau)$, presenting a tradeoff between sample quality and computational cost. We observe that DDIM ($\eta = 0$) achieves the best sample quality when $\dim(\tau)$ is small, and DDPM ($\eta = 1$ and $\hat{\sigma}$) typically has worse sample quality compared to its less stochastic counterparts with the same $\dim(\tau)$, except for the case of $\dim(\tau) = 1000$ and $\hat{\sigma}$ reported by Ho et al. (2020), where DDIM is marginally worse. However, the sample quality of $\hat{\sigma}$ becomes much worse for smaller $\dim(\tau)$, which suggests that it is ill-suited for shorter trajectories. DDIM, on the other hand, achieves high sample quality much more consistently.

Table 1: CIFAR10 and CelebA image generation measured in FID. $\eta = 1.0$ and $\hat{\sigma}$ are cases of DDPM (although Ho et al. (2020) only considered $T = 1000$ steps, and $S < T$ can be seen as simulating DDPMs trained with $S$ steps), and $\eta = 0.0$ indicates DDIM.

                 CIFAR10 (32×32)                         CelebA (64×64)
S                10      20      50     100    1000      10      20      50     100    1000
η = 0.0          13.36   6.84    4.67   4.16   4.04      17.33   13.73   9.17   6.53   3.51
η = 0.2          14.04   7.11    4.77   4.25   4.09      17.66   14.11   9.51   6.79   3.64
η = 0.5          16.66   8.35    5.25   4.46   4.29      19.86   16.06   11.01  8.09   4.28
η = 1.0          41.07   18.36   8.01   5.78   4.73      33.12   26.03   18.48  13.93  5.98
σ̂               367.43  133.37  32.72  9.99   3.17      299.71  183.83  71.71  45.20  3.26

Figure 3: CIFAR10 and CelebA samples with $\dim(\tau) = 10$ and $\dim(\tau) = 100$.

In Figure 3, we show CIFAR10 and CelebA samples with the same number of sampling steps and varying $\sigma$. For the DDPM, the sample quality deteriorates rapidly when the sampling trajectory has 10 steps. For the case of $\hat{\sigma}$, the generated images seem to have more noisy perturbations under short trajectories; this explains why the FID scores are much worse than those of the other methods, as FID is very sensitive to such perturbations (as discussed in Jolicoeur-Martineau et al. (2020)).

In Figure 4, we show that the amount of time needed to produce a sample scales linearly with the length of the sample trajectory. This suggests that DDIM is useful for producing samples more efficiently, as samples can be generated in far fewer steps. Notably, DDIM is able to produce samples with quality comparable to 1000-step models within 20 to 100 steps, which is a 10× to 50× speed-up compared to the original DDPM. Even though DDPM could also achieve reasonable sample quality with 100 steps, DDIM requires far fewer steps to achieve this; on CelebA, the FID score of the 100-step DDPM is similar to that of the 20-step DDIM.

5.2 SAMPLE CONSISTENCY IN DDIMS

For DDIM, the generative process is deterministic, and $x_0$ depends only on the initial state $x_T$. In Figure 5, we observe the generated images under different generative trajectories (i.e., different $\tau$) while starting with the same initial $x_T$. Interestingly, for the generated images with the same initial $x_T$, most high-level features are similar, regardless of the generative trajectory. In many cases, samples generated with only 20 steps are already very similar to ones generated with 1000 steps in terms of high-level features, with only minor differences in details. Therefore, it would appear that $x_T$ alone is an informative latent encoding of the image, and that minor details that affect sample quality are encoded in the parameters, as longer sample trajectories give better quality samples but do not significantly affect the high-level features. We show more samples in Appendix D.4.
Figure 4: Hours to sample 50k images with one Nvidia 2080 Ti GPU, and samples at different steps.

Figure 5: Samples from DDIM with the same random $x_T$ and different numbers of steps.

5.3 INTERPOLATION IN DETERMINISTIC GENERATIVE PROCESSES

Figure 6: Interpolation of samples from DDIM with $\dim(\tau) = 50$.

Since the high-level features of a DDIM sample are encoded by $x_T$, we are interested in whether it exhibits a semantic interpolation effect similar to that observed in other implicit probabilistic models, such as GANs (Goodfellow et al., 2014). This is different from the interpolation procedure in Ho et al. (2020), since in DDPM the same $x_T$ would lead to highly diverse $x_0$ due to the stochastic generative process[6]. In Figure 6, we show that simple interpolations in $x_T$ can lead to semantically meaningful interpolations between two samples. We include more details and samples in Appendix D.5. This allows DDIM to control the generated images on a high level directly through the latent variables, which DDPMs cannot.

[6] Although it might be possible if one interpolates all $T$ noises, as is done in Song & Ermon (2020).

5.4 RECONSTRUCTION FROM LATENT SPACE

As DDIM is the Euler integration of a particular ODE, it is natural to ask whether it can encode from $x_0$ to $x_T$ (the reverse of Eq. (14)) and reconstruct $x_0$ from the resulting $x_T$ (the forward of Eq. (14))[7]. We consider encoding and decoding on the CIFAR-10 test set with the CIFAR-10 model, using $S$ steps for both encoding and decoding; we report the per-dimension mean squared error (scaled to $[0, 1]$) in Table 2. Our results show that DDIMs have lower reconstruction error for larger $S$ values, and have properties similar to Neural ODEs and normalizing flows. The same cannot be said for DDPMs, due to their stochastic nature.

[7] Since $x_T$ and $x_0$ have the same dimensions, their compression qualities are not our immediate concern.

Table 2: Reconstruction error with DDIM on the CIFAR-10 test set, rounded to $10^{-4}$.

S        10      20      50      100     200     500     1000
Error    0.014   0.0065  0.0023  0.0009  0.0004  0.0001  0.0001
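A minimal sketch of the encoding direction is given below: it runs the deterministic update of Eq. (12) (with $\sigma_t = 0$) toward increasing $t$, reusing the current noise prediction for each Euler step of Eq. (13). This is an illustration under assumptions rather than the authors' code; `eps_model` and the `alphas` lookup are hypothetical as before, and the handling of the first step (where the model is queried at the smallest noise level) is one common convention, not prescribed by the paper.

```python
# DDIM encoding (x_0 -> x_T), i.e. the reverse of Eq. (14), via Euler steps of
# Eq. (13) in the increasing-t direction along the sub-sequence `taus`.
import torch

@torch.no_grad()
def ddim_encode(eps_model, x0, alphas, taus):
    x = x0
    steps = [0] + list(taus)                     # alpha_0 := 1 at the clean image
    for s, t in zip(steps[:-1], steps[1:]):
        a_s = alphas[s] if s > 0 else torch.tensor(1.0)
        a_t = alphas[t]
        eps = eps_model(x, max(s, 1))            # evaluate at the current noise level
        x0_pred = (x - (1 - a_s) ** 0.5 * eps) / a_s ** 0.5   # Eq. (9) at level s
        x = a_t ** 0.5 * x0_pred + (1 - a_t) ** 0.5 * eps     # deterministic step to x_t
    return x
```

Decoding the resulting $x_T$ with the deterministic sampler of Section 4.2 then approximately reconstructs $x_0$, with the error shrinking as $S$ grows (Table 2).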
6 RELATED WORK

Our work is based on a large family of existing methods that learn generative models as transition operators of Markov chains (Sohl-Dickstein et al., 2015; Bengio et al., 2014; Salimans et al., 2014; Song et al., 2017; Goyal et al., 2017; Levy et al., 2017). Among them, denoising diffusion probabilistic models (DDPMs, Ho et al. (2020)) and noise conditional score networks (NCSNs, Song & Ermon (2019; 2020)) have recently achieved high sample quality comparable to GANs (Brock et al., 2018; Karras et al., 2018). DDPMs optimize a variational lower bound to the log-likelihood, whereas NCSNs optimize the score matching objective (Hyvärinen, 2005) over a nonparametric Parzen density estimator of the data (Vincent, 2011; Raphan & Simoncelli, 2011).

Despite their different motivations, DDPMs and NCSNs are closely related. Both use a denoising autoencoder objective for many noise levels, and both use a procedure similar to Langevin dynamics to produce samples (Neal et al., 2011). Since Langevin dynamics is a discretization of a gradient flow (Jordan et al., 1998), both DDPM and NCSN require many steps to achieve good sample quality. This aligns with the observation that DDPM and existing NCSN methods have trouble generating high-quality samples in a few iterations.

DDIM, on the other hand, is an implicit generative model (Mohamed & Lakshminarayanan, 2016) where samples are uniquely determined by the latent variables. Hence, DDIM has certain properties that resemble GANs (Goodfellow et al., 2014) and invertible flows (Dinh et al., 2016), such as the ability to produce semantically meaningful interpolations. We derive DDIM from a purely variational perspective, where the restrictions of Langevin dynamics are not relevant; this could partially explain why we are able to observe superior sample quality compared to DDPM under fewer iterations. The sampling procedure of DDIM is also reminiscent of neural networks with continuous depth (Chen et al., 2018; Grathwohl et al., 2018), since the samples it produces from the same latent variable have similar high-level visual features, regardless of the specific sample trajectory.

7 DISCUSSION

We have presented DDIMs, implicit generative models trained with denoising auto-encoding / score matching objectives, derived from a purely variational perspective. DDIMs are able to generate high-quality samples much more efficiently than existing DDPMs and NCSNs, with the ability to perform meaningful interpolations in the latent space. The non-Markovian forward processes presented here seem to suggest continuous forward processes other than Gaussian ones (which cannot be done in the original diffusion framework, since the Gaussian is the only stable distribution with finite variance). We also demonstrated a discrete case with a multinomial forward process in Appendix A, and it would be interesting to investigate similar alternatives for other combinatorial structures.

Moreover, since the sampling procedure of DDIMs is similar to that of a neural ODE, it would be interesting to see if methods that decrease the discretization error in ODEs, including multistep methods such as Adams–Bashforth (Butcher & Goodwin, 2008), could be helpful for further improving sample quality in fewer steps (Queiruga et al., 2020). It is also relevant to investigate whether DDIMs exhibit other properties of existing implicit models (Bau et al., 2019).

REFERENCES

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, January 2017.

David Bau, Jun-Yan Zhu, Jonas Wulff, William Peebles, Hendrik Strobelt, Bolei Zhou, and Antonio Torralba. Seeing what a GAN cannot generate. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4502–4511, 2019.

Yoshua Bengio, Eric Laufer, Guillaume Alain, and Jason Yosinski. Deep generative stochastic networks trainable by backprop. In International Conference on Machine Learning, pp. 226–234, January 2014.

Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, September 2018.

John Charles Butcher and Nicolette Goodwin. Numerical Methods for Ordinary Differential Equations, volume 2. Wiley Online Library, 2008.
Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713, September 2020.

Ricky T Q Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. arXiv preprint arXiv:1806.07366, June 2018.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, May 2016.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Anirudh Goyal, Nan Rosemary Ke, Surya Ganguli, and Yoshua Bengio. Variational walkback: Learning a transition operator as a stochastic recurrent net. In Advances in Neural Information Processing Systems, pp. 4392–4402, 2017.

Will Grathwohl, Ricky T Q Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, October 2018.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5769–5779, 2017.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. arXiv preprint arXiv:1706.08500, June 2017.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, June 2020.

Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709, 2005.

Alexia Jolicoeur-Martineau, Rémi Piché-Taillefer, Rémi Tachet des Combes, and Ioannis Mitliagkas. Adversarial score matching and improved sampling for image generation. September 2020.

Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, December 2018.

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119, 2020.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114v10, December 2013.

Daniel Levy, Matthew D Hoffman, and Jascha Sohl-Dickstein. Generalizing Hamiltonian Monte Carlo with neural networks. arXiv preprint arXiv:1711.09268, 2017.

Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, October 2016.

Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

Alejandro F Queiruga, N Benjamin Erichson, Dane Taylor, and Michael W Mahoney. Continuous-in-depth neural networks. arXiv preprint arXiv:2008.02389, 2020.

Martin Raphan and Eero P Simoncelli. Least squares estimation without priors or supervision. Neural Computation, 23(2):374–420, February 2011. ISSN 0899-7667, 1530-888X.
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, May 2015.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer, 2015.

Tim Salimans, Diederik P Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. arXiv preprint arXiv:1410.6460, October 2014.

Ken Shoemake. Animating rotation with quaternion curves. In Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques, pp. 245–254, 1985.

Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585, March 2015.

Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-NICE-MC: Adversarial training for MCMC. arXiv preprint arXiv:1706.07561, June 2017.

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. arXiv preprint arXiv:1907.05600, July 2019.

Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. arXiv preprint arXiv:2006.09011, June 2020.

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, September 2016a.

Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, January 2016b.

Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, May 2016.

Shengjia Zhao, Hongyu Ren, Arianna Yuan, Jiaming Song, Noah Goodman, and Stefano Ermon. Bias and generalization in deep generative models: An empirical study. In Advances in Neural Information Processing Systems, pp. 10792–10801, 2018.

A NON-MARKOVIAN FORWARD PROCESSES FOR A DISCRETE CASE

In this section, we describe a non-Markovian forward process for discrete data and the corresponding variational objectives. Since the focus of this paper is to accelerate reverse models corresponding to the Gaussian diffusion, we leave empirical evaluations as future work.

For a categorical observation $x_0$ that is a one-hot vector with $K$ possible values, we define the forward process as follows. First, we have $q(x_t \mid x_0)$ as the following categorical distribution:

$$q(x_t \mid x_0) = \text{Cat}(\alpha_t x_0 + (1 - \alpha_t) \mathbf{1}_K) \tag{16}$$

where $\mathbf{1}_K \in \mathbb{R}^K$ is a vector with all entries being $1/K$, and $\alpha_t$ decreases from $\alpha_0 = 1$ for $t = 0$ to $\alpha_T = 0$ for $t = T$. Then we define $q(x_{t-1} \mid x_t, x_0)$ as the following mixture distribution:

$$q(x_{t-1} \mid x_t, x_0) = \begin{cases} \text{Cat}(x_t) & \text{with probability } \sigma_t \\ \text{Cat}(x_0) & \text{with probability } (\alpha_{t-1} - \sigma_t \alpha_t) \\ \text{Cat}(\mathbf{1}_K) & \text{with probability } (1 - \alpha_{t-1}) - (1 - \alpha_t)\sigma_t, \end{cases} \tag{17}$$

or equivalently:

$$q(x_{t-1} \mid x_t, x_0) = \text{Cat}\left(\sigma_t x_t + (\alpha_{t-1} - \sigma_t \alpha_t) x_0 + ((1 - \alpha_{t-1}) - (1 - \alpha_t)\sigma_t) \mathbf{1}_K\right), \tag{18}$$

which is consistent with how we have defined $q(x_t \mid x_0)$.
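To make the mixture concrete, the following is a small illustrative sampler for Eq. (18) (our addition; this discrete case is not empirically evaluated in the paper). It assumes one-hot inputs and a $\sigma_t$ small enough that all three mixture weights are nonnegative.

```python
# Sample x_{t-1} from the categorical reverse conditional of Eq. (18).
import torch
import torch.nn.functional as F

def sample_categorical_prev(x_t, x0, a_t, a_prev, sigma_t, K):
    """x_t, x0: one-hot tensors of shape (K,); returns a one-hot sample of x_{t-1}."""
    ones_K = torch.full((K,), 1.0 / K)                         # the uniform vector 1_K
    probs = (sigma_t * x_t
             + (a_prev - sigma_t * a_t) * x0
             + ((1 - a_prev) - (1 - a_t) * sigma_t) * ones_K)  # mixture weights of Eq. (17)
    idx = torch.multinomial(probs, 1).squeeze()
    return F.one_hot(idx, K).float()
```

As $(1 - \alpha_{t-1}) - (1 - \alpha_t)\sigma_t \to 0$, the uniform component vanishes and the sampler increasingly copies either $x_t$ or the predicted $x_0$, matching the discussion below.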
Similarly, we can define our reverse process $p_\theta(x_{t-1} \mid x_t)$ as:

$$p_\theta(x_{t-1} \mid x_t) = \text{Cat}\left(\sigma_t x_t + (\alpha_{t-1} - \sigma_t \alpha_t) f_\theta^{(t)}(x_t) + ((1 - \alpha_{t-1}) - (1 - \alpha_t)\sigma_t) \mathbf{1}_K\right), \tag{19}$$

where $f_\theta^{(t)}(x_t)$ maps $x_t$ to a $K$-dimensional probability vector. As $(1 - \alpha_{t-1}) - (1 - \alpha_t)\sigma_t \to 0$, the sampling process becomes less stochastic, in the sense that it will either choose $x_t$ or the predicted $x_0$ with high probability. The KL divergence

$$D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)) \tag{20}$$

is well-defined, and is simply the KL divergence between two categoricals. Therefore, the resulting variational objective function should be easy to optimize as well. Moreover, as the KL divergence is convex, we have the following upper bound (which is tight when the right hand side goes to zero):

$$D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)) \leq (\alpha_{t-1} - \sigma_t \alpha_t) D_{\mathrm{KL}}(\text{Cat}(x_0) \,\|\, \text{Cat}(f_\theta^{(t)}(x_t))).$$

The right hand side is simply a multi-class classification loss (up to constants), so we can arrive at similar arguments regarding how changes in $\sigma_t$ do not affect the objective (up to re-weighting).

B PROOFS

Lemma 1. For $q_\sigma(x_{1:T} \mid x_0)$ defined in Eq. (6) and $q_\sigma(x_{t-1} \mid x_t, x_0)$ defined in Eq. (7), we have:

$$q_\sigma(x_t \mid x_0) = \mathcal{N}(\sqrt{\alpha_t}\, x_0, (1 - \alpha_t) I) \tag{21}$$

Proof. Assume for any $t \leq T$ that $q_\sigma(x_t \mid x_0) = \mathcal{N}(\sqrt{\alpha_t}\, x_0, (1 - \alpha_t) I)$ holds. If we can show that

$$q_\sigma(x_{t-1} \mid x_0) = \mathcal{N}(\sqrt{\alpha_{t-1}}\, x_0, (1 - \alpha_{t-1}) I) \tag{22}$$

then we can prove the statement with an induction argument for $t$ from $T$ to 1, since the base case ($t = T$) already holds. First, we have that

$$q_\sigma(x_{t-1} \mid x_0) := \int_{x_t} q_\sigma(x_t \mid x_0)\, q_\sigma(x_{t-1} \mid x_t, x_0)\, dx_t$$

where

$$q_\sigma(x_t \mid x_0) = \mathcal{N}(\sqrt{\alpha_t}\, x_0, (1 - \alpha_t) I) \tag{23}$$
$$q_\sigma(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(\sqrt{\alpha_{t-1}}\, x_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{x_t - \sqrt{\alpha_t}\, x_0}{\sqrt{1 - \alpha_t}},\ \sigma_t^2 I\right). \tag{24}$$

From Bishop (2006), Eq. (2.115), we have that $q_\sigma(x_{t-1} \mid x_0)$ is Gaussian with

$$\mathbb{E}[q_\sigma(x_{t-1} \mid x_0)] = \sqrt{\alpha_{t-1}}\, x_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{\sqrt{\alpha_t}\, x_0 - \sqrt{\alpha_t}\, x_0}{\sqrt{1 - \alpha_t}} \tag{25}$$
$$= \sqrt{\alpha_{t-1}}\, x_0 \tag{26}$$

and

$$\mathrm{Cov}[q_\sigma(x_{t-1} \mid x_0)] = \sigma_t^2 I + \frac{1 - \alpha_{t-1} - \sigma_t^2}{1 - \alpha_t} \cdot (1 - \alpha_t) I = (1 - \alpha_{t-1}) I. \tag{27}$$

Therefore, $q_\sigma(x_{t-1} \mid x_0) = \mathcal{N}(\sqrt{\alpha_{t-1}}\, x_0, (1 - \alpha_{t-1}) I)$, which allows us to apply the induction argument.

Theorem 1. For all $\sigma > 0$, there exist $\gamma \in \mathbb{R}_{>0}^{T}$ and $C \in \mathbb{R}$, such that $J_\sigma = L_\gamma + C$.

Proof. From the definition of $J_\sigma$:

$$J_\sigma(\epsilon_\theta) := \mathbb{E}_{x_{0:T} \sim q_\sigma(x_{0:T})}\left[\log q_\sigma(x_T \mid x_0) + \sum_{t=2}^{T} \log q_\sigma(x_{t-1} \mid x_t, x_0) - \sum_{t=1}^{T} \log p_\theta^{(t)}(x_{t-1} \mid x_t)\right]$$
$$\equiv \mathbb{E}_{x_{0:T} \sim q_\sigma(x_{0:T})}\left[\sum_{t=2}^{T} D_{\mathrm{KL}}(q_\sigma(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta^{(t)}(x_{t-1} \mid x_t)) - \log p_\theta^{(1)}(x_0 \mid x_1)\right]$$

where we use $\equiv$ to denote "equal up to a value that does not depend on $\epsilon_\theta$ (but may depend on $q_\sigma$)". For $t > 1$:

$$\mathbb{E}_{x_0, x_t \sim q(x_0, x_t)}\left[D_{\mathrm{KL}}(q_\sigma(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta^{(t)}(x_{t-1} \mid x_t))\right]$$
$$= \mathbb{E}_{x_0, x_t \sim q(x_0, x_t)}\left[D_{\mathrm{KL}}(q_\sigma(x_{t-1} \mid x_t, x_0) \,\|\, q_\sigma(x_{t-1} \mid x_t, f_\theta^{(t)}(x_t)))\right]$$
$$\equiv \mathbb{E}_{x_0, x_t \sim q(x_0, x_t)}\left[\left\|x_0 - f_\theta^{(t)}(x_t)\right\|_2^2\right]$$
$$= \mathbb{E}_{x_0 \sim q(x_0),\ \epsilon \sim \mathcal{N}(0, I),\ x_t = \sqrt{\alpha_t} x_0 + \sqrt{1 - \alpha_t}\epsilon}\left[\left\|\frac{x_t - \sqrt{1 - \alpha_t}\,\epsilon}{\sqrt{\alpha_t}} - \frac{x_t - \sqrt{1 - \alpha_t}\,\epsilon_\theta^{(t)}(x_t)}{\sqrt{\alpha_t}}\right\|_2^2\right]$$
$$= \mathbb{E}_{x_0 \sim q(x_0),\ \epsilon \sim \mathcal{N}(0, I),\ x_t = \sqrt{\alpha_t} x_0 + \sqrt{1 - \alpha_t}\epsilon}\left[\frac{\left\|\epsilon - \epsilon_\theta^{(t)}(x_t)\right\|_2^2}{2d\sigma_t^2 \alpha_t}\right]$$

where $d$ is the dimension of $x_0$. For $t = 1$:

$$\mathbb{E}_{x_0, x_1 \sim q(x_0, x_1)}\left[-\log p_\theta^{(1)}(x_0 \mid x_1)\right] \equiv \mathbb{E}_{x_0, x_1 \sim q(x_0, x_1)}\left[\left\|x_0 - f_\theta^{(1)}(x_1)\right\|_2^2\right] = \mathbb{E}_{x_0 \sim q(x_0),\ \epsilon \sim \mathcal{N}(0, I),\ x_1 = \sqrt{\alpha_1} x_0 + \sqrt{1 - \alpha_1}\epsilon}\left[\frac{\left\|\epsilon - \epsilon_\theta^{(1)}(x_1)\right\|_2^2}{2d\sigma_1^2 \alpha_1}\right]$$

Therefore, when $\gamma_t = 1/(2d\sigma_t^2 \alpha_t)$ for all $t \in \{1, \ldots, T\}$, we have

$$J_\sigma(\epsilon_\theta) \equiv \sum_{t=1}^{T} \frac{1}{2d\sigma_t^2 \alpha_t} \mathbb{E}\left[\left\|\epsilon_\theta^{(t)}(x_t) - \epsilon_t\right\|_2^2\right] = L_\gamma(\epsilon_\theta) \tag{34}$$

for all $\epsilon_\theta$. From the definition of $\equiv$, we have that $J_\sigma = L_\gamma + C$.

C ADDITIONAL DERIVATIONS

C.1 ACCELERATED SAMPLING PROCESSES

In the accelerated case, we can consider the inference process to be factored as:

$$q_{\sigma,\tau}(x_{1:T} \mid x_0) = q_{\sigma,\tau}(x_{\tau_S} \mid x_0) \prod_{i=1}^{S} q_{\sigma,\tau}(x_{\tau_{i-1}} \mid x_{\tau_i}, x_0) \prod_{t \in \bar{\tau}} q_{\sigma,\tau}(x_t \mid x_0) \tag{35}$$

where $\tau$ is a sub-sequence of $[1, \ldots, T]$ of length $S$ with $\tau_S = T$, and $\bar{\tau} := \{1, \ldots, T\} \setminus \tau$ is its complement. Intuitively, the graphical model of $\{x_{\tau_i}\}_{i=1}^{S}$ and $x_0$ forms a chain, whereas the graphical model of $\{x_t\}_{t \in \bar{\tau}}$ and $x_0$ forms a star graph.
We define:

$$q_{\sigma,\tau}(x_t \mid x_0) = \mathcal{N}(\sqrt{\alpha_t}\, x_0, (1 - \alpha_t) I) \quad \forall t \in \bar{\tau} \cup \{T\} \tag{36}$$

$$q_{\sigma,\tau}(x_{\tau_{i-1}} \mid x_{\tau_i}, x_0) = \mathcal{N}\!\left(\sqrt{\alpha_{\tau_{i-1}}}\, x_0 + \sqrt{1 - \alpha_{\tau_{i-1}} - \sigma_{\tau_i}^2} \cdot \frac{x_{\tau_i} - \sqrt{\alpha_{\tau_i}}\, x_0}{\sqrt{1 - \alpha_{\tau_i}}},\ \sigma_{\tau_i}^2 I\right) \quad \forall i \in [S]$$

where the coefficients are chosen such that:

$$q_{\sigma,\tau}(x_{\tau_i} \mid x_0) = \mathcal{N}(\sqrt{\alpha_{\tau_i}}\, x_0, (1 - \alpha_{\tau_i}) I) \quad \forall i \in [S], \tag{37}$$

i.e., the marginals match. The corresponding generative process is defined as:

$$p_\theta(x_{0:T}) := p_\theta(x_T) \underbrace{\prod_{i=1}^{S} p_\theta^{(\tau_i)}(x_{\tau_{i-1}} \mid x_{\tau_i})}_{\text{used to produce samples}} \times \underbrace{\prod_{t \in \bar{\tau}} p_\theta^{(t)}(x_0 \mid x_t)}_{\text{only in variational objective}} \tag{38}$$

where only part of the models are actually being used to produce samples. The conditionals are:

$$p_\theta^{(\tau_i)}(x_{\tau_{i-1}} \mid x_{\tau_i}) = q_{\sigma,\tau}(x_{\tau_{i-1}} \mid x_{\tau_i}, f_\theta^{(\tau_i)}(x_{\tau_i})) \quad \text{if } i \in [S], i > 1 \tag{39}$$
$$p_\theta^{(t)}(x_0 \mid x_t) = \mathcal{N}(f_\theta^{(t)}(x_t), \sigma_t^2 I) \quad \text{otherwise,} \tag{40}$$

where we leverage $q_{\sigma,\tau}(x_{\tau_{i-1}} \mid x_{\tau_i}, x_0)$ as part of the inference process (similar to what we have done in Section 3). The resulting variational objective becomes (defining $x_{\tau_{S+1}} := \varnothing$ for conciseness):

$$J(\epsilon_\theta) = \mathbb{E}_{x_{0:T} \sim q_{\sigma,\tau}(x_{0:T})}\left[\log q_{\sigma,\tau}(x_{1:T} \mid x_0) - \log p_\theta(x_{0:T})\right] \tag{41}$$
$$\equiv \mathbb{E}_{x_{0:T} \sim q_{\sigma,\tau}(x_{0:T})}\left[\sum_{t \in \bar{\tau}} D_{\mathrm{KL}}(q_{\sigma,\tau}(x_t \mid x_0) \,\|\, p_\theta^{(t)}(x_0 \mid x_t)) + \sum_{i=1}^{S} D_{\mathrm{KL}}(q_{\sigma,\tau}(x_{\tau_{i-1}} \mid x_{\tau_i}, x_0) \,\|\, p_\theta^{(\tau_i)}(x_{\tau_{i-1}} \mid x_{\tau_i}))\right] \tag{42}$$

where each KL divergence is between two Gaussians with variance independent of $\theta$. A similar argument to the proof used in Theorem 1 can show that the variational objective $J$ can also be converted to an objective of the form $L_\gamma$.

C.2 DERIVATION OF DENOISING OBJECTIVES FOR DDPMS

We note that in Ho et al. (2020), a diffusion hyperparameter $\beta_t$ is first introduced, and then the relevant variables $\alpha_t^{\mathrm{Ho}} := 1 - \beta_t$ and $\bar{\alpha}_t^{\mathrm{Ho}} := \prod_{s=1}^{t} \alpha_s^{\mathrm{Ho}}$ are defined[8]. In this paper, we have used the notation $\alpha_t$ to represent the variable $\bar{\alpha}_t^{\mathrm{Ho}}$ in Ho et al. (2020) for three reasons. First, it makes it more clear that we only need to choose one set of hyperparameters, reducing possible cross-references of the derived variables. Second, it allows us to introduce the generalization as well as the acceleration case more easily, because the inference process is no longer motivated by a diffusion. Third, there exists an isomorphism between $\alpha_{1:T}$ and $1, \ldots, T$, which is not the case for $\beta_t$.

[8] In this section we mark notations used in Ho et al. (2020) with the superscript "Ho" (colored teal in the original paper).

In this section, we use $\beta_t$ and $\alpha_t^{\mathrm{Ho}}$ to be more consistent with the derivation in Ho et al. (2020), where

$$\alpha_t^{\mathrm{Ho}} = \frac{\alpha_t}{\alpha_{t-1}} \tag{43}$$
$$\beta_t = 1 - \frac{\alpha_t}{\alpha_{t-1}} \tag{44}$$

can be uniquely determined from $\alpha_t$ (i.e., $\bar{\alpha}_t^{\mathrm{Ho}}$). First, from the diffusion forward process:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\Bigg(\underbrace{\frac{\sqrt{\alpha_{t-1}}\, \beta_t}{1 - \alpha_t}\, x_0 + \frac{\sqrt{\alpha_t^{\mathrm{Ho}}}\,(1 - \alpha_{t-1})}{1 - \alpha_t}\, x_t}_{\tilde{\mu}(x_t, x_0)},\ \tilde{\beta}_t I\Bigg)$$

Ho et al. (2020) considered a specific type of $p_\theta^{(t)}(x_{t-1} \mid x_t)$:

$$p_\theta^{(t)}(x_{t-1} \mid x_t) = \mathcal{N}(\mu_\theta(x_t, t), \sigma_t^2 I) \tag{45}$$

which leads to the following variational objective:

$$L := \mathbb{E}_{x_{0:T} \sim q(x_{0:T})}\left[\log q(x_T \mid x_0) + \sum_{t=2}^{T} \log q(x_{t-1} \mid x_t, x_0) - \sum_{t=1}^{T} \log p_\theta^{(t)}(x_{t-1} \mid x_t)\right]$$
$$\equiv \mathbb{E}_{x_{0:T} \sim q(x_{0:T})}\Bigg[\underbrace{\sum_{t=2}^{T} D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta^{(t)}(x_{t-1} \mid x_t))}_{L_{t-1}} - \log p_\theta^{(1)}(x_0 \mid x_1)\Bigg]$$

One can write:

$$L_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\left\|\mu_\theta(x_t, t) - \tilde{\mu}(x_t, x_0)\right\|_2^2\right]$$

Ho et al. (2020) chose the parametrization

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t^{\mathrm{Ho}}}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \alpha_t}}\, \epsilon_\theta(x_t, t)\right)$$

which can be simplified to:

$$L_{t-1} = \mathbb{E}_{x_0, \epsilon}\left[\frac{\beta_t^2}{2\sigma_t^2 (1 - \alpha_t)\, \alpha_t^{\mathrm{Ho}}}\left\|\epsilon - \epsilon_\theta(\sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t}\, \epsilon, t)\right\|_2^2\right]$$
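As a small illustration of this correspondence (our addition, under the notation above), converting between the two hyperparameterizations is a one-liner in each direction:

```python
# Convert between Ho et al. (2020)'s beta_t and this paper's alpha_t
# (their alpha-bar_t), per Eqs. (43)-(44).
import torch

def alphas_from_betas(betas):
    """betas: tensor (T,) of beta_1..beta_T -> this paper's alpha_1..alpha_T."""
    return torch.cumprod(1.0 - betas, dim=0)

def betas_from_alphas(alphas):
    """Inverse map, using beta_t = 1 - alpha_t / alpha_{t-1} with alpha_0 := 1."""
    alpha_prev = torch.cat([torch.ones(1), alphas[:-1]])
    return 1.0 - alphas / alpha_prev
```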
D EXPERIMENTAL DETAILS

D.1 DATASETS AND ARCHITECTURES

We consider 4 image datasets with various resolutions: CIFAR10 (32×32, unconditional), CelebA (64×64), LSUN Bedroom (256×256) and LSUN Church (256×256). For all datasets, we set the hyperparameters $\alpha$ according to the heuristic in Ho et al. (2020) to make the results directly comparable. We use the same model for each dataset, and only compare the performance of different generative processes. For CIFAR10, Bedroom and Church, we obtain the pretrained checkpoints from the original DDPM implementation; for CelebA, we trained our own model using the denoising objective $L_{\mathbf{1}}$.

Our architecture for $\epsilon_\theta^{(t)}(x_t)$ follows that in Ho et al. (2020), which is a U-Net (Ronneberger et al., 2015) based on a Wide ResNet (Zagoruyko & Komodakis, 2016). We use the pretrained models from Ho et al. (2020) for CIFAR10, Bedroom and Church, and train our own model for CelebA at 64×64 (since a pretrained model is not provided). Our CelebA model has five feature map resolutions from 64×64 to 4×4, and we use the original CelebA dataset (not CelebA-HQ) with the pre-processing technique from the StyleGAN (Karras et al., 2018) repository.

Table 3: LSUN Bedroom and Church image generation results, measured in FID. For the 1000-step DDPM, the FIDs are 6.36 for Bedroom and 7.89 for Church.

                    Bedroom (256×256)             Church (256×256)
dim(τ)              10      20      50     100    10      20      50     100
DDIM (η = 0.0)      16.95   8.89    6.75   6.62   19.45   12.47   10.84  10.58
DDPM (η = 1.0)      42.78   22.77   10.81  6.81   51.56   23.37   11.16  8.27

D.2 REVERSE PROCESS SUB-SEQUENCE SELECTION

We consider two types of selection procedures for $\tau$, given the desired $\dim(\tau) < T$:

Linear: we select the timesteps such that $\tau_i = \lfloor ci \rfloor$ for some $c$;
Quadratic: we select the timesteps such that $\tau_i = \lfloor ci^2 \rfloor$ for some $c$.

The constant value $c$ is selected such that the last element of $\tau$ is close to $T$. We used quadratic for CIFAR10 and linear for the remaining datasets. These choices achieve slightly better FID than their alternatives on the respective datasets.

D.3 CLOSED FORM EQUATIONS FOR EACH SAMPLING STEP

From the general sampling equation in Eq. (12), we have the following update equation:

$$x_{\tau_{i-1}}(\eta) = \sqrt{\alpha_{\tau_{i-1}}} \left(\frac{x_{\tau_i} - \sqrt{1 - \alpha_{\tau_i}}\, \epsilon_\theta^{(\tau_i)}(x_{\tau_i})}{\sqrt{\alpha_{\tau_i}}}\right) + \sqrt{1 - \alpha_{\tau_{i-1}} - \sigma_{\tau_i}(\eta)^2}\, \epsilon_\theta^{(\tau_i)}(x_{\tau_i}) + \sigma_{\tau_i}(\eta)\, \epsilon$$

For the case of $\hat{\sigma}$ (DDPM with a larger variance), the update equation becomes:

$$x_{\tau_{i-1}} = \sqrt{\alpha_{\tau_{i-1}}} \left(\frac{x_{\tau_i} - \sqrt{1 - \alpha_{\tau_i}}\, \epsilon_\theta^{(\tau_i)}(x_{\tau_i})}{\sqrt{\alpha_{\tau_i}}}\right) + \sqrt{1 - \alpha_{\tau_{i-1}} - \sigma_{\tau_i}(1)^2}\, \epsilon_\theta^{(\tau_i)}(x_{\tau_i}) + \hat{\sigma}_{\tau_i}\, \epsilon$$

which uses a different coefficient for $\epsilon$ compared with the update for $\eta = 1$, but uses the same coefficients for the non-stochastic parts. This update is more stochastic than the update for $\eta = 1$, which explains why it achieves worse performance when $\dim(\tau)$ is small.

D.4 SAMPLES AND CONSISTENCY

We show more samples in Figure 7 (CIFAR10), Figure 8 (CelebA) and Figure 10 (Church), and consistency results of DDIM in Figure 9 (CelebA).

D.5 INTERPOLATION

To generate interpolations on a line, we randomly sample two initial $x_T$ values from the standard Gaussian, interpolate them with spherical linear interpolation (Shoemake, 1985), and then use the DDIM to obtain $x_0$ samples:

$$x_T^{(\alpha)} = \frac{\sin((1 - \alpha)\theta)}{\sin(\theta)}\, x_T^{(0)} + \frac{\sin(\alpha\theta)}{\sin(\theta)}\, x_T^{(1)} \tag{50}$$

where $\theta = \arccos\left(\frac{\langle x_T^{(0)}, x_T^{(1)} \rangle}{\|x_T^{(0)}\|\, \|x_T^{(1)}\|}\right)$. These values are used to produce DDIM samples.

To generate interpolations on a grid, we sample four latent variables and separate them into two pairs; then we use slerp within each pair under the same $\alpha$, and use slerp over the interpolated samples across the pairs (under an independently chosen interpolation coefficient). We show more grid interpolation results in Figure 11 (CelebA), Figure 12 (Bedroom), and Figure 13 (Church).
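A small sketch of Eq. (50) (our transcription, not the authors' code) is given below; each interpolated latent is then decoded with the deterministic DDIM sampler ($\eta = 0$) to produce the interpolation frames.

```python
# Spherical linear interpolation (slerp) between two latents, per Eq. (50).
import torch

def slerp(x0, x1, alpha):
    """Interpolate between latent tensors x0 and x1 with coefficient alpha in [0, 1]."""
    theta = torch.arccos((x0 * x1).sum() / (x0.norm() * x1.norm()))
    return (torch.sin((1 - alpha) * theta) * x0
            + torch.sin(alpha * theta) * x1) / torch.sin(theta)
```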
Figure 7: CIFAR10 samples from 1000-step DDPM, 1000-step DDIM and 100-step DDIM.

Figure 8: CelebA samples from 1000-step DDPM, 1000-step DDIM and 100-step DDIM.

Figure 9: CelebA samples from DDIM with the same random $x_T$ and different numbers of steps.

Figure 10: Church samples from 100-step DDPM and 100-step DDIM.

Figure 11: More interpolations from the CelebA DDIM with $\dim(\tau) = 50$.

Figure 12: More interpolations from the Bedroom DDIM with $\dim(\tau) = 50$.

Figure 13: More interpolations from the Church DDIM with $\dim(\tau) = 50$.