DIFFENC: VARIATIONAL DIFFUSION WITH A LEARNED ENCODER

Beatrix M. G. Nielsen,1 Anders Christensen,1,2 Andrea Dittadi,2,4 Ole Winther1,3,5
1Technical University of Denmark, 2Helmholtz AI, Munich, 3University of Copenhagen, 4Max Planck Institute for Intelligent Systems, 5Copenhagen University Hospital

Published as a conference paper at ICLR 2024

ABSTRACT

Diffusion models may be viewed as hierarchical variational autoencoders (VAEs) with two improvements: parameter sharing for the conditional distributions in the generative process and efficient computation of the loss as independent terms over the hierarchy. We consider two changes to the diffusion model that retain these advantages while adding flexibility to the model. Firstly, we introduce a data- and depth-dependent mean function in the diffusion process, which leads to a modified diffusion loss. Our proposed framework, DiffEnc, achieves a statistically significant improvement in likelihood on CIFAR-10. Secondly, we let the ratio of the noise variance of the reverse encoder process and the generative process be a free weight parameter rather than being fixed to 1. This leads to theoretical insights: for a finite-depth hierarchy, the evidence lower bound (ELBO) can be used as an objective for a weighted diffusion loss approach and for optimizing the noise schedule specifically for inference. For the infinite-depth hierarchy, on the other hand, the weight parameter has to be 1 to have a well-defined ELBO.

1 INTRODUCTION

Figure 1: Overview of DiffEnc compared to standard diffusion models: a standard, pre-defined diffusion process versus DiffEnc's learned, depth-dependent encoder, which is discarded after training. The effect of the encoding has been amplified 5x for the sake of illustration.

Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) are versatile generative models that have risen to prominence in recent years thanks to their state-of-the-art performance in the generation of images (Dhariwal & Nichol, 2021; Karras et al., 2022), video (Ho et al., 2022; Höppe et al., 2022; Harvey et al., 2022), speech (Kong et al., 2020; Jeong et al., 2021; Chen et al., 2020), and music (Huang et al., 2023; Schneider et al., 2023). In particular, in image generation, diffusion models are state of the art both in terms of visual quality (Karras et al., 2022; Kim et al., 2022a; Zheng et al., 2022; Hoogeboom et al., 2023; Kingma & Gua, 2023; Lou & Ermon, 2023) and density estimation (Kingma et al., 2021; Nichol & Dhariwal, 2021; Song et al., 2021).

Diffusion models can be seen as a time-indexed hierarchy over latent variables generated sequentially, conditioning only on the latent vector from the previous step. As such, diffusion models can be understood as hierarchical variational autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014; Sønderby et al., 2016) with three restrictions: (1) the forward diffusion process (the inference model in variational inference) is fixed and remarkably simple; (2) the generative model is Markovian: each (time-indexed) layer of latent variables is generated conditioning only on the previous layer; (3) parameter sharing: all steps of the generative model share the same parameters.

Figure 2: Changes induced by the encoder on the encoded image at different timesteps: (x_t − x_s)/(t − s) for t = 0.4, 0.6, 0.8, 1.0 and s = t − 0.1.
Changes have been summed over the channels, with red and blue denoting positive and negative changes, respectively. For t close to 1, global properties such as the approximate position of objects are encoded, whereas for smaller t the changes are more fine-grained and tend to enhance contrast within objects and/or between object and background.

The simplicity of the forward process (1) and the Markov property of the generative model (2) allow the evidence lower bound (ELBO) to be expressed as an expectation over the layers of random variables, i.e., an expectation over time from the stochastic process perspective. Thanks to the heavy parameter sharing in the generative model (3), this expectation can be estimated effectively with a single Monte Carlo sample. These properties make diffusion models highly scalable and flexible, despite the constraints discussed above.

In this work, we relax assumption (1) to improve the flexibility of diffusion models while retaining their scalability. Specifically, we shift away from assuming a fixed, constant diffusion process, while still maintaining sufficient simplicity to express the ELBO as an expectation over time. We introduce a time-dependent encoder that parameterizes the mean of the diffusion process: instead of the original image x, the learned denoising model is tasked with predicting x_t, the encoded image at time t. Crucially, this encoder is employed exclusively during training and is not used during sampling. As a result, the proposed class of diffusion models, DiffEnc, is more flexible than standard diffusion models without affecting sampling time. To arrive at the negative log-likelihood loss for DiffEnc, Eq. (18), we first show how a time-dependent encoder can be introduced into the diffusion process and how this adds an extra term to the loss if we keep the usual expression for the mean in the generative model (Section 3). We then show how this extra term can be countered by a suitable parameterization of the encoder (Section 4). We conduct experiments on MNIST, CIFAR-10 and ImageNet32 with two different parameterizations of the encoder and find that, with a trainable encoder, DiffEnc improves total likelihood on CIFAR-10 and improves the latent loss on all datasets without degrading the diffusion loss. We observe that the changes to x_t differ markedly between early and late timesteps, demonstrating the non-trivial, time-dependent behavior of the encoder (see Fig. 2).

In addition, we investigate relaxing a common assumption in diffusion models: that the variance of the generative process, σ_P², is equal to the variance of the reverse formulation of the forward diffusion process, σ_Q². This introduces an additional term in the diffusion loss, which can be interpreted as a weighted loss (with time-dependent weights w_t). We then analytically derive the optimal σ_P². While this is relevant when training in discrete time (i.e., with a finite number of layers) or when sampling, we prove that in the continuous-time limit the ELBO is maximized when the variances are equal (in fact, the ELBO diverges if the variances are not equal).

Our main contributions can be summarized as follows:
- We define a new, more powerful class of diffusion models, named DiffEnc, by introducing a time-dependent encoder in the diffusion process. This encoder improves the flexibility of diffusion models but does not affect sampling time, as it is only needed during training.
- We analyse the assumption of forward and backward variances being equal, and prove that (1) by relaxing this assumption, the diffusion loss can be interpreted as a weighted loss, and (2) in continuous time, the optimal ELBO is achieved when the variances are equal; in fact, if the variances are not equal in continuous time, the ELBO is not well-defined.
- We perform extensive density estimation experiments and show that DiffEnc achieves a statistically significant improvement in likelihood on CIFAR-10.

The paper is organized as follows: in Section 2 we introduce the notation and framework from Variational Diffusion Models (VDM; Kingma et al., 2021); in Section 3 we derive the general formulation of DiffEnc by introducing a depth-dependent encoder; in Section 4 we introduce the encoder parameterizations used in our experiments and modify the generative model to account for the change in the diffusion loss due to the encoder; in Section 5 we present our experimental results.

2 PRELIMINARIES ON VARIATIONAL DIFFUSION MODELS

We begin by introducing the VDM formulation (Kingma et al., 2021) of diffusion models. We define a hierarchical generative model with T + 1 layers of latent variables:

$$p_\theta(x, z) = p(x \mid z_0)\, p(z_1) \prod_{i=1}^{T} p_\theta\big(z_{s(i)} \mid z_{t(i)}\big) \qquad (1)$$

with x ∈ X a data point, θ the model parameters, s(i) = (i − 1)/T, t(i) = i/T, and p(z_1) = N(0, I). In the following, we will drop the index i and assume 0 ≤ s < t ≤ 1. We define a diffusion process q with marginal distribution

$$q(z_t \mid x) = \mathcal{N}(\alpha_t x,\, \sigma_t^2 I) \qquad (2)$$

where t ∈ [0, 1] is the time index and α_t, σ_t are positive scalar functions of t. Requiring Eq. (2) to hold for any s and t, the conditionals turn out to be

$$q(z_t \mid z_s) = \mathcal{N}\big(\alpha_{t|s} z_s,\, \sigma_{t|s}^2 I\big), \qquad \alpha_{t|s} = \frac{\alpha_t}{\alpha_s}, \qquad \sigma_{t|s}^2 = \sigma_t^2 - \alpha_{t|s}^2 \sigma_s^2 .$$

Using Bayes' rule, we can reverse the direction of the diffusion process:

$$q(z_s \mid z_t, x) = \mathcal{N}(\mu_Q, \sigma_Q^2 I) \qquad (3)$$

$$\sigma_Q^2 = \frac{\sigma_{t|s}^2 \sigma_s^2}{\sigma_t^2}, \qquad \mu_Q = \frac{\alpha_{t|s}\sigma_s^2}{\sigma_t^2}\, z_t + \frac{\alpha_s \sigma_{t|s}^2}{\sigma_t^2}\, x . \qquad (4)$$

We can now express the diffusion process in a way that mirrors the generative model in Eq. (1):

$$q(z \mid x) = q(z_1 \mid x) \prod_{i=1}^{T} q\big(z_{s(i)} \mid z_{t(i)}, x\big) \qquad (5)$$

and we can define one step of the generative process in the same functional form as Eq. (3):

$$p_\theta(z_s \mid z_t) = \mathcal{N}(\mu_P, \sigma_P^2 I), \qquad \mu_P = \frac{\alpha_{t|s}\sigma_s^2}{\sigma_t^2}\, z_t + \frac{\alpha_s \sigma_{t|s}^2}{\sigma_t^2}\, \hat{x}_\theta(z_t, t) , \qquad (6)$$

where x̂_θ is a learned model with parameters θ. In a diffusion model, the denoising variance σ_P² is usually chosen to be equal to the reverse diffusion process variance, σ_P² = σ_Q². While initially we do not make this assumption, we will prove it to be optimal in the continuous-time limit. Following VDM, we parameterize the noise schedule through the signal-to-noise ratio (SNR), SNR(t) ≡ α_t²/σ_t², and its logarithm, λ_t ≡ log SNR(t). We will use the variance-preserving formulation in all our experiments: α_t² = 1 − σ_t² = sigmoid(λ_t).

The evidence lower bound (ELBO) of the model defined above is

$$\log p_\theta(x) \ge \mathbb{E}_{q(z|x)}\left[\log \frac{p_\theta(x \mid z)\, p_\theta(z)}{q(z \mid x)}\right] .$$

The loss L ≡ −ELBO is the sum of a reconstruction (L_0), diffusion (L_T), and latent (L_1) loss:

$$\mathcal{L} = \mathcal{L}_0 + \mathcal{L}_T + \mathcal{L}_1, \qquad \mathcal{L}_0 = -\mathbb{E}_{q(z_0|x)}\big[\log p(x \mid z_0)\big], \qquad \mathcal{L}_1 = D_{\mathrm{KL}}\big(q(z_1 \mid x)\,\|\, p(z_1)\big),$$

where the expressions for L_0 and L_1 are derived in Appendix D.
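To make these preliminaries concrete, here is a minimal NumPy sketch of the variance-preserving schedule, the forward marginal of Eq. (2), and the reverse-posterior parameters of Eqs. (3)-(4). It is an illustration under assumed settings, not the paper's implementation: the linear log-SNR endpoints and the helper names (`log_snr`, `sample_zt`, `reverse_posterior_params`) are our own.

```python
import numpy as np

LMAX, LMIN = 10.0, -10.0  # illustrative log-SNR endpoints, not the paper's tuned values

def log_snr(t):
    """Linear log-SNR schedule: lambda_t = LMAX - (LMAX - LMIN) * t."""
    return LMAX - (LMAX - LMIN) * t

def alpha_sigma(t):
    """Variance-preserving schedule: alpha_t^2 = sigmoid(lambda_t), sigma_t^2 = 1 - alpha_t^2."""
    alpha2 = 1.0 / (1.0 + np.exp(-log_snr(t)))
    return np.sqrt(alpha2), np.sqrt(1.0 - alpha2)

def sample_zt(x, t, rng):
    """Draw z_t ~ q(z_t | x) = N(alpha_t x, sigma_t^2 I), Eq. (2)."""
    alpha_t, sigma_t = alpha_sigma(t)
    return alpha_t * x + sigma_t * rng.standard_normal(x.shape)

def reverse_posterior_params(zt, x, s, t):
    """Mean and variance of q(z_s | z_t, x), Eqs. (3)-(4)."""
    alpha_s, sigma_s = alpha_sigma(s)
    alpha_t, sigma_t = alpha_sigma(t)
    alpha_ts = alpha_t / alpha_s
    sigma2_ts = sigma_t**2 - alpha_ts**2 * sigma_s**2
    var_q = sigma2_ts * sigma_s**2 / sigma_t**2
    mu_q = (alpha_ts * sigma_s**2 / sigma_t**2) * zt + (alpha_s * sigma2_ts / sigma_t**2) * x
    return mu_q, var_q

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(3, 32, 32))   # a toy "image" scaled to [-1, 1]
zt = sample_zt(x, t=0.7, rng=rng)
mu_q, var_q = reverse_posterior_params(zt, x, s=0.6, t=0.7)
print(zt.shape, mu_q.shape, round(float(var_q), 6))
```

Replacing x with a depth-dependent encoding x_t inside `sample_zt` is exactly the change introduced in the next section.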
Thanks to the matching factorization of the generative and reverse noise processes (see Eqs. (1) and (5)) and the availability of q(z_t | x) in closed form (because q is Markov and Gaussian), the diffusion loss L_T can be written as a sum or as an expectation over the layers of random variables:

$$\mathcal{L}_T(x) = \sum_{i=1}^{T} \mathbb{E}_{q(z_{t(i)}|x)}\, D_{\mathrm{KL}}\big(q(z_{s(i)} \mid z_{t(i)}, x)\,\|\, p_\theta(z_{s(i)} \mid z_{t(i)})\big) \qquad (7)$$

$$\phantom{\mathcal{L}_T(x)} = T\, \mathbb{E}_{i \sim U\{1,T\},\, q(z_{t(i)}|x)}\, D_{\mathrm{KL}}\big(q(z_{s(i)} \mid z_{t(i)}, x)\,\|\, p_\theta(z_{s(i)} \mid z_{t(i)})\big) , \qquad (8)$$

where U{1, T} is the uniform distribution over the indices 1 through T. Since all distributions are Gaussian, the KL divergence has a closed-form expression (see Appendix E):

$$D_{\mathrm{KL}}\big(q(z_s \mid z_t, x)\,\|\, p_\theta(z_s \mid z_t)\big) = \frac{d}{2}\big(w_t - 1 - \log w_t\big) + \frac{w_t}{2\sigma_Q^2}\,\|\mu_P - \mu_Q\|_2^2 , \qquad (9)$$

where the first term and the factor w_t capture the difference arising from using σ_P² ≠ σ_Q² instead of σ_P² = σ_Q², we have defined the weighting function w_t = σ_{Q,t}²/σ_{P,t}², and the dependency of σ_{Q,t}² and σ_{P,t}² on s is left implicit, since the step size t − s = 1/T is fixed. The optimal generative variance can be computed in closed form (see Appendix F):

$$\sigma_P^2 = \sigma_Q^2 + \frac{1}{d}\, \mathbb{E}_{q(x, z_t)}\, \|\mu_P - \mu_Q\|_2^2 .$$

The main component of DiffEnc is the time-dependent encoder, which we will define as x_t ≡ x_φ(λ_t), where x_φ(λ_t) is some function with parameters φ that depends on x and on t through λ_t ≡ log SNR(t). The generalized version of Eq. (2) is then

$$q(z_t \mid x) = \mathcal{N}(\alpha_t x_t,\, \sigma_t^2 I) . \qquad (10)$$

Fig. 1 visualizes this change to the diffusion process, and a diagram is provided in Appendix A. Requiring that the process is consistent upon marginalization, i.e., $q(z_t \mid x) = \int q(z_t \mid z_s, x)\, q(z_s \mid x)\, dz_s$, leads to the following conditional distributions (see Appendix B):

$$q(z_t \mid z_s, x) = \mathcal{N}\big(\alpha_{t|s} z_s + \alpha_t (x_t - x_s),\, \sigma_{t|s}^2 I\big) , \qquad (11)$$

where an additional mean-shift term is introduced by the depth-dependent encoder. As in Section 2, we can derive the reverse process (see Appendix C):

$$q(z_s \mid z_t, x) = \mathcal{N}(\mu_Q, \sigma_Q^2 I) \qquad (12)$$

$$\mu_Q = \frac{\alpha_{t|s}\sigma_s^2}{\sigma_t^2}\, z_t + \frac{\alpha_s \sigma_{t|s}^2}{\sigma_t^2}\, x_t + \alpha_s (x_s - x_t) \qquad (13)$$

with σ_Q² given by Eq. (4). We show how we parameterize the encoder in Section 4.

Infinite-depth limit. Kingma et al. (2021) derived the continuous-time limit of the diffusion loss, that is, the loss in the limit T → ∞. We can extend that result to our case. Using μ_Q from Eq. (13) and μ_P from Eq. (6), the KL divergence in the unweighted case, i.e., (1/(2σ_Q²)) ‖μ_P − μ_Q‖²₂, can be rewritten in the following way, as shown in Appendix G:

$$\frac{1}{2\sigma_Q^2}\,\|\mu_P - \mu_Q\|_2^2 = -\frac{\Delta\mathrm{SNR}}{2}\,\left\| \hat{x}_\theta(z_t, t) - x_\varphi(\lambda_t) - \frac{\mathrm{SNR}(s)}{\Delta\mathrm{SNR}}\, \Delta x \right\|_2^2 ,$$

where Δx ≡ x_φ(λ_t) − x_φ(λ_s) and similarly ΔSNR ≡ SNR(t) − SNR(s). In Appendix G, we also show that, as T → ∞, the expression for the optimal σ_P tends to σ_Q and the additional term in the diffusion loss arising from allowing σ_P² ≠ σ_Q² tends to 0. This result is in accordance with prior work on variational approaches to stochastic processes (Archambeau et al., 2007). We have shown that, in the continuous limit, the ELBO has to be an unweighted loss (in the sense that w_t = 1). In the remainder of the paper, we will use the continuous formulation and thus set w_t = 1. It is of interest to consider optimized weighted losses for a finite number of layers; however, we leave this for future research.

The infinite-depth limit of the diffusion loss, $\mathcal{L}_\infty(x) \equiv \lim_{T \to \infty} \mathcal{L}_T(x)$, becomes (Appendix G):

$$\mathcal{L}_\infty(x) = -\frac{1}{2}\, \mathbb{E}_{t \sim U(0,1)}\, \mathbb{E}_{q(z_t|x)}\left[ \frac{d\,\mathrm{SNR}(t)}{dt}\, \left\| \hat{x}_\theta(z_t, t) - x_\varphi(\lambda_t) - \frac{d x_\varphi(\lambda_t)}{d\lambda_t} \right\|_2^2 \right] . \qquad (14)$$

L_∞(x) is thus very similar to the standard continuous-time diffusion loss from VDM, though with an additional gradient term stemming from the mean shift. In Section 4, we will develop a modified generative model to counter this extra term. In Appendix H, we derive the stochastic differential equation (SDE) describing the generative model of DiffEnc in the infinite-depth limit.
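As a minimal sketch of the change introduced in this section, the snippet below draws z_t from the encoder-dependent marginal of Eq. (10) and evaluates the mean-shifted reverse posterior of Eqs. (12)-(13). For concreteness it plugs in the non-trainable encoder x_t = α_t² x of Eq. (16) (defined in the next section) as a stand-in for x_φ(λ_t); the schedule endpoints and helper names are our own illustrative choices, not the paper's code.

```python
import numpy as np

LMAX, LMIN = 10.0, -10.0  # illustrative log-SNR endpoints

def alpha_sigma(t):
    """Variance-preserving schedule derived from a linear log-SNR."""
    lam = LMAX - (LMAX - LMIN) * t
    alpha2 = 1.0 / (1.0 + np.exp(-lam))
    return np.sqrt(alpha2), np.sqrt(1.0 - alpha2)

def encoder(x, t):
    """Placeholder depth-dependent encoder x_t = x_phi(lambda_t).
    Here: the non-trainable choice x_t = alpha_t^2 x of Eq. (16); the trainable
    encoder of Eq. (15) would add a learned term sigma_t^2 * y_phi(x, lambda_t)."""
    alpha_t, _ = alpha_sigma(t)
    return alpha_t**2 * x

def sample_zt(x, t, rng):
    """z_t ~ q(z_t | x) = N(alpha_t x_t, sigma_t^2 I), Eq. (10)."""
    alpha_t, sigma_t = alpha_sigma(t)
    return alpha_t * encoder(x, t) + sigma_t * rng.standard_normal(x.shape)

def reverse_posterior_mean(zt, x, s, t):
    """mu_Q of q(z_s | z_t, x), Eq. (13), including the mean-shift term alpha_s (x_s - x_t)."""
    alpha_s, sigma_s = alpha_sigma(s)
    alpha_t, sigma_t = alpha_sigma(t)
    alpha_ts = alpha_t / alpha_s
    sigma2_ts = sigma_t**2 - alpha_ts**2 * sigma_s**2
    xt, xs = encoder(x, t), encoder(x, s)
    return ((alpha_ts * sigma_s**2 / sigma_t**2) * zt
            + (alpha_s * sigma2_ts / sigma_t**2) * xt
            + alpha_s * (xs - xt))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(3, 32, 32))
zt = sample_zt(x, t=0.7, rng=rng)
print(reverse_posterior_mean(zt, x, s=0.6, t=0.7).shape)
```

Setting `encoder` to the identity recovers the standard VDM posterior mean of Eq. (4), since the shift α_s(x_s − x_t) then vanishes.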
4 PARAMETERIZATION OF THE ENCODER AND GENERATIVE MODEL We now turn to the parameterization of the encoder xϕ(λt). The reconstruction and latent losses impose constraints on how the encoder should behave at the two ends of the hierarchy of latent variables: The likelihood we use is constructed such that the reconstruction loss, derived in Appendix D, is minimized when xϕ(λ0) = x. Likewise, the latent loss is minimized by xϕ(λ1) = 0. In between, for 0 < t < 1, a non-trivial encoder can improve the diffusion loss. We propose two related parameterizations of the encoder: a trainable one, which we will denote by xϕ, and a simpler, non-trainable one, xnt, where nt stands for non-trainable. Let yϕ(x, λt) be a neural network with parameters ϕ, denoted yϕ(λt) for brevity. We define the trainable encoder as xϕ(λt) = (1 σ2 t )x + σ2 t yϕ(λt) = α2 tx + σ2 t yϕ(λt) (15) and the non-trainable encoder as xnt(λt) = α2 tx . (16) More motivation for these parameterizations can be found in Appendix I. The trainable encoder xϕ is initialized with yϕ(λt) = 0, so at the start of training it acts as the non-trainable encoder xnt (but differently from the VDM, which corresponds to the identity encoder). To better fit the infinite-depth diffusion loss in Eq. (14), we define a new mean, µP , of the generative model pθ(zs|zt) which is a modification of Eq. (6). Concretely, we would like to introduce a counterterm in µP that, when taking the continuous-limit, approximately counters dxϕ(λt) dλt . This term should be expressed in terms of ˆxθ(λt) rather than xϕ. For the non-trainable encoder, we have dλt = α2 tσ2 t x = σ2 t xnt(λt) . Therefore, for the non-trainable encoder, we can use σ2 t ˆxθ(λt) as an approximation of dxnt(λt) dλt . The trainable encoder is more complicated because it also contains the derivative of yϕ that we cannot as straightforwardly express in terms of ˆxθ. We therefore choose to approximate dxϕ(λt) same way as dxnt(λt) dλt . We leave it for future work to explore different strategies for approximating this gradient. Since we use the same approximation for both encoders, in the following we will write xϕ(λt) for both. With the chosen counterterm, which in the continuous limit should approximately cancel out the effect of the mean shift term in Eq. (13), the new mean, µP , is defined as: µP = αt|sσ2 s σ2 t zt + αsσ2 t|s σ2 t ˆxθ(λt) + αs(λs λt)σ2 t ˆxθ(λt) (17) Published as a conference paper at ICLR 2024 Table 1: Comparison of average bits per dimension (BPD) over 3 seeds on CIFAR-10 and Image Net32 with other work. Types of models are Continuous Flow (Flow), Variational Auto Encoders (VAE), Auto Regressive models (AR) and Diffusion models (Diff). We only compare with results achieved without data augmentation. Diff Enc with a trainable encoder improves performance of the VDM on CIFAR-10. Results on Image Net marked with are on the (Van Den Oord et al., 2016) version of Image Net which is no longer officially available. Results without are on the (Chrabaszcz et al., 2017) version of Image Net, which is from the official Image Net website. Results from (Zheng et al., 2023) are without importance sampling, since importance sampling could also be added to our approach. Model Type CIFAR-10 Image Net 32 32 Flow Matching OT (Lipman et al., 2022) Flow 2.99 3.53 Stochastic Int. 
(Albergo & Vanden-Eijnden, 2022) Flow 2.99 3.48 NVAE (Vahdat & Kautz, 2020) VAE 2.91 3.92 Image Transformer (Parmar et al., 2018) AR 2.90 3.77 VDVAE (Child, 2020) VAE 2.87 3.80 Score Flow (Song et al., 2021) Diff 2.83 3.76 Sparse Transformer (Child et al., 2019) AR 2.80 Reflected Diffusion Models (Lou & Ermon, 2023) Diff 2.68 3.74 VDM (Kingma et al., 2021) (10M steps) Diff 2.65 3.72 ARDM (Hoogeboom et al., 2021) AR 2.64 Flow Matching TN (Zheng et al., 2023) Flow 2.60 3.45 Our experiments (8M and 1.5M steps, 3 seed avg) VDM with v-parameterization Diff 2.64 3.46 Diff Enc Trainable (ours) Diff 2.62 3.46 Similarly to above, we derive the infinite-depth diffusion loss when the encoder is parameterized by Eq. (15) by taking the limit of LT for T (see Appendix J): 2Eϵ,t U[0,1] ˆxθ(zt, λt) + σ2 t ˆxθ(zt, λt) xϕ(λt) dxϕ(λt) where zt = αtxt + σtϵ with ϵ N(0, I). v-parameterization. In our experiments we use the v-prediction parameterization (Salimans & Ho, 2022) for our loss, which means that for the trainable encoder we use the loss 2Eϵ,t U[0,1] vt ˆvθ + σt ˆxθ(λt) xϕ(λt) + yϕ(λt) dyϕ(λt) and for the non-trainable encoder, we use 2Eϵ,t U[0,1] h λ tα2 t vt ˆvθ + σt (ˆxθ(λt) xϕ(λt)) 2 2 i . (20) Derivations of Eqs. (19) and (20) are in Appendix K. We note that when using the v-parametrization, as t tends to 0, the loss becomes the same as for the ϵ-prediction parameterization. On the other hand, when t tends to 1, the loss has a different behavior depending on the encoder: For the trainable encoder, we have that ˆvθ dyϕ(λt) dλt , suggesting that the encoder can in principle guide the diffusion model. See Appendix L for a more detailed discussion. 5 EXPERIMENTS In this section, we present our experimental setup and discuss the results. Published as a conference paper at ICLR 2024 Table 2: Comparison of the different components of the loss for Diff Enc-32-4 and VDMv-32 with fixed noise schedule on CIFAR-10. All quantities are in bits per dimension (BPD) with standard error over 3 seeds, and models are trained for 8M steps. Model Total Latent Diffusion Reconstruction VDMv-32 2.641 0.003 0.0012 0.0 2.629 0.003 0.01 (4 10 6) Diff Enc-32-4 2.620 0.006 0.0007 (3 10 6) 2.609 0.006 0.01 (4 10 6) Figure 3: Comparison of unconditional samples of models. The small model struggles to make realistic images, while the large models are significantly better, as expected. For some images, details differ between the two large models, for others they disagree on the main element of the image. An example where the models make two different cars in column 9. An example where Diff Enc-32-4 makes a car and VDMv-32 makes a frog in column 7. Experimental Setup. We evaluated variants of Diff Enc against a standard VDM baseline on MNIST (Le Cun et al., 1998), CIFAR-10 (Krizhevsky et al., 2009) and Image Net32 (Chrabaszcz et al., 2017). The learned prediction function is implemented as a U-Net (Ronneberger et al., 2015) consisting of convolutional Res Net blocks without any downsampling, following VDM (Kingma et al., 2021). The trainable encoder in Diff Enc is implemented with the same overall U-Net architecture, but with downsampling to resolutions 16x16 and 8x8. We will denote the models trained in our experiments by VDMv-n, Diff Enc-n-m, and Diff Enc-n-nt, where: VDMv is a VDM model with v parameterization, n and m are the number of Res Net blocks in the downsampling part of the v-prediction U-Net and of the encoder U-Net respectively, and nt indicates a non-trainable encoder for Diff Enc. 
On MNIST and CIFAR-10, we trained VDMv-8, Diff Enc-8-2, and Diff Enc-8-nt models. On CIFAR-10 we also trained Diff Enc-8-4, VDMv-32 and Diff Enc-32-4. On Image Net32, we trained VDMv-32 and Diff Enc-32-8. We used a linear log SNR noise schedule: λt = λmax (λmax λmin) t. For the large models (VDMv-32, Diff Enc-32-4 and Diff Enc-32-8), we fixed the endpoints, λmax and λmin, to the ones Kingma et al. (2021) found were optimal. For the small models (VDMv-8, Diff Enc-8-2 and Diff Enc8-nt), we also experimented with learning the SNR endpoints. We trained all our models with either 3 or 5 seeds depending on the computational cost of the experiments. See more details on model structure and training in Appendix Q and on datasets in Appendix R. Results. As we see in Table 1 the Diff Enc-32-4 model achieves a lower BPD score than previous non-flow work and the VDMv-32 on CIFAR-10. Since we do not use the encoder when sampling, this result means that the encoder is useful for learning a better generative model with higher likelihoods while sampling time is not adversely affected. We also see that VDMv-32 after 8M steps achieved a better likelihood bound, 2.64 BPD, than the result reported by Kingma et al. (2021) for the ϵ-parameterization after 10M steps, 2.65 BPD. Thus, the v-parameterization gives an improved likelihood compared to ϵ-parameterization. Table 2 shows that the difference in the total loss comes mainly from the improvement in diffusion loss for Diff Enc-32-4, which points to the encoder being helpful in the diffusion process. We provide Fig. 2, since it can be difficult to see what the encoder is doing directly from the encodings. From the heatmaps, we see that the encoder has learnt to do something different from how it was initialised and that it acts differently over t, making finer changes in earlier timesteps and more global changes in later timesteps. See Appendix W for more details. We note that the improvement in total loss is significant, since we get a p-value of 0.03 for Published as a conference paper at ICLR 2024 Table 3: Comparison of the different components of the loss for Diff Enc-8-2, Diff Enc-8-nt and VDMv-8 on CIFAR-10. All quantities are in bits per dimension (BPD), with standard error, 5 seeds, 2M steps. Noise schedules are either fixed or with trainable endpoints. Model Noise Total Latent Diffusion Reconstruction VDMv-8 fixed 2.783 0.004 0.0012 0.0 2.772 0.004 0.010 (2 10 5) trainable 2.776 0.0006 0.0033 (2 10 5) 2.770 0.0006 0.003 (5 10 5) Diff Enc-8-2 fixed 2.783 0.004 0.0006 (3 10 5) 2.772 0.004 0.010 (3 10 6) trainable 2.783 0.003 0.0034 (2 10 5) 2.777 0.003 0.003 (5 10 5) Diff Enc-8-nt fixed 2.789 0.004 (1.6 10 5) 0.0 2.779 0.004 0.010 (1 10 5) trainable 2.786 0.004 0.0009 (1 10 5) 2.782 0.004 0.003 (3 10 5) a t-test on whether the mean loss over random seeds is lower for Diff Enc-32-4 than for VDMv-32. Some samples from Diff Enc-8-2, Diff Enc-32-4, and VDMv-32 are shown in Fig. 3. More samples from Diff Enc-32-4 and VDMv-32 in Appendix T. See Fig. 4 in the appendix for examples of encoded MNIST images. Diff Enc-32-4 and VDMv-32 have similar FID scores as shown in Table 8. For all models with a trainable encoder and fixed noise schedule, we see that the diffusion loss is the same or better than the VDM baseline (see Tables 2, 3 and 4 to 7). We interpret this as the trainable encoder being able to preserve the most important signal as t 1. 
This is supported by the results we get from the non-trainable encoder, which only removes signal, where the diffusion loss is always worse than the baseline. We also see that, for a fixed noise schedule, the latent loss of the trainable encoder model is always better than the VDM. When using a fixed noise schedule, the minimal and maximal SNR is set to ensure small reconstruction and latent losses. It is therefore natural that the diffusion loss (the part dependent on how well the model can predict the noisy image), is the part that dominates the total loss. This means that a lower latent loss does not necessarily have a considerable impact on the total loss: For fixed noise schedule, the Diff Enc-8-2 models on MNIST and CIFAR-10 and the Diff Enc-32-8 model on Image Net32 all have smaller latent loss than their VDMv counterparts, but since the diffusion loss is the same, the total loss does not show a significant change. However, Lin et al. (2023) pointed out that a high latent loss might lead to poor generated samples. Therefore, it might be relevant to train a model which has a lower latent loss than another model, if it can achieve the same diffusion loss. From Tables 3 and 4, we see that this is possible using a fixed noise schedule and a small trainable encoder. For results on a larger encoder with a small model see Appendix U. We only saw an improvement in diffusion loss on the large models trained on CIFAR-10, and not on the small models. Since Image Net32 is more complex than CIFAR-10, and we did not see an improvement in diffusion loss for the models on Image Net32, a larger model might be needed on this dataset to see an improvement in diffusion loss. This would be interesting to test in future work. For the trainable noise schedule, the mean total losses of the models are all lower than or equal to their fixed-schedule counterparts. Thus, all models can make some use use of this additional flexibility. For the fixed noise schedule, the reconstruction loss is the same for all three types of models, due to how our encoder is parameterized. 6 RELATED WORK DDPM: Sohl-Dickstein et al. (2015) defined score-based diffusion models inspired by nonequilibrium thermodynamics. Ho et al. (2020) showed that diffusion models 1) are equivalent to scorebased models and 2) can be viewed as hierarchical variational autoencoders with a diffusion encoder and parameter sharing in the generative hierarchy. Song et al. (2020b) defined diffusion models using an SDE. DDPM with encoder: To the best of our knowledge, only few previous papers consider modifications of the diffusion process encoder. Implicit non-linear diffusion models (Kim et al., 2022b) use an invertible non-linear time-dependent map, h, to bring the data into a latent space where they do linear diffusion. h can be compared to our encoder, however, we do not enforce the encoder to be invertible. Blurring diffusion models (Hoogeboom & Salimans, 2022; Rissanen et al., 2022) combines the Published as a conference paper at ICLR 2024 added noise with a blurring of the image dependent on the timestep. This blurring can be seen as a Gaussian encoder with a mean which is linear in the data, but with a not necessarily iid noise. The encoder parameters are set by the heat dissipation basis (the discrete cosine transform) and time. Our encoder is a learned non-linear function of the data and time and therefore more general than blurring. Daras et al. 
(2022) propose introducing a more general linear corruption process, where both blurring and masking for example can be added before the noise. Latent diffusion (Rombach et al., 2022) uses a learned depth-independent encoder/decoder to map deterministically between the data and a learned latent space and perform the diffusion in the latent space. Abstreiter et al. (2021) and Preechakul et al. (2022) use an additional encoder that computes a small semantic representation of the image. This representation is then used as conditioning in the diffusion model and is therefore orthogonal to our work. Singhal et al. (2023) propose to learn the noising process: for zt = αtx + βtϵ, they propose to learn αt and βt. Concurrent work: Bartosh et al. (2023) also propose to add a time-dependent transformation to the data in the diffusion model. However, there is a difference in the target for the predictive function, since in our case ˆxθ predicts the transformed data, xϕ, while in their case ˆxθ predicts data x such that the transformation of x , fϕ(x , t), is equal to the transformation of the real data fϕ(x, t). This might, according to their paper, make the prediction model learn something within the data distribution even for t close to 1. Learned generative process variance: Both Nichol & Dhariwal (2021) and Dhariwal & Nichol (2021) learn the generative process variance, σP . Dhariwal & Nichol (2021) observe that it allows for sampling with fewer steps without a large drop in sample quality and Nichol & Dhariwal (2021) argue that it could have a positive effect on the likelihood. Neither of these works are in a continuoustime setting, which is the setting we derived our theoretical results for. 7 LIMITATIONS AND FUTURE WORK As shown above, adding a trained time-dependent encoder can improve the likelihood of a diffusion model, at the cost of a longer training time. Although our approach does not increase sampling time, it must be noted that sampling is still significantly slower than, e.g., for generative adversarial networks (Goodfellow et al., 2014). Techniques for more efficient sampling in diffusion models (Watson et al., 2021; Salimans & Ho, 2022; Song et al., 2020a; Lu et al., 2022; Berthelot et al., 2023; Luhman & Luhman, 2021; Liu et al., 2022) can be directly applied to our method. Introducing the trainable encoder opens up an interesting new direction for representation learning. It should be possible to distill the time-dependent transformations to get smaller time-dependent representations of the images. It would be interesting to see what such representations could tell us about the data. It would also be interesting to explore whether adding conditioning to the encoder will lead to different transformations for different classes of images. As shown in (Theis et al., 2015) likelihood and visual quality of samples are not directly linked. Thus it is important to choose the application of the model based on the metric it was trained to optimize. Since we show that our model can achieve good results when optimized to maximize likelihood, and likelihood is important in the context of semi-supervised learning, it would be interesting to use this kind of model for classification in a semi-supervised setting. 8 CONCLUSION We presented Diff Enc, a generalization of diffusion models with a time-dependent encoder in the diffusion process. Diff Enc increases the flexibility of diffusion models while retaining the same computational requirements for sampling. 
Moreover, we theoretically derived the optimal variance of the generative process and proved that, in the continuous-time limit, it must be equal to the diffusion variance for the ELBO to be well-defined. We defer the investigation of its application to sampling or discrete-time training to future work. Empirically, we showed that Diff Enc can improve likelihood on CIFAR-10, and that the data transformation learned by the encoder is non-trivially dependent on the timestep. Interesting avenues for future research include applying improvements to diffusion models that are orthogonal to our proposed method, such as latent diffusion models, model distillation, classifier-free guidance, and different sampling strategies. Published as a conference paper at ICLR 2024 ETHICS STATEMENT Since diffusion models have been shown to memorize training examples and since it is possible to extract these examples (Carlini et al., 2023), diffusion models pose a privacy and copyright risk especially if trained on data scraped from the internet. To the best of our knowledge our work neither improves nor worsens these security risks. Therefore, work still remains on how to responsibly deploy diffusion models with or without a time-dependent encoder. REPRODUCIBILITY STATEMENT The presented results are obtained using the setup described in Section 5. More details on models and training are discussed in Appendix Q. Code can be found on Git Hub1. The Readme includes a description of setting up the environment with correct versioning. Scripts are supplied for recreating all results present in the paper. The main equations behind these results are Eqs. (19) and (20), which are the diffusion losses used when including our trainable and non-trainable encoder, respectively. ACKNOWLEDGMENTS This work was supported by the Danish Pioneer Centre for AI, DNRF grant number P1, and by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:90254). OW s work was funded in part by the Novo Nordisk Foundation through the Center for Basic Machine Learning Research in Life Science (NNF20OC0062606). AC thanks the ELLIS Ph D program for support. Korbinian Abstreiter, Sarthak Mittal, Stefan Bauer, Bernhard Sch olkopf, and Arash Mehrjou. Diffusion-based representation learning. ar Xiv preprint ar Xiv:2105.14257, 2021. Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. ar Xiv preprint ar Xiv:2209.15571, 2022. Cedric Archambeau, Dan Cornford, Manfred Opper, and John Shawe-Taylor. Gaussian process approximations of stochastic differential equations. In Gaussian Processes in Practice, pp. 1 16. PMLR, 2007. Grigory Bartosh, Dmitry Vetrov, and Christian A. Naesseth. Neural diffusion models, 2023. David Berthelot, Arnaud Autef, Jierui Lin, Dian Ang Yap, Shuangfei Zhai, Siyuan Hu, Daniel Zheng, Walter Talbot, and Eric Gu. Tract: Denoising diffusion models with transitive closure time-distillation. ar Xiv preprint ar Xiv:2303.04248, 2023. Nicolas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramer, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data from diffusion models. In 32nd USENIX Security Symposium (USENIX Security 23), pp. 5253 5270, 2023. Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. ar Xiv preprint ar Xiv:2009.00713, 2020. Rewon Child. 
Very deep vaes generalize autoregressive models and can outperform them on images. ar Xiv preprint ar Xiv:2011.10650, 2020. Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. ar Xiv preprint ar Xiv:1904.10509, 2019. Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of imagenet as an alternative to the cifar datasets. ar Xiv preprint ar Xiv:1707.08819, 2017. Giannis Daras, Mauricio Delbracio, Hossein Talebi, Alexandros G Dimakis, and Peyman Milanfar. Soft diffusion: Score matching for general corruptions. ar Xiv preprint ar Xiv:2209.05442, 2022. 1https://github.com/bemigini/Diff Enc Published as a conference paper at ICLR 2024 Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in neural information processing systems, 34:8780 8794, 2021. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014. William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. Advances in Neural Information Processing Systems, 35: 27953 27965, 2022. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6840 6851. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/ 4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf. Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. ar Xiv preprint ar Xiv:2210.02303, 2022. Emiel Hoogeboom and Tim Salimans. Blurring diffusion models. ar Xiv preprint ar Xiv:2209.05557, 2022. Emiel Hoogeboom, Alexey A Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, and Tim Salimans. Autoregressive diffusion models. ar Xiv preprint ar Xiv:2110.02037, 2021. Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. ar Xiv preprint ar Xiv:2301.11093, 2023. Tobias H oppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, and Andrea Dittadi. Diffusion models for video prediction and infilling. Transactions on Machine Learning Research, 2022. Qingqing Huang, Daniel S Park, Tao Wang, Timo I Denk, Andy Ly, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, et al. Noise2music: Text-conditioned music generation with diffusion models. ar Xiv preprint ar Xiv:2302.03917, 2023. Myeonghun Jeong, Hyeongju Kim, Sung Jun Cheon, Byoung Jin Choi, and Nam Soo Kim. Diff-tts: A denoising diffusion model for text-to-speech. ar Xiv preprint ar Xiv:2104.01409, 2021. Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusionbased generative models. Advances in Neural Information Processing Systems, 35:26565 26577, 2022. Dongjun Kim, Yeongmin Kim, Wanmo Kang, and Il-Chul Moon. Refining generative process with discriminator guidance in score-based diffusion models. ar Xiv preprint ar Xiv:2211.17091, 2022a. Dongjun Kim, Byeonghu Na, Se Jung Kwon, Dongsoo Lee, Wanmo Kang, and Il-chul Moon. Maximum likelihood training of implicit nonlinear diffusion model. 
Advances in Neural Information Processing Systems, 35:32270 32284, 2022b. Diederik Kingma and Ruiqi Gua. Vdm++: Variational diffusion models for high-quality synthesis. ar Xiv preprint ar Xiv:2303.00848, 2023. URL https://arxiv.org/abs/2303.00848. Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 21696 21707. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/ b578f2a52a0229873fefc2a4b06377fa-Paper.pdf. Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. Published as a conference paper at ICLR 2024 Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. ar Xiv preprint ar Xiv:2009.09761, 2020. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. Yann Le Cun, L eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed. ar Xiv preprint ar Xiv:2305.08891, 2023. Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. ar Xiv preprint ar Xiv:2210.02747, 2022. Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. ar Xiv preprint ar Xiv:2209.03003, 2022. Aaron Lou and Stefano Ermon. Reflected diffusion models. ar Xiv preprint ar Xiv:2304.04740, 2023. Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775 5787, 2022. Eric Luhman and Troy Luhman. Knowledge distillation in iterative generative models for improved sampling speed. ar Xiv preprint ar Xiv:2101.02388, 2021. Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162 8171. PMLR, 2021. Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International conference on machine learning, pp. 4055 4064. PMLR, 2018. Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10619 10629, 2022. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pp. 1278 1286. PMLR, 2014. Severi Rissanen, Markus Heinonen, and Arno Solin. Generative modelling with inverse heat dissipation. ar Xiv preprint ar Xiv:2206.13397, 2022. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj orn Ommer. Highresolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684 10695, 2022. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 
U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp. 234 241. Springer, 2015. Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. ar Xiv preprint ar Xiv:2202.00512, 2022. Flavio Schneider, Zhijing Jin, and Bernhard Sch olkopf. Mo\ˆ usai: Text-to-music generation with long-context latent diffusion. ar Xiv preprint ar Xiv:2301.11757, 2023. Raghav Singhal, Mark Goldstein, and Rajesh Ranganath. Where to diffuse, how to diffuse, and how to get back: Automated learning for multivariate diffusions. ar Xiv preprint ar Xiv:2302.07261, 2023. Published as a conference paper at ICLR 2024 Samarth Sinha and Adji Bousso Dieng. Consistency regularization for variational auto-encoders. Advances in Neural Information Processing Systems, 34:12943 12954, 2021. Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256 2265. PMLR, 2015. Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. Advances in neural information processing systems, 29, 2016. Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020a. Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. ar Xiv preprint ar Xiv:2011.13456, 2020b. Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of scorebased diffusion models. Advances in Neural Information Processing Systems, 34:1415 1428, 2021. Lucas Theis, A aron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. ar Xiv preprint ar Xiv:1511.01844, 2015. Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical variational autoencoder. Advances in neural information processing systems, 33:19667 19679, 2020. Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287 11302, 2021. A aron Van Den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International conference on machine learning, pp. 1747 1756. PMLR, 2016. Daniel Watson, Jonathan Ho, Mohammad Norouzi, and William Chan. Learning to efficiently sample from diffusion probabilistic models. ar Xiv preprint ar Xiv:2106.03802, 2021. Guangcong Zheng, Shengming Li, Hui Wang, Taiping Yao, Yang Chen, Shouhong Ding, and Xi Li. Entropy-driven sampling and training scheme for conditional diffusion generation. In European Conference on Computer Vision, pp. 754 769. Springer, 2022. Kaiwen Zheng, Cheng Lu, Jianfei Chen, and Jun Zhu. Improved techniques for maximum likelihood estimation for diffusion odes. ar Xiv preprint ar Xiv:2305.03935, 2023. 
Published as a conference paper at ICLR 2024 Table of Contents A Overview of diffusion model with and without encoder 15 B Proof that zt given x has the correct form 15 C Proof that the reverse process has the correct form 16 D The latent and reconstruction loss 17 D.1 Latent Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 D.2 Reconstruction Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 E Diffusion loss 18 E.1 Assuming non-equal variances in the diffusion and generative processes . . . . . 18 E.2 Assuming equal variances in the diffusion and generative processes . . . . . . . 19 F Optimal variance for the generative model 19 G Diffusion loss in continuous time without counterterm 19 H Diff Enc as an SDE 22 I Motivation for choice of parameterization for the encoder 23 J Continuous-time limit of the diffusion loss with an encoder 23 J.1 Rewriting the loss using SNR . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 J.2 Taking the limit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 K Using the v Parameterization in the Diff Enc Loss 25 K.1 Rewriting the v parameterization . . . . . . . . . . . . . . . . . . . . . . . . . 25 K.2 v parameterization in continuous diffusion loss . . . . . . . . . . . . . . . . . . 26 K.3 v parameterization of continuous diffusion loss with encoder . . . . . . . . . . . 26 L Considering loss for early and late timesteps 27 M Detailed Loss Comparison for Diff Enc and VDMv on MNIST 28 N Detailed Loss Comparison for Diff Enc-32-2 and VDMv-32 on CIFAR-10 28 O Detailed Loss Comparison for Diff Enc and VDMv on Image Net32 29 P Further Future Work 29 Q Model Structure and Training 29 R Datasets 30 S Encoder examples on MNIST 30 T Samples from models 30 U Using a Larger Encoder for a Small Diffusion Model 31 V FID Scores 32 W Sum heatmap of all timesteps 32 Published as a conference paper at ICLR 2024 A OVERVIEW OF DIFFUSION MODEL WITH AND WITHOUT ENCODER The typical diffusion approach can be illustrated with the following diagram: zt ... zs ... z0 x where 0 s < t 1. We introduce an encoder fϕ : X [0, 1] Y with parameters ϕ, that maps x and a time t [0, 1] to a latent space Y. In this work, Y has the same dimensions as the original image space. For brevity, we denote the encoded data as xt fϕ(x, t). The following diagram illustrates the process including the encoder: zt ... zs ... z0 xt ... xs ... x0 x q(zt|xt) q(zs|xs) q(z0|x0) B PROOF THAT zt GIVEN x HAS THE CORRECT FORM Proof that we can write q(zt|x) = N(αtxt, σ2 t I) (21) for any t when using the definition q(z0|x0) = q(z0|x) and Eq. (11). Proof. By induction: The definition of q(z0|x0) = q(z0|x) gives us our base case. To take a step, we assume q(zs|xs) = q(zs|x) can be written as q(zs|x) = N(αsxs, σ2 s I) (22) and take a t > s. Then a sample from q(zs|x) can be written as zs = αsxs + σsϵs (23) where ϵs is from a standard normal distribution, N(0, I) and a sample from q(zt|zs, xt, xs) = q(zt|zs, x) can be written as zt = αt|szs + αt(xt xs) + σt|sϵt|s (24) where ϵt|s is from a standard normal distribution, N(0, I). Using the definition of zs, we get zt = αt|s(αsxs + σsϵs) + αt(xt xs) + σt|sϵt|s = αtxs + αt|sσsϵs + αtxt αtxs + σt|sϵt|s = αtxt + αt|sσsϵs + σt|sϵt|s Published as a conference paper at ICLR 2024 Since αt|sσsϵs and σt|sϵt|s describe two normal distributions, a sample from the sum can be written as q α2 t|sσ2s + σ2 t|sϵt (26) where ϵt is from a standard normal distribution, N(0, I). 
So we can write our sample zt as zt = αtxt + q α2 t|sσ2s + σ2 t|sϵt α2 t|sσ2s + σ2 t α2 t|sσ2sϵt = αtxt + σtϵt Thus we get q(zt|x) = N(αtxt, σ2 t I) (28) for any 0 t 1. We have defined going from x to zt as going through f. C PROOF THAT THE REVERSE PROCESS HAS THE CORRECT FORM To see that q(zs|zt, xt, xs) = N(µQ, σ2 QI) (29) σ2 Q = σ2 t|sσ2 s σ2 t (30) µQ = αt|sσ2 s σ2 t zt + αsσ2 t|s σ2 t xt + αs(xs xt) (31) is the right form for the reverse process, we take a sample zs from q(zs|zt, xt, xs) and a sample zt from q(zt|x). These have the forms: zt = αtxt + σtϵt (32) zs = αt|sσ2 s σ2 t zt + αsσ2 t|s σ2 t xt + αs(xs xt) + σ2 QϵQ (33) We show that given zt we get zs from q(zs|x) as in Eq. (10). zs = αt|sσ2 s σ2 t (αtxt + σtϵt) + αsσ2 t|s σ2 t xt + αs(xs xt) + σ2 QϵQ (34) = αtαt|sσ2 s σ2 t xt + αs σ2 t α2 t|sσ2 s σ2 t xt + αs(xs xt) + αt|sσ2 s σ2 t σtϵt + σ2 QϵQ (35) Since αt = αsαt|s we have zs = αsα2 t|sσ2 s σ2 t xt + αs σ2 t α2 t|sσ2 s σ2 t xt + αs(xs xt) + αt|sσ2 s σ2 t σtϵt + σ2 QϵQ (37) = αsα2 t|sσ2 s σ2 t xt αsα2 t|sσ2 s σ2 t xt + αsσ2 t σ2 t xt + αs(xs xt) + αt|sσ2 s σ2 t σtϵt + σ2 QϵQ (38) = αsxt + αs(xs xt) + αt|sσ2 s σ2 t σtϵt + σ2 QϵQ (39) = αsxs + αt|sσ2 s σ2 t σtϵt + σ2 QϵQ (40) Published as a conference paper at ICLR 2024 We now use that σ2 Q = σ2 t|sσ2 s σ2 t and the sum rule of variances, σ2 X+Y = σ2 X + σ2 Y + 2COV (X, Y ), where the covariance is zero since ϵt and ϵQ are independent. zs = αsxs + αt|sσ2 s σ2 t σtϵt + σ2 QϵQ (42) = αsxs + αt|sσ2 s σt ϵt + σt|sσs σ2 t + σ2 t|sσ2s σ2 t ϵs (44) σ2 t + (σ2 t α2 t|sσ2s)σ2s σ2 t ϵs (45) σ2 t σ2s σ2 t ϵs (46) = αsxs + σsϵs (47) Where ϵs is from a standard Gaussian distribution. D THE LATENT AND RECONSTRUCTION LOSS D.1 LATENT LOSS Since q(z1|x) = N(α1x1, σ2 1I) and p(z1) = N(0, I), the latent loss, DKL(q(z1|x)||p(z1)), is the KL divergence of two normal distributions. For normal distributions N0, N1 with means µ0, µ1 and variances Σ0, Σ1, the KL divergence between them is given by DKL(N0||N1) = 1 tr Σ 1 1 Σ0 d + (µ1 µ0)T Σ 1 1 (µ1 µ0) + log det Σ1 where d is the dimension. Therefore we have: DKL(q(z1|x)||p(z1)) = 1 tr σ2 1I d + ||0 α1x1||2 + log 1 det σ2 1I 2 ||α1x1||2 + d σ2 1 log σ2 1 1 (50) α2 1x2 1,i + σ2 1 log σ2 1 1 ! The last line is used in our implementation. D.2 RECONSTRUCTION LOSS The reconstruction loss is given by L0 = Eq(z0|x) [ log p(x|z0)] . (52) We make the simplifying assumption that p(x|z0) factorizes over the elements of x. Let xi be the value of the ith dimension (i.e., pixel) of x and z0,i the corresponding pixel value of z0: p(x|z0) = Y i p(xi|z0,i) . (53) In our case of images, we assume the pixel values are independent given z0 and only dependent on the matching latent component. We construct p(xi|z0,i) from the variational distribution noting that q(x|z0) = q(z0|x)q(x) Published as a conference paper at ICLR 2024 and for high enough SNR at t = 0, q(z0|x) will be very peaked around z0 = α0x. So we can choose p(xi|z0,i) q(z0,i|xi) = N(z0,i; α0xi, σ2 0) , (55) where we normalize over all possible values of xi. That is, let v {0, ..., 255} be the possible pixel values of xi, then for each v we calculate the density N(α0v, σ2 0) at z0,i and then normalise over v to get a categorical distribution p(xi|z0,i) that sums to 1. E DIFFUSION LOSS The diffusion loss is i=1 Eq(zt(i)|x)DKL(q(zs(i)|zt(i), x)||pθ(zs(i)|zt(i))) , (56) where s(i), t(i) are the values of 0 s < t 1 corresponding to the ith timestep. 
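To illustrate the discretized reconstruction likelihood of Appendix D.2, the following NumPy sketch evaluates N(z_{0,i}; α_0 v, σ_0²) at every candidate pixel value v ∈ {0, ..., 255}, normalizes over v, and reads off −log p(x_i | z_{0,i}). It works directly with integer pixel values and uses made-up schedule values for α_0 and σ_0; the actual implementation and any pixel rescaling conventions may differ.

```python
import numpy as np

def discretized_recon_logprob(z0, x_int, alpha0, sigma0):
    """log p(x_i | z_{0,i}) as in Appendix D.2: Gaussian density at each candidate
    pixel value v in {0, ..., 255}, normalized over v to give a categorical."""
    v = np.arange(256, dtype=np.float64)              # candidate pixel values
    logits = -0.5 * ((z0[..., None] - alpha0 * v) / sigma0) ** 2   # shape (num_pixels, 256)
    logits -= logits.max(axis=-1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Pick out the log-probability of the observed integer pixel value.
    return np.take_along_axis(log_probs, x_int[..., None], axis=-1)[..., 0]

rng = np.random.default_rng(0)
x_int = rng.integers(0, 256, size=(8,))               # toy observed pixels
alpha0, sigma0 = 0.9999, 0.01                         # illustrative high-SNR values at t = 0
z0 = alpha0 * x_int + sigma0 * rng.standard_normal(8)  # z_0 ~ q(z_0 | x)
recon_loss = -discretized_recon_logprob(z0, x_int, alpha0, sigma0).mean()
print(recon_loss)
```

Because the SNR at t = 0 is high (α_0 ≈ 1, σ_0 small), this categorical distribution is sharply peaked at the observed pixel value, which is why the reconstruction term contributes so little to the total loss in the experiments.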
E.1 ASSUMING NON-EQUAL VARIANCES IN THE DIFFUSION AND GENERATIVE PROCESSES In this section we let pθ(zs|zt) = N(µP , σ2 P I) (57) q(zs|zt, x) = N(µQ, σ2 QI) , (58) where we might have σP = σQ. We then get for the KL divergence, where d is the dimension, DKL(q(zs|zt, x) pθ(zs|zt)) = DKL(N(zs; µQ, σ2 QI) N(zs; µP , σ2 P I)) σ2 Q σ2 P I d + (µP µQ)T 1 σ2 P I(µP µQ) + log det σ2 P I det σ2 QI d σ2 Q σ2 P d + 1 σ2 P µP µQ 2 2 + log (σ2 P )d σ2 Q σ2 P 1 + log σ2 P σ2 Q + 1 2σ2 P µP µQ 2 2 . (59) If we define wt = σ2 Q σ2 P (60) Then we can write the KL divergence as DKL(q(zs|zt, x) pθ(zs|zt)) = d 2 (wt 1 log wt) + wt 2σ2 Q µP µQ 2 2 (61) Using our definition of µP µP = αt|sσ2 s σ2 t zt + αsσ2 t|s σ2 t ˆxθ(λt) + αs(λs λt)(σ2 t ˆxθ(λt)) (62) we can rewrite the second term of the loss as: wt σ2 Q µP µQ 2 2 (63) αsσ2 t|s σ2 t (ˆxθ(t) xϕ(λt)) + αs (λs λt)σ2 t ˆxθ(t) (xϕ(λs) xϕ(λt)) Where we have dropped the dependence on λ from our notation of ˆxθ(λt) and xϕ(λt), to make the equation fit on the page. Published as a conference paper at ICLR 2024 E.2 ASSUMING EQUAL VARIANCES IN THE DIFFUSION AND GENERATIVE PROCESSES pθ(zs|zt) = N(µP , σ2 QI) (64) q(zs|zt, x) = N(µQ, σ2 QI) (65) we get for the KL divergence DKL(q(zs|zt, x) pθ(zs|zt)) = DKL(N(zs; µQ, σ2 QI) N(zs; µP , σ2 QI)) = 1 2σ2 Q µP µQ 2 2 αsσ2 t|s σ2 t (ˆxθ(t) xϕ(λt)) + αs (λs λt)σ2 t ˆxθ(t) (xϕ(λs) xϕ(λt)) F OPTIMAL VARIANCE FOR THE GENERATIVE MODEL In this section, we compute the optimal variance σ2 P of the generative model in closed-form. Consider the expectation over the data distribution of the KL divergence in the diffusion loss (Appendix E): Eq(x,zt) [DKL(q(zs|zt, x) pθ(zs|zt))] = d σ2 Q σ2 P 1 + log σ2 P σ2 Q + 1 2σ2 P Eq(x,zt) µP µQ 2 2 and differentiate it w.r.t. σ2 P : σ2 Q σ4 P + 1 1 2σ4 P Eq(x,zt) µP µQ 2 2 (68) dσ2 P dσ2 Q Eq(x,zt) µP µQ 2 2 (69) The derivative is zero when: σ2 P = σ2 Q + 1 d Eq(x,zt) µP µQ 2 2 (70) Since the second derivative of the KL at this value of σ2 P is positive, this is a minimum of the KL divergence. G DIFFUSION LOSS IN CONTINUOUS TIME WITHOUT COUNTERTERM In this section, we consider the Diff Enc diffusion process with mean shift term, coupled with the original VDM generative process (see Section 3). We show that in the continuous-time limit the optimal variance σ2 P tends to σ2 Q and the resulting diffusion loss simplifies to the standard VDM diffusion loss. We finally derive the diffusion loss in the continuous-time limit. We start by rewriting the diffusion loss as expectation, using constant step size τ 1/T and denoting ti i/T: LT (x) = T Ei U{1,T }Eq(zti|x) [DKL(q(zti τ|zti, x) pθ(zti τ|zti))] (71) = T Et U{τ,2τ,...,1}Eq(zt|x) [DKL(q(zt τ|zt, x) pθ(zt τ|zt))] , (72) where we dropped indices and directly sample the discrete rv t. The KL divergence can be calculated in closed form (Appendix E) because all distributions are Gaussian: DKL(q(zt τ|zt, x) pθ(zt τ|zt)) = d 2 (wt 1 log wt) + wt 2σ2 Q µP µQ 2 2 , (73) Published as a conference paper at ICLR 2024 where we have defined the weighting function wt = σ2 Q,t σ2 P,t (74) Insert this in the diffusion loss: LT (x) = Et U{τ,2τ,...,1} " d 2τ (wt 1 log wt) + 1 " wt 2σ2 Q µP µQ 2 2 Given the optimal value for the noise variance in the generative model derived above (Appendix F): σ2 P = σ2 Q + 1 d Eq(x,zt) µP µQ 2 2 we get the optimal wt: w 1 t = 1 + 1 σ2 Qd Eq(x,zt) µP µQ 2 2 . 
Using the following definitions: µQ = αt|sσ2 s σ2 t zt + αsσ2 t|s σ2 t xt + αs(xs xt) µP = αt|sσ2 s σ2 t zt + αsσ2 t|s σ2 t ˆxθ(zt, t) σ2 Q = σ2 t|sσ2 s σ2 t σ2 t|s = σ2 t α2 t α2s σ2 s and these intermediate results: α2 sσ4 t|s σ4 t = SNR(s) SNR(t) σ2 t|s σ2 t = 1 α2 tσ2 s α2sσ2 t = SNR(s) SNR(t) SNR(s) we can write: µP µQ 2 2 2σ2 Q = 1 2σ2 Q αsσ2 t|s σ2 t xt + αs(xs xt) αsσ2 t|s σ2 t ˆxθ(zt, t) α2 sσ4 t|s σ4 t xt + σ2 t σ2 t|s (xs xt) ˆxθ(zt, t) 2 SNR xt + SNR(s) SNR x ˆxθ(zt, t) 2 where we used the shorthand SNR SNR(t) SNR(s) and x xt xs. The optimal wt tends to 1. The optimal wt can be rewritten as follows: w 1 t = 1 + 1 σ2 Qd Eq(x,zt) µP µQ 2 2 (76) " xt + SNR(s) SNR x ˆxθ(zt, t) As T , or equivalently s t and τ = s t 0, the optimal wt tends to 1, corresponding to the unweighted case (forward and backward variance are equal). Published as a conference paper at ICLR 2024 The first term of diffusion loss tends to zero. Define: " xt + SNR(s) SNR x ˆxθ(zt, t) such that the optimal wt is given by w 1 t = 1 ν Then we are interested in the term wt 1 log wt = ν 1 ν + log(1 ν) As τ 0, we have SNR = τ d SNR(t) dt + O(τ 2) x = τ dxϕ(λt) dt + O(τ 2) dt Eq(x,zt) " xt + dxϕ(λt) d log SNR ˆxθ(zt, t) Since ν 0, we can write a series expansion around ν = 0: wt 1 log wt = 1 2ν2 + O(ν3) τ d d SNR(t) dt Eq(x,zt) " xt + dxϕ(λt) d log SNR ˆxθ(zt, t) The first term of the weighted diffusion loss LT is then 0, since as τ 0 we get: d 2τ Et U{τ,2τ,...,1} [wt 1 log wt] = O(τ) (78) Note that, had we simply used wt = 1 + O(τ), we would only be able to prove that this term in the loss is finite, but not whether it is zero. Here, we showed that the additional term in the loss actually tends to zero as T . In fact, we can also observe that, if σP = σQ and therefore wt 1 log wt > 0, the first term in the diffusion loss diverges in the continuous-time limit, so the ELBO is not well-defined. Continuous-time limit of the diffusion loss. We saw that, as τ 0, wt 1 and the first term in the diffusion loss LT tends to zero. The limit of LT then becomes L (x) = lim T LT (x) (79) = lim T Et U{τ,2τ,...,1}Eq(zt|x) " 1 2τσ2 Q µP µQ 2 2 = lim T Et U{τ,2τ,...,1}Eq(zt|x) 2τ d SNR(t) xt + SNR(t) dxϕ(λt) dt τ ˆxθ(zt, t) 2Et U(0,1)Eq(zt|x) xt + dxϕ(λt) d log SNR ˆxθ(zt, t) Published as a conference paper at ICLR 2024 H DIFFENC AS AN SDE A diffusion model may be seen as a discretization of an SDE. The same is true for the depth dependent encoder model. The forward process Eq. (11) can be written as zt = αt|szs + αt(xϕ(t) xϕ(s)) + σt|sϵ . (83) Let 0 < t < 1 such that t = s + t. If we consider the first term after the equality sign we see that αt αs = αs + αt αs = 1 + αt αs αs t t (85) So we get that zt zs = αt αs αt t zs t + αt(xϕ(t) xϕ(s)) + σt|sϵ (86) Considering the second term, we get αt(xϕ(t) xϕ(s)) = αt xϕ(t) xϕ(s) So if we define f t(zs, s) = αt αs αt t zs + αt xϕ(t) xϕ(s) we can write zt zs = f t(zs, s) t + σt|tϵ (89) We will now consider σ2 t|t to be able to rewrite σt|sϵ σ2 t|t = σ2 t α2 t α2s σ2 s (90) σ2 t α2 t σ2 s α2s σ2 t α2 t σ2 s α2s Thus, if we define σ2 t α2 t σ2s we can write zt zs = f t(zs, s) t + g t(s) tϵ (94) We can now take the limit t 0 using the definition t = s + t: f t(zs, s) 1 ds zs + αs dxϕ(s) ds = d log αs ds zs + αs dxϕ(s) So if we use these limits to define the functions f(zt, t, x) = 1 dt zt + αt dxϕ(t) dσ2 t /α2 t dt . 
H DIFFENC AS AN SDE

A diffusion model may be seen as a discretization of an SDE. The same is true for the depth-dependent encoder model. The forward process, Eq. (11), can be written as

z_t = α_{t|s} z_s + α_t ( x_ϕ(t) − x_ϕ(s) ) + σ_{t|s} ϵ . (83)

Let 0 < s < t < 1 such that t = s + ∆t. If we consider the first term after the equality sign, we see that

α_{t|s} = α_t/α_s = (α_s + α_t − α_s)/α_s = 1 + (α_t − α_s)/α_s (84)
= 1 + [ (α_t − α_s)/(α_s ∆t) ] ∆t . (85)

So we get that

z_t − z_s = [ (α_t − α_s)/(α_s ∆t) ] z_s ∆t + α_t ( x_ϕ(t) − x_ϕ(s) ) + σ_{t|s} ϵ . (86)

Considering the second term, we get

α_t ( x_ϕ(t) − x_ϕ(s) ) = α_t [ ( x_ϕ(t) − x_ϕ(s) )/∆t ] ∆t . (87)

So if we define

f_∆t(z_s, s) = [ (α_t − α_s)/(α_s ∆t) ] z_s + α_t ( x_ϕ(t) − x_ϕ(s) )/∆t , (88)

we can write

z_t − z_s = f_∆t(z_s, s) ∆t + σ_{t|s} ϵ . (89)

We now consider σ_{t|s}² in order to rewrite σ_{t|s} ϵ:

σ_{t|s}² = σ_t² − (α_t²/α_s²) σ_s² (90)
= α_t² ( σ_t²/α_t² − σ_s²/α_s² ) (91)
= α_t² [ ( σ_t²/α_t² − σ_s²/α_s² )/∆t ] ∆t . (92)

Thus, if we define

g_∆t(s)² = α_t² ( σ_t²/α_t² − σ_s²/α_s² )/∆t , (93)

we can write

z_t − z_s = f_∆t(z_s, s) ∆t + g_∆t(s) √∆t ϵ . (94)

We can now take the limit ∆t → 0 using the definition t = s + ∆t:

f_∆t(z_s, s) → (1/α_s)(dα_s/ds) z_s + α_s dx_ϕ(s)/ds = (d log α_s/ds) z_s + α_s dx_ϕ(s)/ds , (95)
g_∆t(s)² → α_s² d(σ_s²/α_s²)/ds . (96)

So if we use these limits to define the functions

f(z_t, t, x) = (d log α_t/dt) z_t + α_t dx_ϕ(t)/dt , (97)
g²(t) = α_t² d(σ_t²/α_t²)/dt , (98)

we can write the forward stochastic process when using a time-dependent encoder as

dz = f(z_t, t, x) dt + g(t) dw , (99)

where dw is the increment of a Wiener process over time t. The diffusion process of DiffEnc in the continuous-time limit is therefore similar to the usual SDE for diffusion models (Song et al., 2020b), with an additional contribution to the drift term. Given the drift and the diffusion coefficient, we can write the generative model as a reverse-time SDE (Song et al., 2020b):

dz = [ f(z_t, t, x) − g²(t) ∇_{z_t} log p(z_t) ] dt + g(t) dw̄ , (100)

where dw̄ is a reverse-time Wiener process.

I MOTIVATION FOR CHOICE OF PARAMETERIZATION FOR THE ENCODER

As mentioned in Section 4, we would like our encoding to be helpful for the reconstruction loss at t = 0 and for the latent loss at t = 1. Multiplying the data by α_t would give us these properties, since this is close to the identity at t = 0 and sends everything to 0 at t = 1. Instead of multiplying by α_t, we choose to use α_t², since we still get the desirable properties at t = 0 and t = 1, but it makes some of the mathematical expressions nicer (for example the derivative). Thus, we arrive at the non-trainable parameterization

x_nt(λ_t) = α_t² x . (101)

At t = 1, all the values of x_nt(λ_t) are very close to 0, which should be easy to approximate from z_1 = α_1³ x + σ_1 ϵ, since it is just the mean of the values in z_1. Note that this parameterization gives us a lower latent loss, since the values of α_t³ x are closer to zero than the values of α_t x. However, this is not the same as just using a smaller minimum λ_t in the original formulation, since in the original formulation the diffusion model would still be predicting x, and not α_t² x ≈ 0, at t = 1.

There is still a problem with this formulation: if we look at what happens between t = 0 and t = 1, we see that at some point we will be attempting to approximate v_t = α_t ϵ − σ_t x_nt(x, λ_t) from a very noisy z_t while the values of x_nt(x, λ_t) are still very small. In other words, since z_t is a noisy version of x_nt(x, λ_t) and x_nt(x, λ_t) has very small values, there is not much signal; but as we move away from t = 1, 0 also becomes a worse and worse approximation. This is why we introduce the trainable encoder

x_ϕ(λ_t) = x − σ_t² x + σ_t² y_ϕ(x, λ_t) (102)
= α_t² x + σ_t² y_ϕ(x, λ_t) . (103)

Here we allow the inner encoder y_ϕ(x, λ_t) to add signal dependent on the image at the same pace as we are removing signal via the σ_t² x term. This should give us a better diffusion loss between t = 0 and t = 1, while still keeping x_ϕ(λ_t) very close to x at t = 0.

J CONTINUOUS-TIME LIMIT OF THE DIFFUSION LOSS WITH AN ENCODER

J.1 REWRITING THE LOSS USING SNR

We can express the KL divergence in terms of the SNR:

SNR(t) = α_t² / σ_t² . (104)

We pull α_s σ_{t|s}²/σ_t² outside the norm, expand σ_Q², and use the definition of the SNR to get

(1/(2σ_Q²)) ( α_s² σ_{t|s}⁴ / σ_t⁴ ) = (1/2) ( SNR(s) − SNR(t) ) . (105)

We also see that

σ_{t|s}² / σ_t² = ( σ_t² − α_{t|s}² σ_s² ) / σ_t² = 1 − α_t² σ_s² / (α_s² σ_t²) = 1 − SNR(t)/SNR(s) = ( SNR(s) − SNR(t) ) / SNR(s) . (106)

Inserting this back into Eq. (66), we get

(1/(2σ_Q²)) ||μ_P − μ_Q||^2 (107)
= (1/2) ( SNR(s) − SNR(t) ) || x̂_θ(λ_t) − x_ϕ(λ_t) + [ SNR(s) / (SNR(s) − SNR(t)) ] [ (λ_s − λ_t) σ_t² x̂_θ(λ_t) − ( x_ϕ(λ_s) − x_ϕ(λ_t) ) ] ||^2 . (108)

The KL divergence is then

D_KL( q(z_s|z_t, x) || p_θ(z_s|z_t) ) = (1/2) ( SNR(s) − SNR(t) ) || x̂_θ(λ_t) − x_ϕ(λ_t) + [ SNR(s) / (SNR(s) − SNR(t)) ] [ (λ_s − λ_t) σ_t² x̂_θ(λ_t) − ( x_ϕ(λ_s) − x_ϕ(λ_t) ) ] ||^2 , (109)

with s = (i − 1)/T and t = i/T.
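The two SNR identities in Eqs. (105)–(106) are purely algebraic and can be checked numerically. Below is a small sketch, assuming a variance-preserving schedule with α² = sigmoid(λ); the concrete λ values are arbitrary.

```python
# Numerical check of Eqs. (105)-(106), assuming alpha_t^2 = sigmoid(lambda_t).
import numpy as np

def a2(lam):                                   # alpha^2 from log-SNR
    return 1.0 / (1.0 + np.exp(-lam))

lam_s, lam_t = 2.0, 1.0                        # s < t, so lambda_s > lambda_t
a2s, a2t = a2(lam_s), a2(lam_t)
s2s, s2t = 1 - a2s, 1 - a2t
snr_s, snr_t = a2s / s2s, a2t / s2t            # equal to exp(lam_s), exp(lam_t)

s2_ts = s2t - (a2t / a2s) * s2s                # sigma_{t|s}^2
s2_Q = s2_ts * s2s / s2t                       # sigma_Q^2

lhs_105 = (a2s * s2_ts**2 / s2t**2) / (2 * s2_Q)
rhs_105 = 0.5 * (snr_s - snr_t)
lhs_106 = s2_ts / s2t
rhs_106 = (snr_s - snr_t) / snr_s

print(np.isclose(lhs_105, rhs_105), np.isclose(lhs_106, rhs_106))   # True True
```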
J.2 TAKING THE LIMIT

If we rewrite everything in the loss from Eq. (109) to be with respect to λ_t, we get

L_T(x) = (T/2) E_{ϵ, i∼U{1,T}} [ ( e^{λ_s} − e^{λ_t} ) || x̂_θ(λ_t) − x_ϕ(λ_t) + [ e^{λ_s} / ( e^{λ_s} − e^{λ_t} ) ] [ (λ_s − λ_t) σ_t² x̂_θ(λ_t) − ( x_ϕ(λ_s) − x_ϕ(λ_t) ) ] ||^2 ] , (110)

where s = (i − 1)/T and t = i/T. We now want to take the continuous-time limit. Outside the norm we obtain a derivative with respect to t; inside the norm, we obtain a derivative with respect to λ_t. First we consider e^{λ_s} − e^{λ_t}. For T → ∞ we see that

lim_{T→∞} T ( e^{λ_s} − e^{λ_t} ) = −(d/dt) e^{λ_t} = −e^{λ_t} λ'_t , (112)

where λ'_t is the derivative of λ_t w.r.t. t. Inside the norm, we get for s → t that

[ e^{λ_s} / ( e^{λ_s} − e^{λ_t} ) ] ( x_ϕ(λ_s) − x_ϕ(λ_t) ) = [ e^{λ_s} (λ_s − λ_t) / ( e^{λ_s} − e^{λ_t} ) ] [ ( x_ϕ(λ_s) − x_ϕ(λ_t) ) / (λ_s − λ_t) ] (113)
→ ( e^{λ_t} / e^{λ_t} ) dx_ϕ(λ_t)/dλ_t = dx_ϕ(λ_t)/dλ_t (115)

and

[ e^{λ_s} / ( e^{λ_s} − e^{λ_t} ) ] (λ_s − λ_t) σ_t² x̂_θ(λ_t) = [ e^{λ_s} (λ_s − λ_t) / ( e^{λ_s} − e^{λ_t} ) ] σ_t² x̂_θ(λ_t) (116)
→ ( e^{λ_t} / e^{λ_t} ) σ_t² x̂_θ(λ_t) = σ_t² x̂_θ(λ_t) . (117)

So we get the loss

L_∞(x) = −(1/2) E_{ϵ, t∼U[0,1]} [ λ'_t e^{λ_t} || x̂_θ(λ_t) − x_ϕ(λ_t) + σ_t² x̂_θ(λ_t) − dx_ϕ(λ_t)/dλ_t ||^2 ] . (118)

K USING THE V PARAMETERIZATION IN THE DIFFENC LOSS

In the following subsections we describe the v-prediction parameterization (Salimans & Ho, 2022) and derive the v-prediction loss for the proposed model, DiffEnc. We start by defining

v_t = α_t ϵ − σ_t x_ϕ(λ_t) , (119)
v̂_θ(λ_t) = α_t ϵ̂_θ − σ_t x̂_θ(λ_t) , (120)

which give us (see Appendix K.1)

x_ϕ(λ_t) = α_t z_t − σ_t v_t , (121)
x̂_θ(λ_t) = α_t z_t − σ_t v̂_θ(λ_t) , (122)

where we learn the v-prediction function v̂_θ(λ_t) = v̂_θ(z_{λ_t}, λ_t). In Appendix K.2 we show that, using this parameterization in Eq. (18), the loss becomes

L_∞(x) = −(1/2) E_{ϵ, t∼U[0,1]} [ λ'_t α_t² || v_ϕ(λ_t) − v̂_θ(λ_t) + σ_t ( x̂_θ(λ_t) − (1/σ_t²) dx_ϕ(λ_t)/dλ_t ) ||^2 ] . (123)

As shown in Appendix K.3, the diffusion loss for the trainable encoder from Eq. (15) becomes

L_∞(x) = −(1/2) E_{ϵ, t∼U[0,1]} [ λ'_t α_t² || v_t − v̂_θ + σ_t ( x̂_θ(λ_t) − x_ϕ(λ_t) + y_ϕ(λ_t) − dy_ϕ(λ_t)/dλ_t ) ||^2 ] , (124)

and for the non-trainable encoder

L_∞(x) = −(1/2) E_{ϵ, t∼U[0,1]} [ λ'_t α_t² || v_t − v̂_θ + σ_t ( x̂_θ(λ_t) − x_nt(λ_t) ) ||^2 ] . (125)

Eqs. (124) and (125) are the losses we use in our experiments.

K.1 REWRITING THE V PARAMETERIZATION

In the v parameterization of the loss from Salimans & Ho (2022), v_t is defined as

v_t = α_t ϵ − σ_t x . (126)

We use the generalization

v_t = α_t ϵ − σ_t x_ϕ(x, λ_t) . (127)

Note that since

x_ϕ(x, λ_t) = ( z_t − σ_t ϵ ) / α_t (128)

and

α_t² + σ_t² = 1 , (129)

we get

x_ϕ(x, λ_t) = ( z_t − σ_t ϵ ) / α_t (130)
= ( (α_t² + σ_t²) z_t − σ_t (α_t² + σ_t²) ϵ ) / α_t (131)
= ( α_t + σ_t²/α_t ) z_t − ( σ_t α_t + σ_t³/α_t ) ϵ (132)
= α_t z_t + (σ_t²/α_t) z_t − σ_t α_t ϵ − (σ_t³/α_t) ϵ (133)
= α_t z_t − σ_t α_t ϵ + (σ_t²/α_t) ( z_t − σ_t ϵ ) (134)
= α_t z_t − σ_t α_t ϵ + σ_t² x_ϕ(x, λ_t) (135)
= α_t z_t − σ_t ( α_t ϵ − σ_t x_ϕ(x, λ_t) ) (136)
= α_t z_t − σ_t v_t , (137)

that is,

x_ϕ(x, λ_t) = α_t z_t − σ_t v_t . (139)

Therefore we define

v̂_θ(λ_t) = α_t ϵ̂_θ − σ_t x̂_θ(λ_t) , (140)

which in the same way gives us

x̂_θ(z_{λ_t}, λ_t) = α_t z_t − σ_t v̂_θ , (141)

where we learn v̂_θ.

K.2 V PARAMETERIZATION IN CONTINUOUS DIFFUSION LOSS

For the v parameterization we have

v_ϕ(λ_t) = α_t ϵ − σ_t x_ϕ(λ_t) , (142)

where ϵ is drawn from a standard normal distribution N(0, I), and

x_ϕ(λ_t) = α_t z_t − σ_t v_ϕ(λ_t) . (143)

So we set

x̂_θ(λ_t) = α_t z_t − σ_t v̂_θ(λ_t) , (144)
v̂_θ(λ_t) = α_t ϵ̂_θ − σ_t x̂_θ(λ_t) , (145)

where we learn v̂_θ(λ_t). If we rewrite the term inside the square brackets of Eq. (118) using the v parameterization, we get

λ'_t e^{λ_t} || x̂_θ(λ_t) − x_ϕ(λ_t) + σ_t² x̂_θ(λ_t) − dx_ϕ(λ_t)/dλ_t ||^2 (146)
= λ'_t e^{λ_t} || σ_t v_ϕ(λ_t) − σ_t v̂_θ(λ_t) + σ_t² x̂_θ(λ_t) − dx_ϕ(λ_t)/dλ_t ||^2 (147)
= λ'_t α_t² || v_ϕ(λ_t) − v̂_θ(λ_t) + σ_t ( x̂_θ(λ_t) − (1/σ_t²) dx_ϕ(λ_t)/dλ_t ) ||^2 , (148)

where we used e^{λ_t} σ_t² = α_t². So we get the loss

L_∞(x) = −(1/2) E_{ϵ, t∼U[0,1]} [ λ'_t α_t² || v_ϕ(λ_t) − v̂_θ(λ_t) + σ_t ( x̂_θ(λ_t) − (1/σ_t²) dx_ϕ(λ_t)/dλ_t ) ||^2 ] . (149)
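The identities in Eqs. (119)–(122) follow from α_t² + σ_t² = 1 and can be checked in a few lines. The sketch below uses random data as a stand-in for the encoded image; the λ value is arbitrary.

```python
# Quick numerical sanity check of the v-parameterization identities (Appendix K.1).
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5                                       # some log-SNR value
a2 = 1.0 / (1.0 + np.exp(-lam))                 # alpha_t^2 = sigmoid(lambda_t)
a, s = np.sqrt(a2), np.sqrt(1.0 - a2)           # so alpha_t^2 + sigma_t^2 = 1

x_phi = rng.uniform(-1, 1, size=(3, 32, 32))    # stand-in for the encoded image
eps = rng.normal(size=x_phi.shape)

z_t = a * x_phi + s * eps                       # forward sample z_t ~ q(z_t | x)
v_t = a * eps - s * x_phi                       # Eq. (119)

print(np.allclose(a * z_t - s * v_t, x_phi))    # Eq. (121): recovers x_phi
print(np.allclose(s * z_t + a * v_t, eps))      # and, analogously, recovers the noise
```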
K.3 V PARAMETERIZATION OF CONTINUOUS DIFFUSION LOSS WITH ENCODER

We recall our two parameterizations of the encoder:

x_ϕ(λ_t) = x − σ_t² x + σ_t² y_ϕ(x, λ_t) (150)
= α_t² x + σ_t² y_ϕ(x, λ_t) , (151)
x_nt(λ_t) = α_t² x . (152)

We see that

dx_ϕ(λ_t)/dλ_t = α_t² σ_t² x + σ_t² dy_ϕ(λ_t)/dλ_t − α_t² σ_t² y_ϕ (153)

and

dx_nt(λ_t)/dλ_t = α_t² σ_t² x , (154)

as mentioned before. We first consider the loss for our trainable encoder. Focusing on the part of Eq. (123) inside the norm, and dropping the dependencies on λ_t for brevity, we get

v_ϕ − v̂_θ + σ_t ( x̂_θ − (1/σ_t²) dx_ϕ/dλ_t ) (155)
= v_ϕ − v̂_θ + σ_t ( x̂_θ − (1/σ_t²) [ α_t² σ_t² x + σ_t² dy_ϕ/dλ_t − α_t² σ_t² y_ϕ ] ) (156)
= v_ϕ − v̂_θ + σ_t x̂_θ − α_t² σ_t x − σ_t dy_ϕ/dλ_t + α_t² σ_t y_ϕ (157)
= v_ϕ − v̂_θ + σ_t ( x̂_θ − α_t² x − dy_ϕ/dλ_t + α_t² y_ϕ ) (158)
= v_ϕ − v̂_θ + σ_t ( x̂_θ − α_t² x + (1 − σ_t²) y_ϕ − dy_ϕ/dλ_t ) (159)
= v_ϕ − v̂_θ + σ_t ( x̂_θ − α_t² x − σ_t² y_ϕ + y_ϕ − dy_ϕ/dλ_t ) (160)
= v_ϕ − v̂_θ + σ_t ( x̂_θ − x_ϕ + y_ϕ − dy_ϕ/dλ_t ) . (161)

So for the trainable encoder, we get the loss

L_∞(x) = −(1/2) E_{ϵ, t∼U[0,1]} [ λ'_t α_t² || v_t − v̂_θ + σ_t ( x̂_θ(λ_t) − x_ϕ(λ_t) + y_ϕ(λ_t) − dy_ϕ(λ_t)/dλ_t ) ||^2 ] . (162)

For the non-trainable encoder, if we again focus on the part of Eq. (123) inside the norm and drop the dependencies on λ_t for brevity, we get

v_ϕ − v̂_θ + σ_t ( x̂_θ − (1/σ_t²) dx_nt/dλ_t ) = v_ϕ − v̂_θ + σ_t ( x̂_θ − (1/σ_t²) α_t² σ_t² x ) (164)
= v_ϕ − v̂_θ + σ_t x̂_θ − σ_t α_t² x (165)
= v_ϕ − v̂_θ + σ_t ( x̂_θ − α_t² x ) (166)
= v_ϕ − v̂_θ + σ_t ( x̂_θ − x_nt ) . (167)

So for the non-trainable encoder, we get the loss

L_∞(x) = −(1/2) E_{ϵ, t∼U[0,1]} [ λ'_t α_t² || v_t − v̂_θ + σ_t ( x̂_θ(λ_t) − x_nt(λ_t) ) ||^2 ] . (168)

L CONSIDERING LOSS FOR EARLY AND LATE TIMESTEPS

Let us consider what happens to the expression inside the norm of our loss, Eq. (19), for t close to zero. Since α_t → 1 and σ_t → 0 for t → 0, and v̂_θ = α_t ϵ̂_θ − σ_t x̂_θ(λ_t), we get for the trainable encoder

|| v_t − v̂_θ + σ_t ( x̂_θ(λ_t) − x_ϕ(λ_t) + y_ϕ(λ_t) − dy_ϕ(λ_t)/dλ_t ) ||^2 (169)
→ || v_t − v̂_θ ||^2 = || ϵ − ϵ̂_θ ||^2 , (170)

and for the non-trainable encoder

|| v_t − v̂_θ + σ_t ( x̂_θ(λ_t) − x_ϕ(λ_t) ) ||^2 (171)
→ || v_t − v̂_θ ||^2 = || ϵ − ϵ̂_θ ||^2 . (172)

So in both cases we get the same objective as for the epsilon parameterization used in Kingma et al. (2021).

On the other hand, since σ_t → 1 and v_t → −x_ϕ(λ_t) as t → 1, we get for the trainable encoder

|| v_t − v̂_θ + σ_t ( x̂_θ(λ_t) − x_ϕ(λ_t) + y_ϕ(λ_t) − dy_ϕ(λ_t)/dλ_t ) ||^2
→ || −v̂_θ + x̂_θ(λ_t) − 2x_ϕ(λ_t) + y_ϕ(λ_t) − dy_ϕ(λ_t)/dλ_t ||^2 .

Assuming x_ϕ(λ_t) ≈ x̂_θ(λ_t), this loss is small at t ≈ 1 if

v̂_θ ≈ −x_ϕ(λ_t) + y_ϕ(λ_t) − dy_ϕ(λ_t)/dλ_t
= −x + σ_t² x − σ_t² y_ϕ(x, λ_t) + y_ϕ(λ_t) − dy_ϕ(λ_t)/dλ_t
≈ −x + x − y_ϕ(λ_t) + y_ϕ(λ_t) − dy_ϕ(λ_t)/dλ_t
= −dy_ϕ(λ_t)/dλ_t .

So we are saying that at t = 1, v̂_θ ≈ −dy_ϕ(λ_t)/dλ_t. Thus the encoder should be able to guide the diffusion model.

For the non-trainable encoder, we get

|| v_t − v̂_θ + σ_t ( x̂_θ(λ_t) − x_ϕ(λ_t) ) ||^2 (179)
→ || −x_ϕ(λ_t) + x̂_θ(λ_t) + x̂_θ(λ_t) − x_ϕ(λ_t) ||^2 (180)
= || 2x̂_θ(λ_t) − 2x_ϕ(λ_t) ||^2 , (181)

where we also used v̂_θ → −x̂_θ(λ_t). So in this case, we are just saying that x̂_θ(λ_t) should be close to x_ϕ(λ_t). However, note that since x_ϕ(λ_t) = α_t² x here, we have x̂_θ(λ_t) ≈ x_ϕ(λ_t) ≈ 0 for t = 1. So this is only saying that it should be easy to guess x_ϕ(λ_t) ≈ 0 for t ≈ 1, but it will not help the diffusion model guess the signal, since there is no signal left in this case.
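To make the behaviour discussed above concrete, the following minimal sketch evaluates the per-sample v-space integrand for the non-trainable encoder, Eq. (168), near t = 0 and t = 1. The linear log-SNR schedule, its endpoints, and the zero-valued dummy predictions are illustrative assumptions; with a dummy network only the structure of the expression is meaningful, not the numbers.

```python
# Per-sample integrand of Eq. (168) (non-trainable encoder) at early and late t.
import numpy as np

rng = np.random.default_rng(0)
lmbda_max, lmbda_min = 13.3, -5.0                    # assumed fixed schedule endpoints

def schedule(t):
    lam = lmbda_max + (lmbda_min - lmbda_max) * t    # lambda_t, assumed linear in t
    a2 = 1.0 / (1.0 + np.exp(-lam))
    return lam, np.sqrt(a2), np.sqrt(1 - a2)

def integrand(t, x, v_hat, x_hat):
    lam, a, s = schedule(t)
    lam_prime = lmbda_min - lmbda_max                # d lambda_t / dt for this schedule
    eps = rng.normal(size=x.shape)
    x_nt = a**2 * x                                  # non-trainable encoder, Eq. (152)
    v_t = a * eps - s * x_nt                         # Eq. (119) with x_phi = x_nt
    resid = v_t - v_hat + s * (x_hat - x_nt)         # Eq. (167)
    return -0.5 * lam_prime * a**2 * np.sum(resid ** 2)

x = rng.uniform(-1, 1, size=(3, 32, 32))
# The correction term carries a factor sigma_t, so it vanishes at t ~ 0
# (epsilon-prediction limit) and dominates the residual only as t -> 1.
for t in (0.01, 0.99):
    print(t, integrand(t, x, v_hat=np.zeros_like(x), x_hat=np.zeros_like(x)))
```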
M DETAILED LOSS COMPARISON FOR DIFFENC AND VDMV ON MNIST

Table 4 shows the average losses of the models trained on MNIST. We see the same pattern as for the small models trained on CIFAR-10: all models with a trainable encoder achieve the same or better diffusion loss than the VDMv model. For the fixed noise schedules, the latent loss is always better for the DiffEnc models than for VDMv; for the trainable noise schedule, however, DiffEnc with a learned encoder appears to sacrifice some latent loss to gain a better diffusion loss.

Table 4: Comparison of the different components of the loss for DiffEnc-8-2, DiffEnc-8-nt and VDMv-8 on MNIST. All quantities are in bits per dimension (BPD), with standard error, 5 seeds, 2M steps. Noise schedules are either fixed or with trainable endpoints.

Model        | Noise     | Total         | Latent             | Diffusion     | Reconstruction
VDMv-8       | fixed     | 0.370 ± 0.002 | 0.0045 ± 0.0       | 0.360 ± 0.002 | 0.006 (± 3×10⁻⁵)
VDMv-8       | trainable | 0.366 ± 0.001 | 0.0042 (± 5×10⁻⁵)  | 0.361 ± 0.003 | 0.001 (± 2×10⁻⁵)
DiffEnc-8-2  | fixed     | 0.367 ± 0.001 | 0.0009 (± 3×10⁻⁶)  | 0.360 ± 0.001 | 0.006 (± 3×10⁻⁵)
DiffEnc-8-2  | trainable | 0.363 ± 0.002 | 0.0064 (± 8×10⁻⁵)  | 0.355 ± 0.002 | 0.001 (± 2×10⁻⁵)
DiffEnc-8-nt | fixed     | 0.378 ± 0.002 | 1.6×10⁻⁵ ± 0.0     | 0.371 ± 0.002 | 0.006 (± 3×10⁻⁵)
DiffEnc-8-nt | trainable | 0.373 ± 0.001 | 0.0021 (± 3×10⁻⁵)  | 0.369 ± 0.001 | 0.002 (± 5×10⁻⁵)

N DETAILED LOSS COMPARISON FOR DIFFENC-32-2 AND VDMV-32 ON CIFAR-10

To explore the significance of the encoder size, we trained a DiffEnc-32-2, that is, a large diffusion model with a smaller encoder; see Table 5. We see that after 2M steps the diffusion loss for the DiffEnc model is smaller than for VDMv, although not significantly so. When inspecting a plot of the losses of the two models, the losses appear to be diverging, but one would have to train the DiffEnc-32-2 model for longer to be certain. We did not continue this experiment because of the large compute cost.

Table 5: Comparison of the different components of the loss for DiffEnc-32-2 and VDMv-32 with fixed noise schedule on CIFAR-10. All quantities are in bits per dimension (BPD) with standard error over 3 seeds, comparison at 2M steps.

Model        | Total         | Latent            | Diffusion     | Reconstruction
VDMv-32      | 2.666 ± 0.002 | 0.0012 ± 0.0      | 2.654 ± 0.003 | 0.01 (± 4×10⁻⁶)
DiffEnc-32-2 | 2.660 ± 0.006 | 0.0007 (± 3×10⁻⁶) | 2.649 ± 0.006 | 0.01 (± 2×10⁻⁶)

O DETAILED LOSS COMPARISON FOR DIFFENC AND VDMV ON IMAGENET32

On ImageNet32, our experiments show the same pattern as for the small models on CIFAR-10 and MNIST; see Table 6. The diffusion loss is the same for the two models, but the latent loss is better for DiffEnc. Since ImageNet is more complex than CIFAR-10, we might need an even larger base diffusion model to achieve a difference in diffusion loss.

Table 6: Comparison of the different components of the loss for DiffEnc-32-8 and VDMv-32 with fixed noise schedule on ImageNet32. All quantities are in bits per dimension (BPD) with standard error over 3 seeds; models are trained for 1.5M steps.

Model        | Total         | Latent            | Diffusion     | Reconstruction
VDMv-32      | 3.461 ± 0.002 | 0.0014 ± 0.0      | 3.449 ± 0.002 | 0.01 (± 1×10⁻⁵)
DiffEnc-32-8 | 3.461 ± 0.002 | 0.0007 (± 9×10⁻⁷) | 3.450 ± 0.002 | 0.01 (± 1×10⁻⁵)

P FURTHER FUTURE WORK

Our approach could be combined with various existing methods, e.g., latent diffusion (Vahdat et al., 2021; Rombach et al., 2022) or discriminator guidance (Kim et al., 2022a). If one were to succeed in making the encoder produce smaller representations, one might also combine it with consistency regularization (Sinha & Dieng, 2021) to improve the learned representations.

Q MODEL STRUCTURE AND TRAINING

Code can be found on GitHub: https://github.com/bemigini/DiffEnc. All our diffusion models use the same overall structure: n ResNet blocks, then a middle block of 1 ResNet block, 1 self-attention block and 1 ResNet block, and finally n more ResNet blocks. We train diffusion models with n = 8 on MNIST and CIFAR-10 and models with n = 32 on CIFAR-10 and ImageNet32. All ResNet blocks in the diffusion models preserve the dimensions of the original images (28×28 for MNIST, 32×32 for CIFAR-10 and ImageNet32) and have 128 output channels for models on MNIST and CIFAR-10 and 256 output channels for models on ImageNet32, following Kingma et al. (2021). We use both a fixed noise schedule with λ_max = 13.3 and λ_min = −5 and a trainable noise schedule where we learn λ_max and λ_min.
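The following PyTorch sketch illustrates the backbone layout just described (n resolution-preserving ResNet blocks, a ResNet/self-attention/ResNet middle block, then n more ResNet blocks). The block internals, the attention details, and the omission of time conditioning are simplifying assumptions for illustration; the actual models follow Kingma et al. (2021).

```python
# Schematic sketch of the resolution-preserving backbone layout; not the exact model.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.GroupNorm(8, ch), nn.SiLU(), nn.Conv2d(ch, ch, 3, padding=1),
            nn.GroupNorm(8, ch), nn.SiLU(), nn.Conv2d(ch, ch, 3, padding=1),
        )
    def forward(self, x):
        return x + self.net(x)                  # residual; resolution preserved

class SelfAttention2d(nn.Module):
    def __init__(self, ch, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.norm = nn.LayerNorm(ch)
    def forward(self, x):
        b, c, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))   # (b, h*w, c)
        out, _ = self.attn(tokens, tokens, tokens)
        return x + out.transpose(1, 2).reshape(b, c, h, w)

class Backbone(nn.Module):
    def __init__(self, n=8, ch=128, in_ch=3):
        super().__init__()
        self.inp = nn.Conv2d(in_ch, ch, 3, padding=1)
        self.pre = nn.Sequential(*[ResBlock(ch) for _ in range(n)])
        self.middle = nn.Sequential(ResBlock(ch), SelfAttention2d(ch), ResBlock(ch))
        self.post = nn.Sequential(*[ResBlock(ch) for _ in range(n)])
        self.out = nn.Conv2d(ch, in_ch, 3, padding=1)
    def forward(self, z):
        return self.out(self.post(self.middle(self.pre(self.inp(z)))))

print(Backbone(n=8)(torch.randn(2, 3, 32, 32)).shape)   # torch.Size([2, 3, 32, 32])
```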
For our encoder, we use a very similar overall structure to that of the diffusion model: m ResNet blocks, then a middle block of 1 ResNet block, 1 self-attention block and 1 ResNet block, and finally m more ResNet blocks. However, for the encoder with m = 2 we use max-pooling after each of the first m ResNet blocks and a transposed convolution after each of the last m ResNet blocks; for encoders with m = 4 we use max-pooling after every other of the first m ResNet blocks and a transposed convolution after every other of the last m ResNet blocks; and for encoders with m = 8 we use max-pooling after every fourth of the first m ResNet blocks and a transposed convolution after every fourth of the last m ResNet blocks. Thus, the encoder downscales to, and upscales from, resolutions 14×14 and 7×7 on MNIST and 16×16 and 8×8 on CIFAR-10 and ImageNet32.

We run experiments with n = 8, m = 2 on MNIST and CIFAR-10, n = 8, m = 4 and n = 32, m = 4 on CIFAR-10, and n = 32, m = 8 on ImageNet32. We trained 5 seeds for the small models (n = 8), except for the model with diffusion size 8 and encoder size 4 on CIFAR-10, where we trained 3 seeds. We trained 3 seeds for the large models (n = 32). For models on MNIST and CIFAR-10 we used a batch size of 128 and no gradient clipping. For models on ImageNet32 we used a batch size of 256 and no gradient clipping.

R DATASETS

We considered three datasets:

MNIST: The MNIST dataset (LeCun et al., 1998) as fetched by the tensorflow datasets package. 60,000 images were used for training and 10,000 images for testing. License: unknown.

CIFAR-10: The CIFAR-10 dataset as fetched from the tensorflow datasets package (https://www.tensorflow.org/datasets/catalog/cifar10), originally collected by Krizhevsky et al. (2009). 50,000 images were used for training and 10,000 images for testing. License: unknown.

ImageNet 32×32: The official downsampled version of ImageNet (Chrabaszcz et al., 2017) from the ImageNet website: https://image-net.org/download-images.php.

S ENCODER EXAMPLES ON MNIST

Fig. 4 provides an example of the encodings we get from MNIST when using DiffEnc with a learned encoder.

Figure 4: Encoded MNIST images from DiffEnc-8-2. Encoded images are close to the identity up to t = 0.7. From t = 0.8 to t = 0.9 the encoder slightly blurs the numbers, and from t = 0.9 it makes the background lighter while keeping the high contrast in the middle of the image. Intuitively, the encoder improves the latent loss by bringing the average pixel value close to 0.

T SAMPLES FROM MODELS

Examples of samples from our large trained models, DiffEnc-32-4 and VDMv-32, can be seen in Fig. 5.

Figure 5: 100 unconditional samples from DiffEnc-32-4 (above) and VDMv-32 (below) after 8 million training steps.
U USING A LARGER ENCODER FOR A SMALL DIFFUSION MODEL

As we can observe in Table 7, when the encoder's size is increased, the average diffusion loss is slightly smaller than that of VDM, albeit not significantly so. We propose two potential explanations for this: (1) longer training may be needed to achieve a significant difference; for DiffEnc-32-4 and VDMv-32, we saw different trends in the loss after about 2 million steps, with the loss of the DiffEnc models decreasing more per step, but it took further training with this trend before a substantial divergence in diffusion loss appeared; (2) a larger diffusion model may be required to fully exploit the encoder.

Table 7: Comparison of the different components of the loss for DiffEnc-8-4 and VDMv-8 on CIFAR-10 with fixed noise schedule after 1.3M steps. All quantities are in bits per dimension (BPD), with standard error; 3 seeds for DiffEnc-8-4, 5 seeds for VDMv-8.

Model       | Total         | Latent            | Diffusion     | Reconstruction
VDMv-8      | 2.794 ± 0.004 | 0.0012 ± 0.0      | 2.782 ± 0.004 | 0.010 (± 1×10⁻⁵)
DiffEnc-8-4 | 2.789 ± 0.002 | 0.0006 (± 2×10⁻⁶) | 2.778 ± 0.002 | 0.010 (± 1×10⁻⁵)

V FID SCORES

Although we did not optimize our models for the visual quality of samples, we provide FID scores of DiffEnc-32-4 and VDMv-32 on CIFAR-10 in Table 8. The FID scores of the two models are similar. The score depends considerably on whether it is computed with respect to the train or the test set and on how many model samples are used: scores are better when using more samples from the model and, as can be expected, better when computed with respect to the train set than with respect to the test set.

Table 8: Comparison of the mean FID scores with standard error for DiffEnc-32-4 and VDMv-32 on CIFAR-10 with fixed noise schedule after 8M steps, 3 seeds. We provide FID scores on 10K and 50K samples, with respect to both the train and the test set.

Model        | FID 10K train | FID 10K test | FID 50K train | FID 50K test
VDMv-32      | 14.8 ± 0.2    | 18.9 ± 0.2   | 11.2 ± 0.2    | 14.9 ± 0.2
DiffEnc-32-4 | 14.6 ± 0.8    | 18.5 ± 0.7   | 11.1 ± 0.8    | 15.0 ± 0.7

W SUM HEATMAP OF ALL TIMESTEPS

A heatmap of the changes to x_t for all timesteps t and all ten CIFAR-10 classes can be found in Fig. 6. Recall in the following that all pixel values are scaled to the range (−1, 1) before they are given as input to the model. The DiffEnc model with a trainable encoder is initialized with y_ϕ(λ_t) = 0, that is, with no contribution from the trainable part. This means that, had we made this heatmap at initialization, the images would be blue where the channel values are greater than 0, red where they are less than 0, and white where they are zero. However, we see that after training, the encoder behaves differently around edges for t < 0.8. For example, there is a white line in the middle of the cat in the second row which is not subtracted from, probably to preserve this edge in the image, and there is an extra outline around the whole cat. We also see that for t > 0.8 the encoder's behaviour becomes much more global. In the fourth row, the encoder adds to the entire middle of the image, including the white line on the horse, which would have been subtracted from had the encoder behaved as at initialization. Thus, the encoder learns to do something different from how it was initialized, and what it learns differs across timesteps.

Figure 6: Change of the encoded image over a range of depths: (x_t − x_s)/(t − s) for t = 0.1, ..., 1.0 and s = t − 0.1. Changes have been summed over the channels, with red and blue denoting positive and negative changes, respectively. For t closer to 0 the changes are finer and seem to enhance high-contrast edges, while for t ≈ 1 they become more global.
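The change maps shown in Figures 2 and 6 are straightforward to compute from encoder outputs. The sketch below is a minimal, hedged example: the placeholder encoder and the linear log-SNR schedule stand in for the trained model and its schedule, and only the summation over channels matches the figure construction exactly.

```python
# Sketch of the per-pixel change map (x_t - x_s)/(t - s), summed over channels.
import numpy as np

lmbda_max, lmbda_min = 13.3, -5.0

def lmbda(t):
    return lmbda_max + (lmbda_min - lmbda_max) * t   # assumed fixed schedule

def encoder(x, lam):                                 # placeholder for x_phi(lambda_t)
    return (1.0 / (1.0 + np.exp(-lam))) * x

def change_map(x, t, dt=0.1):
    s = t - dt
    x_t = encoder(x, lmbda(t))
    x_s = encoder(x, lmbda(s))
    return ((x_t - x_s) / (t - s)).sum(axis=0)       # sum over channels -> (H, W)

x = np.random.default_rng(0).uniform(-1, 1, size=(3, 32, 32))
maps = {t: change_map(x, t) for t in (0.4, 0.6, 0.8, 1.0)}
print({t: (m.min(), m.max()) for t, m in maps.items()})
```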