# Diffusion Models With Learned Adaptive Noise

Subham Sekhar Sahoo, Cornell Tech, NYC, USA. ssahoo@cs.cornell.edu
Aaron Gokaslan, Cornell Tech, NYC, USA. akg87@cs.cornell.edu
Chris De Sa, Cornell University, Ithaca, USA. cdesa@cs.cornell.edu
Volodymyr Kuleshov, Cornell Tech, NYC, USA. kuleshov@cornell.edu

Diffusion models have gained traction as powerful algorithms for synthesizing high-quality images. Central to these algorithms is the diffusion process, a set of equations that maps data to noise in a way that can significantly affect performance. In this paper, we explore whether the diffusion process can be learned from data. Our work is grounded in Bayesian inference and seeks to improve log-likelihood estimation by casting the learned diffusion process as an approximate variational posterior that yields a tighter lower bound (ELBO) on the likelihood. A widely held assumption is that the ELBO is invariant to the noise process: our work dispels this assumption and proposes multivariate learned adaptive noise (MULAN), a learned diffusion process that applies noise at different rates across an image. Specifically, our method relies on a multivariate noise schedule that is a function of the data, which ensures that the ELBO is no longer invariant to the choice of the noise schedule as in previous works. Empirically, MULAN sets a new state-of-the-art in density estimation on CIFAR-10 and ImageNet and reduces the number of training steps by 50%. We provide the code¹, along with a blog post and video tutorial, on the project page: https://s-sahoo.com/MuLAN

1 Introduction

Diffusion models, inspired by the physics of heat diffusion, have gained traction as powerful tools for generative modeling, capable of synthesizing realistic, high-quality images [51, 16, 43, 14]. Central to these algorithms is the diffusion process, a gradual mapping of clean images into white noise. The reverse of this mapping defines the data-generating process we seek to learn; hence, its choice can significantly impact performance [22]. The conventional approach adopts a diffusion process derived from the laws of thermodynamics, which, albeit simple and principled, may be suboptimal because it does not adapt to the dataset. In this study, we investigate whether the notion of diffusion can instead be learned from data.

Our motivating goal is to perform accurate log-likelihood estimation and probabilistic modelling, and we take an approach grounded in Bayesian inference [23]. We view the diffusion process as an approximate variational posterior: learning this process induces a tighter lower bound (ELBO) on the marginal likelihood of the data. Although previous work argued that the ELBO objective of a diffusion model is invariant to the choice of diffusion process [20, 22], we show that this claim is only true for the simplest types of univariate Gaussian noise: we identify a broader class of noising processes whose optimization yields significant performance gains.

¹ https://github.com/s-sahoo/MuLAN

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Figure 1: (Left) Comparison of noise schedule properties: multivariate learned adaptive noise schedule (MULAN, ours) versus a typical scalar noise schedule. Unlike scalar noise schedules, MULAN's multivariate and input-adaptive properties improve likelihood. (Right) Likelihood in bits-per-dimension (BPD) on CIFAR-10 without data augmentation.
Specifically, we propose a new diffusion process, multivariate learned adaptive noise (MULAN), which augments classical diffusion models [51, 20] with three innovations: a per-pixel polynomial noise schedule, an adaptive input-conditional noising process, and auxiliary latent variables. In practice, this method learns the schedule by which Gaussian noise is applied to different parts of an image and allows tuning this schedule to each image instance. Our learned diffusion process yields improved log-likelihood estimates on two standard image datasets, CIFAR-10 and ImageNet. Remarkably, we achieve state-of-the-art performance with less than half the training time of previous methods. Our method also requires no modifications to the underlying UNet architecture, which makes it compatible with most existing diffusion algorithms.

Contributions. In summary, our paper makes the following contributions:
1. We demonstrate that the ELBO of a diffusion model is not invariant to the choice of noise process for many types of noise, thus dispelling a common assumption in the field.
2. We introduce MULAN, a learned noise process that adaptively adds multivariate Gaussian noise at different rates across an image in a way that is conditioned on arbitrary context (including the image itself).
3. We empirically demonstrate that learning the diffusion process speeds up training: MULAN matches the previous state-of-the-art models using 2x less compute and achieves a new state-of-the-art in density estimation on CIFAR-10 and ImageNet.

2 Background

A diffusion process q transforms an input datapoint $x_0$, sampled from a distribution $q(x_0)$, into a sequence of noisy latent variables $x_t$ for $t \in [0, 1]$ by progressively adding Gaussian noise of increasing magnitude [51, 16, 53]. The marginal distribution of each latent is defined by $q(x_t|x_0) = \mathcal{N}(x_t; \alpha_t x_0, \sigma_t^2 I)$, where the diffusion parameters $\alpha_t, \sigma_t \in \mathbb{R}_+$ implicitly define a noise schedule as a function of t, such that $\nu(t) = \alpha_t^2/\sigma_t^2$ is a monotonically decreasing function of t. Given any discretization of time into T timesteps of width 1/T, we define $t(i) = i/T$ and $s(i) = (i-1)/T$, and we use $x_{0:1}$ to denote the subset of variables associated with these timesteps; the forward process q can be shown to factorize into the Markov chain $q(x_{0:1}) = q(x_0)\prod_{i=1}^{T} q(x_{t(i)}|x_{s(i)})$.

The diffusion model $p_\theta$ is defined by a neural network (with parameters θ) used to denoise the forward process q. Given a discretization of time into T steps, p factorizes as $p_\theta(x_{0:1}) = p_\theta(x_1)\prod_{i=1}^{T} p_\theta(x_{s(i)}|x_{t(i)})$. We treat the $x_t$ for t > 0 as latent variables and fit $p_\theta$ by maximizing the evidence lower bound (ELBO) on the marginal log-likelihood:
$$\log p_\theta(x_0) = \mathrm{ELBO}(p_\theta, q) + D_{\mathrm{KL}}[q(x_{t(1):t(T)}|x_0)\,\|\,p_\theta(x_{t(1):t(T)}|x_0)] \geq \mathrm{ELBO}(p_\theta, q) \tag{1}$$

In most works, the noise schedule, as defined by ν(t), is either fixed or treated as a hyperparameter [16, 3, 18]. Chen [3] and Hoogeboom et al. [18] show that the noise schedule can have a significant impact on sample quality. Kingma et al. [20] consider learning ν(t), but argue that the KL divergence terms in the ELBO are invariant to the choice of function ν, except for its endpoint values ν(0), ν(1), which they set to hand-specified constants in their experiments. They only consider learning ν for the purpose of minimizing the variance of the gradient of the ELBO. In this work, we show that the ELBO is not invariant to more complex forward processes.
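To make the forward process above concrete, here is a minimal sketch of sampling $x_t \sim q(x_t|x_0)$ under a scalar, variance-preserving schedule with $\alpha_t^2 = \mathrm{sigmoid}(-\gamma(t))$ and $\sigma_t^2 = \mathrm{sigmoid}(\gamma(t))$ (the same parameterization used later in Sec. 3.2.2). This is not the authors' code; the linear γ(t) and all function names are illustrative assumptions.

```python
import torch

def gamma_linear(t, gamma_min=-13.3, gamma_max=5.0):
    """Illustrative scalar schedule: gamma increases in t, so SNR = exp(-gamma) decreases."""
    return gamma_min + (gamma_max - gamma_min) * t

def forward_marginal_sample(x0, t, gamma_fn=gamma_linear):
    """Draw x_t ~ q(x_t | x_0) = N(alpha_t * x0, sigma_t^2 I) for a variance-preserving process."""
    gamma_t = gamma_fn(t)
    alpha_t = torch.sigmoid(-gamma_t).sqrt()   # alpha_t^2 = sigmoid(-gamma_t)
    sigma_t = torch.sigmoid(gamma_t).sqrt()    # sigma_t^2 = sigmoid(gamma_t)
    eps = torch.randn_like(x0)
    return alpha_t * x0 + sigma_t * eps, eps

x0 = torch.randn(4, 3, 32, 32)                 # a batch of "images"
t = torch.rand(4, 1, 1, 1)                     # one time per example, broadcast over pixels
xt, eps = forward_marginal_sample(x0, t)
```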
3 Diffusion Models With Multivariate Learned Adaptive Noise

Here, we introduce a new diffusion process, multivariate learned adaptive noise (MULAN), which introduces three innovations: a per-pixel polynomial noise schedule, a conditional noising process, and auxiliary-variable reverse diffusion. We describe these below.

3.1 Why Learned Diffusion?

Our goal is to perform accurate density estimation and probabilistic modelling, and we take an approach grounded in Bayesian inference [23]. Notice that the gap between the evidence lower bound $\mathrm{ELBO}(p, q)$ and the marginal log-likelihood (MLL) in Eq. 1 is precisely the KL divergence $D_{\mathrm{KL}}[q(x_{t(1):t(T)}|x_0)\,\|\,p_\theta(x_{t(1):t(T)}|x_0)]$ between the diffusion process q over the latents $x_t$ and the true posterior of the diffusion model. The diffusion process plays the role of a variational posterior q in ELBO(p, q); optimizing q thus tightens the gap (MLL − ELBO). This observation suggests that the ELBO can be made tighter by choosing a diffusion process q that is closer to the true posterior $p_\theta(x_{t(1):t(T)}|x_0)$. In fact, the key idea of variational inference is to optimize $\max_{q \in \mathcal{Q}} \mathrm{ELBO}(p, q)$ over a family of approximate posteriors $\mathcal{Q}$ to induce a tighter ELBO [23]. Most diffusion algorithms, however, optimize $\max_{p \in \mathcal{P}} \mathrm{ELBO}(p, q)$ within some family $\mathcal{P}$ with a fixed q. Our work seeks to jointly optimize $\max_{p \in \mathcal{P}, q \in \mathcal{Q}} \mathrm{ELBO}(p, q)$; we show in our experiments that this improves likelihood estimation.

The task of log-likelihood estimation is directly motivated by applied problems such as data compression [31]. In that domain, arithmetic coding techniques can take a generative model and produce a compression algorithm that provably achieves a compression rate (in bits per dimension) equal to the model's log-likelihood [4]. Other applications of log-likelihood estimation include adversarial example detection [52], semi-supervised learning [5], and others. Note that our primary focus is density estimation and probabilistic modeling rather than sample quality. The visual appeal of generated images (as measured by, e.g., FID) correlates imperfectly with log-likelihood. We focus here on pushing the state-of-the-art in log-likelihood estimation; while we report FID for completeness, we defer sample quality optimization to future work.

3.2 A Forward Diffusion Process With Multivariate Adaptive Noise

Next, our plan is to define a family of approximate posteriors $\mathcal{Q}$, as well as a family of suitably matching reverse processes $\mathcal{P}$, such that the optimization problem $\max_{p \in \mathcal{P}, q \in \mathcal{Q}} \mathrm{ELBO}(p, q)$ is tractable and does not suffer from the aforementioned invariance to the choice of q. This subsection focuses on defining $\mathcal{Q}$; the next sections show how to parameterize and train a reverse model $p \in \mathcal{P}$.

Notation. Given two vectors a and b, we write $ab$ for the Hadamard (element-wise) product and $a / b$ for element-wise division. We write $\mathrm{diag}(\cdot)$ for the mapping that takes a vector as input and produces a diagonal matrix as output.

3.2.1 Multivariate Gaussian Noise Schedule

Intuitively, a multivariate noise schedule injects noise at different rates for each pixel of an input image. This enables adapting the diffusion process to spatial variations within the image. We will also see that this change is sufficient to make the ELBO no longer invariant to q.
Formally, we define a forward diffusion process with a multivariate noise schedule q via the marginal for each latent noise variable $x_t$, $t \in [0, 1]$:
$$q(x_t|x_0) = \mathcal{N}(x_t; \alpha_t x_0, \mathrm{diag}(\sigma_t^2)), \tag{2}$$
where $x_t, x_0 \in \mathbb{R}^d$, $\alpha_t, \sigma_t \in \mathbb{R}^d_+$, and d is the dimensionality of the input data. The $\alpha_t, \sigma_t$ denote varying amounts of signal and noise associated with each component (i.e., each pixel) of $x_0$ as a function of time t. We define the multivariate signal-to-noise ratio as $\nu(t) = \alpha_t^2/\sigma_t^2$ and choose $\alpha_t, \sigma_t$ so that ν(t) decreases monotonically in t along all dimensions and is differentiable in $t \in [0, 1]$. Let $\alpha_{t|s} = \alpha_t/\alpha_s$ and $\sigma^2_{t|s} = \sigma_t^2 - \alpha^2_{t|s}\sigma_s^2$, with all operations applied elementwise. These marginals induce transition kernels between steps s < t, whose reverse-time posterior is given by (Suppl. C):
$$q(x_s|x_t, x_0) = \mathcal{N}\!\left(x_s;\; \mu_q = \frac{\alpha_{t|s}\sigma_s^2}{\sigma_t^2}\, x_t + \frac{\sigma^2_{t|s}\alpha_s}{\sigma_t^2}\, x_0,\;\; \Sigma_q = \mathrm{diag}\!\left(\frac{\sigma_s^2\,\sigma^2_{t|s}}{\sigma_t^2}\right)\right) \tag{3}$$
In Sec. 3.5, we argue that this class of diffusion processes $\mathcal{Q}$ induces an ELBO that is not invariant to the choice of $q \in \mathcal{Q}$. The ELBO consists of a line integral along the diffusion trajectory specified by ν(t). A line integrand is almost always path-dependent, unless the integral corresponds to a conservative force field, which is rarely the case for a diffusion process [55]. See Sec. 3.5 for details.

3.2.2 Adaptive Noise Schedule Conditioned On Context

Next, we extend the diffusion process to support context-adaptive noise. This enables injecting noise in a way that depends on the features of an image. Formally, suppose we have access to a context variable $c \in \mathbb{R}^m$ which encapsulates high-level information about $x_0$. Examples of c could be a class label, a vector of attributes (e.g., features characterizing a human face), or even the input $x_0$ itself. We define the marginal of the latent $x_t$ in the forward process as $q(x_t|x_0, c) = \mathcal{N}(x_t; \alpha_t(c)\, x_0, \mathrm{diag}(\sigma_t^2(c)))$; the reverse-time posterior can be similarly derived (Suppl. D) as:
$$q(x_s|x_t, x_0, c) = \mathcal{N}\!\left(\mu_q = \frac{\alpha_{t|s}(c)\sigma_s^2(c)}{\sigma_t^2(c)}\, x_t + \frac{\sigma^2_{t|s}(c)\alpha_s(c)}{\sigma_t^2(c)}\, x_0,\;\; \Sigma_q = \mathrm{diag}\!\left(\frac{\sigma_s^2(c)\,\sigma^2_{t|s}(c)}{\sigma_t^2(c)}\right)\right) \tag{4}$$
where the diffusion parameters $\alpha_t, \sigma_t$ are now conditioned on c via a neural network. Specifically, we parameterize them as $\alpha_t^2(c) = \mathrm{sigmoid}(-\gamma_\phi(c, t))$, $\sigma_t^2(c) = \mathrm{sigmoid}(\gamma_\phi(c, t))$, and $\nu(c, t) = \exp(-\gamma_\phi(c, t))$. Here, $\gamma_\phi(c, t): \mathbb{R}^m \times [0, 1] \to [\gamma_{\min}, \gamma_{\max}]^d$ is a neural network with the property that $\gamma_\phi(c, t)$ is monotonic in t. Following Kingma et al. [20] and Zheng et al. [65], we set $\gamma_{\min} = -13.30$, $\gamma_{\max} = 5.0$. We explore various parameterizations for $\gamma_\phi(c, t)$; all of them are designed to guarantee $\gamma_\phi(c, 0) = \gamma_{\min}\mathbf{1}_d$ and $\gamma_\phi(c, 1) = \gamma_{\max}\mathbf{1}_d$. Below, we list these parameterizations. The polynomial parameterization is novel to our work and yields significant performance gains.

Monotonic Neural Network [20]. We use the monotonic neural network $\gamma_{\mathrm{vdm}}(t)$ proposed in VDM to express γ as a function of t such that $\gamma_{\mathrm{vdm}}(t): [0, 1] \to [\gamma_{\min}, \gamma_{\max}]^d$. We then condition on the context via FiLM conditioning [38] in the intermediate layers of this network; the activations of the FiLM layers are constrained to be positive.

Polynomial (Ours). We express $\gamma_\phi(c, t)$ as a monotonic degree-5 polynomial in t. Details about the exact functional form of this polynomial and its implementation can be found in Suppl. E.2.
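To illustrate the polynomial parameterization, here is a minimal sketch (not the authors' implementation; the exact functional form lives in Suppl. E.2, which we do not reproduce). It builds a per-pixel, monotone degree-5 polynomial in t by integrating the square of a context-dependent quadratic, then rescales it to hit the endpoints γ_min and γ_max; the `coef_net` producing the quadratic's coefficients from the context is a hypothetical stand-in.

```python
import torch
import torch.nn as nn

GAMMA_MIN, GAMMA_MAX = -13.3, 5.0

class PolynomialNoiseSchedule(nn.Module):
    """gamma(c, t): per-pixel monotone degree-5 polynomial in t.

    We integrate the square of a quadratic a*t^2 + b*t + k, which is >= 0, so its
    antiderivative F is non-decreasing (degree 5, with derivative zero at the two
    roots of the quadratic). Rescaling F(t)/F(1) to [GAMMA_MIN, GAMMA_MAX] enforces
    gamma(c, 0) = gamma_min and gamma(c, 1) = gamma_max.
    """
    def __init__(self, context_dim, data_dim):
        super().__init__()
        self.coef_net = nn.Linear(context_dim, 3 * data_dim)  # (a, b, k) per pixel

    @staticmethod
    def _antiderivative(t, a, b, k):
        # F(t) = int_0^t (a u^2 + b u + k)^2 du, expanded analytically.
        return (a**2 * t**5 / 5 + a * b * t**4 / 2
                + (2 * a * k + b**2) * t**3 / 3 + b * k * t**2 + k**2 * t)

    def forward(self, c, t):
        a, b, k = self.coef_net(c).chunk(3, dim=-1)            # each (batch, d)
        num = self._antiderivative(t, a, b, k)                 # F(t)
        den = self._antiderivative(torch.ones_like(t), a, b, k) + 1e-8  # F(1)
        return GAMMA_MIN + (GAMMA_MAX - GAMMA_MIN) * num / den

schedule = PolynomialNoiseSchedule(context_dim=50, data_dim=3 * 32 * 32)
c = torch.randn(4, 50)        # context / auxiliary latent
t = torch.rand(4, 1)          # one time per example
gamma = schedule(c, t)        # (4, 3072); monotone in t for every pixel
```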
3.3 Auxiliary-Variable Reverse Diffusion Processes

In principle, we can fit a normal diffusion model in conjunction with our proposed forward diffusion process. However, variational inference suggests that the variational and the true posterior ought to have the same dependency structure: that is the only way for the KL divergence between these two distributions to be zero. Thus, we introduce a class of approximate reverse processes $\mathcal{P}$ that match the structure of $\mathcal{Q}$ and are naturally suited to the joint optimization $\max_{p \in \mathcal{P}, q \in \mathcal{Q}} \mathrm{ELBO}(p, q)$.

Formally, we define a diffusion model whose reverse diffusion process is conditioned on the context c. Specifically, given any discretization of $t \in [0, 1]$ into T time steps as in Sec. 2, we introduce a context-conditional diffusion model $p_\theta(x_{0:1}|c)$ that factorizes as the Markov chain
$$p_\theta(x_{0:1}|c) = p_\theta(x_1|c)\prod_{i=1}^{T} p_\theta(x_{s(i)}|x_{t(i)}, c). \tag{5}$$
Given that the true reverse process is a Gaussian as specified in Eq. 4, the ideal $p_\theta$ matches this parameterization (the proof mirrors that of regular diffusion models; Suppl. D), which yields
$$p_\theta(x_s|c, x_t) = \mathcal{N}\!\left(\mu_p = \frac{\alpha_{t|s}(c)\sigma_s^2(c)}{\sigma_t^2(c)}\, x_t + \frac{\sigma^2_{t|s}(c)\alpha_s(c)}{\sigma_t^2(c)}\, x_\theta(x_t, t),\;\; \Sigma_p = \mathrm{diag}\!\left(\sigma_s^2(c)\,\sigma^2_{t|s}(c)/\sigma_t^2(c)\right)\right) \tag{6}$$
where $x_\theta(x_t, t)$ is a neural network that approximates $x_0$. Instead of parameterizing $x_\theta(x_t, t)$ directly with a neural network, we consider two other parameterizations. One is the noise parameterization [16], where the denoising model $\epsilon_\theta(x_t, c, t)$ is defined by $\epsilon_\theta(x_t, c, t) = (x_t - \alpha_t(c)\, x_\theta(x_t, c, t))/\sigma_t(c)$; see Suppl. E.1.1. The other is the v-parameterization [45], where $v_\theta(x_t, c, t)$ is a neural network that models $v_\theta(x_t, c, t) = (\alpha_t(c)\, x_t - x_\theta(x_t, c, t))/\sigma_t(c)$; see Suppl. E.1.2.
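A small sketch of how these parameterizations relate, assuming $\alpha_t(c)$ and $\sigma_t(c)$ are available (function and variable names are illustrative, not the paper's API): given a prediction in ε- or v-space, we can recover the implied $\hat{x}_0$ that enters the posterior mean of Eq. 6.

```python
def x0_from_eps(x_t, eps_hat, alpha_t, sigma_t):
    # Noise parameterization: eps = (x_t - alpha_t * x0) / sigma_t  =>  solve for x0.
    return (x_t - sigma_t * eps_hat) / alpha_t

def x0_from_v(x_t, v_hat, alpha_t, sigma_t):
    # v-parameterization: v = (alpha_t * x_t - x0) / sigma_t  =>  solve for x0.
    return alpha_t * x_t - sigma_t * v_hat

def reverse_mean(x_t, x0_hat, alpha_s, alpha_t, sigma2_s, sigma2_t):
    # Posterior mean of p_theta(x_s | x_t, c) from Eq. (6); all operations are elementwise.
    alpha_ts = alpha_t / alpha_s
    sigma2_ts = sigma2_t - alpha_ts**2 * sigma2_s
    return (alpha_ts * sigma2_s / sigma2_t) * x_t + (sigma2_ts * alpha_s / sigma2_t) * x0_hat
```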
3.3.1 Challenges in Conditioning on Context

Note that the model $p_\theta(x_{0:1}|c)$ implicitly assumes the availability of c at generation time. Sometimes this context may be available, such as when we condition on a label. We may then fit a conditional diffusion process with a standard diffusion objective $\mathbb{E}_{x_0, c}[\mathrm{ELBO}(x_0, p_\theta(x_{0:1}|c), q_\phi(x_{0:1}|c))]$, in which both the forward and the backward processes are conditioned on c (see Sec. 3.4). When c is not known at generation time, we may fit a model $p_\theta$ that does not condition on c. Unfortunately, this also forces us to define $p_\theta(x_s|x_t) = \mathcal{N}(\mu_p(x_t, t), \Sigma_p(x_t, t))$, where $\mu_p(x_t, t)$ and $\Sigma_p(x_t, t)$ are parameterized directly by a neural network. We can no longer use the noise parameterization $\epsilon_\theta(x_t, c, t) = (x_t - \alpha_t(c)\, x_\theta(x_t, c, t))/\sigma_t(c)$ because it requires computing $\alpha_t(c)$ and $\sigma_t(c)$, which we do not know. Since noise parameterization plays a key role in the sample quality of diffusion models [16], this approach limits performance.

3.3.2 Conditioning Noise on an Auxiliary Latent Variable

We propose an alternative strategy for learning conditional forward and reverse processes p, q that feature the same structure and hence support efficient noise parameterization. Our approach is based on the introduction of auxiliary variables [60], which lift the distribution $p_\theta$ into an augmented latent space. Experiments (Suppl. D.3) and theory (Suppl. D) confirm that this approach performs better than parameterizing c with a neural network $c_\theta(x_t, t)$. Specifically, we introduce an auxiliary latent variable $z \in \mathbb{R}^m$ and define a lifted $p_\theta(x, z) = p_\theta(x|z)\, p_\theta(z)$, where $p_\theta(x|z)$ is the conditional diffusion model from Eq. 5 (with context c set to z) and $p_\theta(z)$ is a simple prior (e.g., unit Gaussian or fully factored Bernoulli). The latents z can be interpreted as a high-level semantic representation of x that conditions both the forward and the reverse processes. Unlike $x_{0:1}$, the z are not constrained to have a particular dimension and can be a low-dimensional vector of latent factors of variation; they can be continuous or discrete.

The learning objective for the lifted $p_\theta$ is given by:
$$\log p_\theta(x_0) \geq \mathbb{E}_{q_\phi(z|x_0)}[\log p_\theta(x_0|z)] - D_{\mathrm{KL}}(q_\phi(z|x_0)\,\|\,p_\theta(z)) \tag{7}$$
$$\geq \mathbb{E}_{q_\phi(z|x_0)}\big[\mathrm{ELBO}(p_\theta(x_{0:1}|z), q_\phi(x_{0:1}|z))\big] - D_{\mathrm{KL}}(q_\phi(z|x_0)\,\|\,p_\theta(z)), \tag{8}$$
where $\mathrm{ELBO}(p_\theta(x_{0:1}|z), q_\phi(x_{0:1}|z))$ denotes the variational lower bound (VLB) of a diffusion model (defined in Eq. 1) with a forward process $q_\phi(x_{0:1}|z)$ (defined in Eq. 4 and Sec. 3.2.2) and an approximate reverse process $p_\theta(x_{0:1}|z)$ (defined in Eq. 5), both conditioned on z. The distribution $q_\phi(z|x_0)$ is an approximate posterior for z, parameterized by a neural network with parameters ϕ.

Crucially, note that in the learning objective (Eq. 8), the context, which in this case is z, is available at training time in both the forward and reverse processes. At generation time, we can still obtain a valid context vector by sampling an auxiliary latent from $p_\theta(z)$. Thus, this approach addresses the aforementioned challenges and enables us to use the noise parameterization in Eq. 6. Although we apply Jensen's inequality twice to obtain (8), the same construction also lets us learn the noise process: optimizing over a more expressive class of posteriors improves $\mathrm{ELBO}(p_\theta(x_{0:1}|z), q_\phi(x_{0:1}|z))$ and significantly offsets any potential loosening of the bound. This claim is empirically validated in Table 2.

3.4 Variational Lower Bound

Next, we derive a precise formula for the learning objective (8) of the auxiliary-variable diffusion model. Using the objective of a diffusion model in (1), we can write (8) as the sum of four terms:
$$-\log p_\theta(x_0) \leq \mathbb{E}_{q_\phi}[\mathcal{L}_{\mathrm{recons}} + \mathcal{L}_{\mathrm{diffusion}} + \mathcal{L}_{\mathrm{prior}} + \mathcal{L}_{\mathrm{latent}}]. \tag{9}$$
The reconstruction loss $\mathcal{L}_{\mathrm{recons}}$ can be (stochastically and differentiably) estimated using standard techniques; see [23]. $\mathcal{L}_{\mathrm{prior}} = D_{\mathrm{KL}}[q_\phi(x_1|x_0, z)\,\|\,p_\theta(x_1)]$ is the diffusion prior term, $\mathcal{L}_{\mathrm{latent}} = D_{\mathrm{KL}}[q_\phi(z|x_0)\,\|\,p_\theta(z)]$ is the latent prior term, and $\mathcal{L}_{\mathrm{diffusion}}$ is the diffusion loss term, which we examine below. The complete derivation is given in Suppl. E.3.

3.4.1 Diffusion Loss

Discrete-Time Diffusion. We start by defining $p_\theta$ in discrete time: as in Sec. 2, we let T > 0 be the number of time steps and define $t(i) = i/T$ and $s(i) = (i-1)/T$ as indexing variables over the time steps. We also use $x_{0:1}$ to denote the subset of variables associated with these timesteps. Starting with the expression in Eq. 1 and following the steps in Suppl. E, we can write $\mathcal{L}_{\mathrm{diffusion}}$ as:
$$\mathcal{L}_{\mathrm{diffusion}} = \sum_{i=2}^{T} D_{\mathrm{KL}}[q_\phi(x_{s(i)}|x_{t(i)}, x_0, z)\,\|\,p_\theta(x_{s(i)}|x_{t(i)}, z)] = \sum_{i=2}^{T} (\epsilon_t - \epsilon_\theta(x_t, z, t(i)))^{\top}\, \mathrm{diag}\big(\gamma(z, s(i)) - \gamma(z, t(i))\big)\,(\epsilon_t - \epsilon_\theta(x_t, z, t(i))) \tag{10}$$

Continuous-Time Diffusion. We can also consider the limit of the above objective under an infinitesimally fine partition of $t \in [0, 1]$, i.e., the limit $T \to \infty$. In Suppl. E we show that taking this limit of Eq. 10 yields the continuous-time diffusion loss:
$$\mathcal{L}_{\mathrm{diffusion}} = \frac{1}{2}\,\mathbb{E}_{t\sim\mathcal{U}[0,1]}\big[(\epsilon_t - \epsilon_\theta(x_t, z, t))^{\top}\,\mathrm{diag}(\nabla_t\gamma(z, t))\,(\epsilon_t - \epsilon_\theta(x_t, z, t))\big] \tag{11}$$
where $\nabla_t\gamma(z, t) \in \mathbb{R}^d$ denotes the Jacobian of $\gamma(z, t)$ with respect to the scalar t. We observe that the limit $T \to \infty$ yields improved performance, matching the existing theoretical argument by Kingma et al. [20].
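A sketch of a single-sample Monte Carlo estimate of the continuous-time loss in Eq. 11, assuming a schedule `gamma_fn(z, t)` (e.g., the polynomial sketch above) and a denoising network `eps_model`; both names are hypothetical, `x0` is flattened to shape (batch, d), and the per-pixel derivative $\partial_t\gamma$ is approximated with a finite difference rather than the exact Jacobian.

```python
import torch

def diffusion_loss(x0, z, gamma_fn, eps_model, dt=1e-4):
    """Monte Carlo estimate of Eq. (11) for a batch of flattened inputs x0: (b, d)."""
    b = x0.shape[0]
    t = torch.rand(b, 1)                                # t ~ U[0, 1]
    gamma_t = gamma_fn(z, t)                            # (b, d) per-pixel log noise-to-signal
    alpha_t = torch.sigmoid(-gamma_t).sqrt()
    sigma_t = torch.sigmoid(gamma_t).sqrt()

    eps = torch.randn_like(x0)
    x_t = alpha_t * x0 + sigma_t * eps                  # sample from q(x_t | x_0, z)

    # d gamma / dt per pixel via a forward difference (autograd would also work;
    # near t = 1 this steps slightly past 1, which is fine for a sketch).
    dgamma_dt = (gamma_fn(z, t + dt) - gamma_t) / dt

    err = eps - eps_model(x_t, z, t)                    # (eps_t - eps_theta)
    return 0.5 * (dgamma_dt * err * err).sum(dim=-1).mean()
```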
3.4.2 Auxiliary Latent Loss

We try two different kinds of priors for $p_\theta(z)$: discrete ($z \in \{0, 1\}^m$) and continuous ($z \in \mathbb{R}^m$).

Continuous Auxiliary Latents. When z is continuous, we select $p_\theta(z) = \mathcal{N}(0, I_m)$. This leads to the following KL loss term:
$$D_{\mathrm{KL}}(q_\phi(z|x_0)\,\|\,p_\theta(z)) = \tfrac{1}{2}\big(\mu^{\top}(x_0)\mu(x_0) + \mathrm{tr}(\Sigma^2(x_0) - I_m) - \log|\Sigma^2(x_0)|\big).$$

Discrete Auxiliary Latents. When z is discrete, we select $p_\theta(z)$ as a uniform distribution. Let $z \in \{0, 1\}^m$ be a k-hot vector sampled from a discrete exponential-family distribution $p_\theta(z; \theta)$ with logits θ. Niepert et al. [34] show that sampling $z \sim p_\theta(z; \theta)$ is equivalent to $z = \arg\max_{y \in \mathcal{Y}} \langle \theta + \epsilon_g, y\rangle$, where $\epsilon_g$ denotes a sum of Gamma-distributed perturbations (Suppl. F) and $\mathcal{Y}$ denotes the set of all k-hot vectors of some fixed length m. To differentiate through the arg max for k > 1, we use a relaxed estimator, Identity, as proposed by Sahoo et al. [44]. This leads to the following KL loss term:
$$D_{\mathrm{KL}}(q_\phi(z|x_0)\,\|\,p_\theta(z)) = \sum_{i=1}^{m} q_\phi(z|x_0)_i\big(\log q_\phi(z|x_0)_i + \log m\big).$$

3.5 The Variational Lower Bound as a Line Integral Over The Noise Schedule

Having defined our loss, we now return to the question of whether it is invariant to the choice of diffusion process. Notice that we may rewrite Eq. 11 in the following vectorized form:
$$\mathcal{L}_{\mathrm{diffusion}} = -\frac{1}{2}\int_{0}^{1} (x_0 - x_\theta(x_t, z, t))^2 \cdot \nabla_t\nu(z, t)\, dt \tag{12}$$
where the square is applied elementwise. We seek to rewrite (12) as a line integral $\int_a^b f(r(t))\cdot\frac{d}{dt}r(t)\,dt$ for some vector field f and trajectory r(t). Recall that ν(z, t) is monotonically decreasing in each coordinate as a function of t; hence it is invertible on its image, and we can write $t = \nu_z^{-1}(\nu(z, t))$ for some $\nu_z^{-1}$. Let $x_\theta(x_{\nu(z,t)}, z, \nu(z, t)) \equiv x_\theta(x_{\nu_z^{-1}(\nu(z,t))}, z, \nu_z^{-1}(\nu(z, t)))$, and note that for all t we can write $x_t$ as $x_{\nu(z,t)}$ (see Eq. 30), so that $x_\theta(x_{\nu(z,t)}, z, \nu(z, t)) \equiv x_\theta(x_t, z, t)$. We can then write the integral in (12) as $\int_0^1 (x_0 - x_\theta(x_{\nu(z,t)}, z, \nu(z, t)))^2 \cdot \frac{d}{dt}\nu(z, t)\,dt$, which is a line integral with $f(r(t)) \equiv (x_0 - x_\theta(x_{\nu(z,t)}, z, \nu(z, t)))^2$ and $r(t) \equiv \nu(z, t)$.

Intuitive explanation. Imagine piloting a plane across a region with cyclones and strong winds, as shown in Fig. 5. Plotting a direct, straight-line course through these adverse weather conditions requires more fuel and effort due to increased resistance. By navigating around the cyclones and winds, however, the plane reaches its destination with less energy, even if the route is longer. This intuition translates into mathematical and physical terms. The plane's trajectory is denoted by $r(t) \in \mathbb{R}^n_+$, while the forces acting on it are represented by $f(r(t)) \in \mathbb{R}^n$. The work required to navigate is $\int_0^1 f(r(t))\cdot\frac{d}{dt}r(t)\,dt$. Here, the work depends on the trajectory because f(r(t)) is not a conservative field.

This concept also applies to the diffusion NELBO. From Eq. 12, it is clear that the trajectory r(t) is parameterized by the noise schedule ν(z, t), which is influenced by complex forces f (analogous to weather patterns), represented by the dimension-wise reconstruction error of the denoising model, $(x_0 - x_\theta(x_t, z, t))^2$. Thus, the diffusion loss $\mathcal{L}_{\mathrm{diffusion}}$ can be interpreted as the work done along the trajectory ν(z, t) in the presence of the vector field f. By learning the noise schedule, we can avoid high-resistance paths (those where the loss accumulates rapidly), thereby minimizing the overall energy expended, as measured by the NELBO. Since the diffusion process corresponds to non-conservative force fields, as noted in Spinney & Ford [55], different noise schedules should yield different NELBOs, a result supported by our empirical findings. In Suppl. E.5, we show that variational diffusion models are limited to linear trajectories ν(t), rendering their objective invariant to the noise schedule. In contrast, our approach learns a multivariate ν, enabling paths that achieve a better ELBO.
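To make the path-dependence argument concrete, here is a small self-contained numerical check (unrelated to the model itself, names illustrative): for a non-conservative field, two trajectories with identical endpoints accumulate different amounts of "work", which is exactly why the objective in Eq. 12 can depend on the trajectory traced out by the noise schedule.

```python
import numpy as np

def line_integral(f, path, n=10_000):
    """Approximate  int_0^1 f(r(t)) . r'(t) dt  for a parametrized 2D path r(t)."""
    t = np.linspace(0.0, 1.0, n)
    r = np.stack([path(ti) for ti in t])        # (n, 2) points along the trajectory
    dr = np.gradient(r, t, axis=0)              # r'(t) by finite differences
    integrand = np.sum(f(r) * dr, axis=1)
    return np.trapz(integrand, t)

# A non-conservative field f(r) = (r_y, 0); both paths go from (0, 0) to (1, 1).
f = lambda r: np.stack([r[:, 1], np.zeros(len(r))], axis=1)
straight = lambda t: np.array([t, t])                                   # diagonal path
l_shaped = lambda t: np.array([min(2 * t, 1.0), max(2 * t - 1.0, 0.0)])  # right, then up

print(line_integral(f, straight))  # ~0.5
print(line_integral(f, l_shaped))  # ~0.0  -- same endpoints, different "work"
```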
4 Experiments

This section reports experiments on the CIFAR-10 [25] and ImageNet-32 [58] datasets. We do not employ data augmentation, and we use the same architecture and settings as the VDM model [20]. The encoder $q_\phi(z|x)$ is modeled using a sequence of 4 ResNet blocks, which is much smaller than the denoising network that uses 32 such blocks (i.e., we increase the parameter count by only about 10%); the noise schedule $\gamma_\phi$ is modeled using a two-layer MLP. In all our experiments, we use discrete auxiliary latents with m = 50 and k = 15. A detailed description can be found in Suppl. G.

4.1 Training Speed

In these experiments, we replace VDM's noise process with MULAN. On CIFAR-10, MULAN attains VDM's likelihood score of 2.65 in just 2M steps, compared to VDM's 10M steps (Table 1). When trained on 4 V100 GPUs, VDM achieves a training rate of 2.6 steps/second, while MULAN trains slightly slower at 2.24 steps/second due to the additional encoder network. Despite this slower training pace, VDM requires 30 days to reach a BPD of 2.65, whereas MULAN achieves the same BPD within a significantly shorter timeframe of 10 days. On ImageNet-32, VDM integrated with MULAN reaches a likelihood of 3.71 in half the time, achieving this score in 1M steps versus the 2M steps required by VDM.

4.2 Likelihood Estimation

In Table 2, we also compare MULAN with other recent methods on CIFAR-10 and ImageNet-32. MULAN was trained using v-parameterization for 8M steps on CIFAR-10 and 2M steps on ImageNet-32. During inference, we extract the underlying probability flow ODE and use it to estimate the log-likelihood; see Suppl. I.2. Our algorithm establishes a new state-of-the-art in density estimation on both ImageNet-32 and CIFAR-10. In Table 8, we also compute variational lower bounds (VLBs) of 2.59 and 3.71 on CIFAR-10 and ImageNet, respectively. Each bound improves over published results (Table 2); our true NLLs (via flow ODEs) are even lower.

Table 1: Likelihood in bits per dimension (BPD) based on the Variational Lower Bound (VLB) estimate (Suppl. I.1), sample quality (FID scores), and number of function evaluations (NFE) on CIFAR-10 and ImageNet, for vanilla VDM and VDM endowed with MULAN. FID and NFE were computed for 10k samples generated using an adaptive-step ODE solver. Both methods use noise parameterization (Suppl. E.1.1).

| Model | CIFAR-10 Steps | VLB (↓) | FID (↓) | NFE (↓) | ImageNet Steps | VLB (↓) | FID (↓) | NFE (↓) |
|---|---|---|---|---|---|---|---|---|
| VDM [20] | 10M | 2.65 | 23.91 | 56 | 2M | 3.72 | 14.26 | 56 |
| + MULAN | 2M | 2.65 | 18.54 | 55 | 1M | 3.72 | 15.00 | 62 |
| + MULAN | 10M | 2.60 | 17.62 | 50 | 2M | 3.71 | 13.19 | 62 |

Table 2: Likelihood in bits per dimension (BPD) on the test sets of CIFAR-10 and ImageNet. A "/" means the result is not reported in the original paper. Model types are autoregressive (AR), normalizing flows (Flow), and diffusion models (Diff). We only compare with results achieved without data augmentation.

| Model | Type | CIFAR-10 (↓) | ImageNet (↓) |
|---|---|---|---|
| PixelCNN [57] | AR | 3.03 | 3.83 |
| Image Transformer [35] | AR | 2.90 | 3.77 |
| DDPM [16] | Diff | 3.69 | / |
| ScoreFlow [54] | Diff | 2.83 | 3.76 |
| VDM [20] | Diff | 2.65 | 3.72 |
| Flow Matching [28] | Flow | 2.99 | / |
| Reflected Diffusion Models [30] | Diff | 2.68 | 3.74 |
| MULAN (Ours) | Diff | 2.55 ± 10⁻³ | 3.67 ± 10⁻³ |
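For reference, the bits-per-dimension numbers above are just the negative log-likelihood in nats divided by the number of data dimensions and by ln 2; a minimal helper (names and the example value are illustrative):

```python
import math

def bits_per_dim(nll_nats: float, num_dims: int) -> float:
    """Convert a total negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2))

# A 32x32x3 image has 3072 dimensions.
print(bits_per_dim(nll_nats=5430.0, num_dims=3 * 32 * 32))  # ~2.55 bpd
```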
4.3 Alternative Learned Diffusion Methods

Table 3: Likelihood in bits per dimension (BPD) on CIFAR-10 for learned diffusion methods.

| Model | NLL (↓) |
|---|---|
| DNF [64] | 3.04 |
| NDM [1] | 2.70 |
| DiffEnc [33] | 2.62 |
| MULAN | 2.55 |

Concurrent work that seeks to improve log-likelihood estimation by learning the forward diffusion process includes Neural Diffusion Models (NDMs) [1] and DiffEnc [33]. In NDMs, the noise schedule is fixed, but the mean of each marginal $q(x_t|x_0)$ is learned, while DiffEnc adds a correction term to q. Diffusion normalizing flows (DNFs) [64] represent an earlier effort in which q is a normalizing flow trained by backpropagating through sampling. In Table 3, we compare against NDMs, DiffEnc, and DNFs on CIFAR-10, using the authors' published results; note that their published ImageNet numbers are either not available or are reported on a different version of the dataset that is not comparable. Our approach to learned diffusion outperforms previous and concurrent work.

4.4 Ablation Analysis And Additional Experiments

Due to the high cost of training, we only performed ablation studies on CIFAR-10, with a reduced batch size of 64 and 2.5M training steps. In Fig. 2a we ablate each component of MULAN: when we remove the conditioning on an auxiliary latent space from MULAN, so that we have a multivariate noise schedule that is solely conditioned on time t, performance becomes comparable to that of VDM, on which our model is based. Changing to a scalar noise schedule based on the latent variable z initially underperforms VDM. This drop aligns with our likelihood formula (Eq. 8), which includes $D_{\mathrm{KL}}(q_\phi(z|x_0)\,\|\,p_\theta(z))$, an extra term not present in VDM. The input-conditioned scalar schedule does not offer any advantage over the scalar schedule used in VDM, for the reasons outlined in Sec. 3.5.

Figure 2: Ablating components of MULAN on CIFAR-10 over 2.5M steps with a batch size of 64; both panels plot test loss (bits/dim) against training iterations. (a) In MULAN w/o aux. latent, the noise is not conditioned on a latent; MULAN w/o multivariate uses a scalar noise schedule; MULAN w/o adaptivity has a linear schedule and no auxiliary latents. (b) MULAN with different noise schedule parameterizations: polynomial, monotonic neural network, and linear. Our proposed polynomial parameterization performs the best.

Perceptual Quality. While perceptual quality is not the focus of this work, we report FID numbers for the VDM model and MULAN (Table 1). We use the RK45 ODE solver to generate samples by solving the reverse-time flow ODE (Eq. 76). We observe that MULAN does not degrade FID while improving log-likelihood estimation. Note that MULAN does not incorporate many tricks that improve FID, such as exponential moving averages, truncation, or specialized learning schedules; our FID numbers can be improved in future work using these techniques.

Loss curves for different noise schedules. We investigate different parameterizations of the noise schedule in Fig. 2b. Among the polynomial, linear, and monotonic neural network parameterizations, we find that the polynomial parameterization yields the best performance. The polynomial noise schedule is a novel component introduced in our work. The reason a polynomial works better than a linear schedule or a monotonic neural network (as proposed by VDM) is rooted in Occam's razor.
In Suppl. E.2, we show that a degree-5 polynomial is the simplest polynomial that satisfies several desirable properties, including monotonicity and having a derivative that equals zero exactly twice. More expressive models (e.g., monotonic 3-layer MLPs) are more difficult to optimize.

Figure 3: Noise schedule visualizations for MULAN on CIFAR-10. We plot $\mathrm{var}_z\,\nu_\phi(z, t)$ as a function of t, for $z \sim p_\theta(z)$; each curve corresponds to the SNR of one input dimension (pixel).

Examining the noise schedule. Since the noise schedule $\gamma_\phi(z, t)$ is multivariate, we expect to learn different noise schedules for different input dimensions and different inputs $z \sim p_\theta(z)$. In Fig. 3, we take our best trained model on CIFAR-10 and visualize the variance of the noise schedule at each point in time for different pixels, where the variance is taken over 128 samples $z \sim p_\theta(z)$. We note increased variation in the early portions of the noise schedule. However, on an absolute scale, the variance of this noise is smaller than we expected. We also tried to visualize noise schedules across different dataset images and across different areas of the same image; refer to Fig. 13. We also generated synthetic datasets in which each datapoint contained only high frequencies or only low frequencies, and with random masking applied to parts of the data points; see Suppl. H. Surprisingly, none of these experiments revealed human-interpretable patterns in the learned schedule, although we did observe clear differences in likelihood estimation. We hypothesize that other architectures and other forms of conditioning may reveal interpretable patterns of variation; however, we leave this exploration to future work.

Replacing the noise schedule in a trained denoising model. We also confirm experimentally our claim that the learning objective is not invariant to the multivariate noise schedule. We replace the noise schedule in the trained denoising model with two alternatives: MULAN with a scalar noise schedule, and a linear noise schedule $\gamma_\phi(z, t) = \gamma_{\min} + t(\gamma_{\max} - \gamma_{\min})\mathbf{1}_d$; see Kingma et al. [20]. For both noise schedules, the likelihood degrades to the same value as that of VDM: 2.65.

5 Related Work

Diffusion models have emerged in recent years as powerful tools for modeling complex distributions [51, 16], extending flow-based methods [53, 24, 48, 49]. The noise schedule, which determines the amount and type of noise added at each step, plays a critical role in diffusion models. Chen [3] empirically demonstrates that different handcrafted noise schedules can significantly impact generated image quality. Kingma et al. [20] showed that the likelihood of a diffusion model remains invariant to the noise schedule when the schedule is scalar. In this work, we show that the ELBO is no longer invariant to multivariate noise schedules. Recent works have explored multivariate noise schedules (including blurring, masking, etc.) [17, 42, 36, 12], yet none have delved into learning the noise schedule conditioned on the input data itself. Likewise, conditional noise processes are typically not learned [26, 39, 62], and their conditioner (e.g., a prompt) is always available. Auxiliary-variable models [63, 60] add semantic latents in p, but not in q, and they do not condition on or learn q. In contrast, we learn multivariate noise conditioned on a latent context.
Diffusion normalizing flows (DNFs) [64] learn a q parameterized by a normalizing flow; however, such q do not admit tractable marginals and require sampling full data-to-noise trajectories from q, which is expensive. Concurrent work on neural diffusion models (NDMs) [1] and DiffEnc [33] admits tractable marginals q with learned means and univariate schedules; this yields a more expressive q than ours but requires computing losses in a modified space that precludes using a noise parameterization and certain sampling strategies. Empirically, MULAN performs better with fewer parameters (Suppl. A). Optimal transport techniques seek to learn a noise process that minimizes the transport cost from data to noise, which in practice produces smoother diffusion trajectories that facilitate sampling. Schrödinger bridges [47, 6, 59, 37] learn expressive q but do not admit analytical marginals, require computing full data-to-noise trajectories, and involve iterative optimization (e.g., Sinkhorn), which can be slow. Rectification [27] seeks diffusion paths that are close to linear; this improves sampling, while our method chooses paths that improve log-likelihood. See Suppl. A for more detailed comparisons.

6 Conclusion

We introduced MULAN, a context-adaptive noise process that applies Gaussian noise at varying rates across input data. Our theory challenges the prevailing notion that the likelihood of diffusion models is independent of the noise schedule: this independence only holds true for univariate schedules. Our evaluation of MULAN spans multiple image datasets, where it outperforms state-of-the-art generative diffusion models. We hope our work will motivate further research into the design of noise schedules, not only for improving likelihood estimation but also for improving image generation quality [35, 53]. A stronger fit to the data distribution also holds promise for improving downstream applications of generative modeling, e.g., compression or decision-making [32, 9, 8, 41].

Acknowledgments and Disclosure of Funding

This work was partially funded by Volodymyr Kuleshov's National Science Foundation awards DGE-1922551, CAREER 2046760, and 2145577, his National Institutes of Health award MIRA R35GM151243, and by Christopher De Sa's NSF RI-CAREER award 2046760.

References

[1] Grigory Bartosh, Dmitry Vetrov, and Christian A Naesseth. Neural diffusion models. arXiv preprint arXiv:2310.08337, 2023.
[2] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018.
[3] Ting Chen. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972, 2023.
[4] Thomas M Cover and Joy A Thomas. Data compression. Elements of Information Theory, pp. 103-158, 2005.
[5] Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Russ R Salakhutdinov. Good semi-supervised learning that requires a bad GAN. Advances in Neural Information Processing Systems, 30, 2017.
[6] Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion Schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695-17709, 2021.
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, 2009. doi: 10.1109/CVPR.2009.5206848.
[8] Shachi Deshpande and Volodymyr Kuleshov.
Calibrated uncertainty estimation improves bayesian optimization, 2023. [9] Shachi Deshpande, Kaiwen Wang, Dhruv Sreenivas, Zheng Li, and Volodymyr Kuleshov. Deep multi-modal structural equations for causal effect estimation with unstructured proxies. Advances in Neural Information Processing Systems, 35:10931 10944, 2022. [10] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780 8794, 2021. [11] J.R. Dormand and P.J. Prince. A family of embedded runge-kutta formulae. Journal of Computational and Applied Mathematics, 6(1):19 26, 1980. ISSN 0377-0427. doi: https://doi. org/10.1016/0771-050X(80)90013-3. URL https://www.sciencedirect.com/science/ article/pii/0771050X80900133. [12] Weitao Du, He Zhang, Tao Yang, and Yuanqi Du. A flexible diffusion model. In International Conference on Machine Learning, pp. 8678 8696. PMLR, 2023. [13] Pavlos S. Efraimidis and Paul G. Spirakis. Weighted random sampling with a reservoir. Information Processing Letters, 97(5):181 185, 2006. ISSN 0020-0190. doi: https://doi.org/ 10.1016/j.ipl.2005.11.003. URL https://www.sciencedirect.com/science/article/ pii/S002001900500298X. [14] Aaron Gokaslan, A Feder Cooper, Jasmine Collins, Landan Seguin, Austin Jacobson, Mihir Patel, Jonathan Frankle, Cory Stephenson, and Volodymyr Kuleshov. Commoncanvas: Open diffusion models trained on creative-commons images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8250 8260, 2024. [15] Will Grathwohl, Ricky TQ Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. Ffjord: Free-form continuous dynamics for scalable reversible generative models. ar Xiv preprint ar Xiv:1810.01367, 2018. [16] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840 6851, 2020. [17] Emiel Hoogeboom and Tim Salimans. Blurring diffusion models. ar Xiv preprint ar Xiv:2209.05557, 2022. [18] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. ar Xiv preprint ar Xiv:2301.11093, 2023. [19] Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059 1076, 1989. [20] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696 21707, 2021. [21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. [22] Diederik P Kingma and Ruiqi Gao. Understanding the diffusion objective as a weighted integral of elbos. ar Xiv preprint ar Xiv:2303.00848, 2023. [23] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. [24] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. Advances in neural information processing systems, 31, 2018. [25] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. [26] Sang-gil Lee, Heeseung Kim, Chaehun Shin, Xu Tan, Chang Liu, Qi Meng, Tao Qin, Wei Chen, Sungroh Yoon, and Tie-Yan Liu. Priorgrad: Improving conditional denoising diffusion models with data-dependent adaptive prior. ar Xiv preprint ar Xiv:2106.06406, 2021. [27] Sangyun Lee, Beomsu Kim, and Jong Chul Ye. 
Minimizing trajectory curvature of ode-based generative models. In International Conference on Machine Learning, pp. 18957 18973. PMLR, 2023. [28] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. ar Xiv preprint ar Xiv:2210.02747, 2022. [29] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ar Xiv preprint ar Xiv:1711.05101, 2017. [30] Aaron Lou and Stefano Ermon. Reflected diffusion models. In International Conference on Machine Learning, pp. 22675 22701. PMLR, 2023. [31] David JC Mac Kay. Information theory, inference and learning algorithms. Cambridge university press, 2003. [32] Tung Nguyen and Aditya Grover. Transformer neural processes: Uncertainty-aware meta learning via sequence modeling. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 16569 16594. PMLR, 2022. URL https://proceedings. mlr.press/v162/nguyen22b.html. [33] Beatrix MG Nielsen, Anders Christensen, Andrea Dittadi, and Ole Winther. Diffenc: Variational diffusion with a learned encoder. ar Xiv preprint ar Xiv:2310.19789, 2023. [34] Mathias Niepert, Pasquale Minervini, and Luca Franceschi. Implicit MLE: backpropagating through discrete exponential family distributions. In Marc Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pp. 14567 14579, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/ 7a430339c10c642c4b2251756fd1b484-Abstract.html. [35] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International conference on machine learning, pp. 4055 4064. PMLR, 2018. [36] Naama Pearl, Yaron Brodsky, Dana Berman, Assaf Zomet, Alex Rav Acha, Daniel Cohen-Or, and Dani Lischinski. Svnr: Spatially-variant noise removal with denoising diffusion. ar Xiv preprint ar Xiv:2306.16052, 2023. [37] Stefano Peluchetti. Diffusion bridge mixture transports, schrödinger bridge problems and generative modeling. Journal of Machine Learning Research, 24(374):1 51, 2023. [38] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018. [39] Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. Gradtts: A diffusion probabilistic model for text-to-speech. In International Conference on Machine Learning, pp. 8599 8608. PMLR, 2021. [40] Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [41] Richa Rastogi, Yair Schiff, Alon Hacohen, Zhaozhi Li, Ian Lee, Yuntian Deng, Mert R. Sabuncu, and Volodymyr Kuleshov. Semi-parametric inducing point networks and neural processes. In The Eleventh International Conference on Learning Representations, 2023. URL https: //openreview.net/forum?id=FE99-f Dr Wd5. [42] Severi Rissanen, Markus Heinonen, and Arno Solin. 
Generative modelling with inverse heat dissipation. ar Xiv preprint ar Xiv:2206.13397, 2022. [43] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021. [44] Subham Sekhar Sahoo, Anselm Paulus, Marin Vlastelica, Vít Musil, Volodymyr Kuleshov, and Georg Martius. Backpropagation through combinatorial algorithms: Identity with projection works. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=JZMR727O29. [45] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. ar Xiv preprint ar Xiv:2202.00512, 2022. [46] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. ar Xiv preprint ar Xiv:1701.05517, 2017. [47] Yuyang Shi, Valentin De Bortoli, Andrew Campbell, and Arnaud Doucet. Diffusion schrödinger bridge matching. Advances in Neural Information Processing Systems, 36, 2024. [48] Phillip Si, Allan Bishop, and Volodymyr Kuleshov. Autoregressive quantile flows for predictive uncertainty estimation. In International Conference on Learning Representations, 2022. [49] Phillip Si, Zeyi Chen, Subham Sekhar Sahoo, Yair Schiff, and Volodymyr Kuleshov. Semiautoregressive energy flows: exploring likelihood-free training of normalizing flows. In International Conference on Machine Learning, pp. 31732 31753. PMLR, 2023. [50] John Skilling. The eigenvalues of mega-dimensional matrices. 1989. URL https://api. semanticscholar.org/Corpus ID:117844915. [51] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics, 2015. [52] Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. ar Xiv preprint ar Xiv:1710.10766, 2017. [53] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. ar Xiv preprint ar Xiv:2011.13456, 2020. [54] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. Advances in neural information processing systems, 34: 1415 1428, 2021. [55] Richard E. Spinney and Ian J. Ford. Fluctuation relations: a pedagogical overview, 2012. [56] Benigno Uria, Iain Murray, and Hugo Larochelle. Rnade: The real-valued neural autoregressive density-estimator. Advances in Neural Information Processing Systems, 26, 2013. [57] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. Advances in neural information processing systems, 29, 2016. [58] Aäron Van Den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International conference on machine learning, pp. 1747 1756. PMLR, 2016. [59] Gefei Wang, Yuling Jiao, Qian Xu, Yang Wang, and Can Yang. Deep generative learning via schrödinger bridge. In International conference on machine learning, pp. 10794 10804. PMLR, 2021. [60] Yingheng Wang, Yair Schiff, Aaron Gokaslan, Weishen Pan, Fei Wang, Christopher De Sa, and Volodymyr Kuleshov. Infodiffusion: Representation learning using information maximizing diffusion models. In International Conference on Machine Learning, pp. 
xxxx xxxx. PMLR, 2023. [61] Sang Michael Xie and Stefano Ermon. Reparameterizable subset sampling via continuous relaxations. ar Xiv preprint ar Xiv:1901.10517, 2019. [62] Ling Yang, Zhilong Zhang, Zhaochen Yu, Jingwei Liu, Minkai Xu, Stefano Ermon, and Bin Cui. Cross-modal contextualized diffusion models for text-guided visual generation and editing. ar Xiv preprint ar Xiv:2402.16627, 2024. [63] Ruihan Yang and Stephan Mandt. Lossy image compression with conditional diffusion models, 2023. [64] Qinsheng Zhang and Yongxin Chen. Diffusion normalizing flow. Advances in Neural Information Processing Systems, 34:16280 16291, 2021. [65] Kaiwen Zheng, Cheng Lu, Jianfei Chen, and Jun Zhu. Improved techniques for maximum likelihood estimation for diffusion odes. ar Xiv preprint ar Xiv:2305.03935, 2023. 1 Introduction 1 2 Background 2 3 Diffusion Models With Multivariate Learned Adaptive Noise 3 3.1 Why Learned Diffusion? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3.2 A Forward Diffusion Process With Multivariate Adaptive Noise . . . . . . . . . . 3 3.3 Auxiliary-Variable Reverse Diffusion Processes . . . . . . . . . . . . . . . . . . . 4 3.4 Variational Lower Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.5 The Variational Lower Bound as a Line Integral Over The Noise Schedule . . . . . 6 4 Experiments 7 4.1 Training Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 4.2 Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 4.3 Alternative Learned Diffusion Methods . . . . . . . . . . . . . . . . . . . . . . . 8 4.4 Ablation Analysis And Additional Experiments . . . . . . . . . . . . . . . . . . . 8 5 Related Work 10 6 Conclusion 10 Appendices 16 Appendix A Comparing to Previous Work 16 A.1 Diffusion Models with Custom Noise . . . . . . . . . . . . . . . . . . . . . . . . 17 A.2 Advanced Diffusion Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A.3 Learned Diffusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 A.4 Optimal Transport . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Appendix B Standard Diffusion Models 20 B.1 Forward Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 B.2 Reverse Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 B.3 Variational Lower Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 B.4 Diffusion Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 Appendix C Multivariate noise schedule 22 C.1 Forward Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 C.2 Reverse Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 C.3 Diffusion Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 C.4 Vectorized Representation of the diffusion loss . . . . . . . . . . . . . . . . . . . . 24 C.5 Log likelihood and Noise Schedules: A Thermodynamics perspective . . . . . . . 24 Appendix D Multivariate noise schedule conditioned on context 24 D.1 context is available during the inference time. . . . . . . . . . . . . . . . . . . . . 25 D.2 context isn t available during the inference time. . . . . . . . . . . . . . . . . . . 25 D.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
28 Appendix E MULAN: MUltivariate Latent Auxiliary variable Noise Schedule 29 E.1 Parameterization in the reverse process . . . . . . . . . . . . . . . . . . . . . . . . 29 E.2 Polynomial Noise Schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 E.3 Variational Lower Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 E.4 Diffusion Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 E.5 Recovering VDM from the Vectorized Representation of the diffusion loss . . . . . 32 Appendix F Subset Sampling 33 Appendix G Experiment Details 34 G.1 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 G.2 Hardware. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 G.3 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 Appendix H Datasets and Visualizations 34 H.1 CIFAR-10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 H.2 Image Net-32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 H.3 Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 H.4 Frequency-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 H.5 CIFAR-10: Intensity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 H.6 Mask . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 Appendix I Likelihood Estimation 42 I.1 VLB Estimate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 I.2 Exact likelihood computation using Probability Flow ODE . . . . . . . . . . . . . 42 Appendix A Comparing to Previous Work MULAN is the first method to introduce a learned adaptive noise process. A widely held assumption is that the ELBO objective of a diffusion model is invariant to the noise process [20]. We dispel this assumption: we show that when input-conditioned noise is combined with (a) multivariate noise, (b) a novel polynomial parameterization, and (c) auxiliary variables, a learned noise process yields an improved variational posterior and a tighter ELBO. This approach sets a new state-of-the-art in density estimation. While (a), (c) were proposed in other contexts, we leverage them as subcomponents of a novel algorithm. We elaborate further on this below. A.1 Diffusion Models with Custom Noise The noise schedule, which determines the amount and type of noise added at each step, plays a critical role in diffusion models. Chen [3] empirically demonstrate that different noise schedules can significantly impact the generated image quality using various handcrafted noise schedules. Kingma et al. [20] showed that the likelihood of a diffusion model remains invariant to the noise schedule with a scalar noise schedule. In this work we show that the ELBO is no longer invariant to multivariate noise schedules. Recent works, including Hoogeboom & Salimans [17], Rissanen et al. [42], Pearl et al. [36], have explored per-pixel noise schedules (including blurring and other types of noising), yet none have delved into learning or conditioning the noise schedule on the input data itself. The shared components among these models are summarized and compared in Table 4. A.2 Advanced Diffusion Models Yang et al. [62] proposes noise processes that are conditioned on an external context (e.g., text). 
We also propose context-conditioned noise processes; however, their setting is that of conditional generation, where the context is always available at training and inference time and represents external data. Our paper instead considers unconditional generation: we condition the noising process on the image itself that we want to generate (via a latent variable) and learn how to apply noise across an image as a function of the image. Lee et al. [26] and Popov et al. [39] proposed using a data-dependent prior; however, they do not learn q, and their noise process is not adaptive to the input $x_0$. Thus they propose a fairly different set of methods from what we introduce. Yang & Mandt [63] and Wang et al. [60] have explored diffusion models with an auxiliary latent space, where the denoising network is conditioned on a latent distribution. Our paper also incorporates auxiliary latents, but unlike previous works, we add them to both p and q, and we focus on learning the process q (as opposed to doing representation learning in the auxiliary latent space). Lastly, our algorithm relies on many other components, including a custom noise schedule and multivariate noise. The shared components among these models are summarized and compared in Table 4.

Table 4: MULAN is a noise schedule that can be integrated into any diffusion model such as VDM [20] or InfoDiffusion [60]. The shared components between MULAN and these models are summarized and compared in this table.

| Method | Learned noise | Multivariate noise | Input-conditioned noise | Auxiliary latents | Noise schedule |
|---|---|---|---|---|---|
| VDM [20] | Yes | No | No | No | Monotonic neural network |
| Blurring Diffusion Model [17] | No | Yes | No | No | Frequency scaling |
| Inverse Heat Dissipation [42] | No | Yes | No | No | Exponential |
| SVNR [36] | No | Yes | No | No | Linear |
| InfoDiffusion [60] | No | No | No | In denoising process | Cosine |
| Lossy Compression [63] | No | No | No | In denoising process | Linear / Cosine |
| MULAN (Ours) | Yes | Yes | Yes | In noising and denoising process | Polynomial |

A.3 Learned Diffusion

Diffusion Normalizing Flow (DNF) uses the following forward process:
$$dx_t = f_\theta(x_t, t)\,dt + g(t)\,dw, \tag{13}$$
where the drift term $f_\theta: \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^d$ is parameterized by a neural network with parameters θ, the diffusion term $g(t) \in \mathbb{R}_+$ is a scalar, and w is standard Brownian motion. In MULAN, however, the forward process is given by
$$dx_t = f_\theta(z, t)\, x_t\, dt + g_\theta(z, t)\, dw; \quad z \sim q_\phi(z|x_0), \tag{14}$$
where $z \in \{0, 1\}^m$ is the auxiliary latent vector and $f_\theta: \mathbb{R}^m \times \mathbb{R} \to \mathbb{R}^d$, $g_\theta: \mathbb{R}^m \times \mathbb{R} \to \mathbb{R}^d$ are parameterized by neural networks (products taken elementwise). Notice that the drift term in DNF, $f_\theta(x, t)$, is a non-linear function of $x_0$, and the same holds for MULAN since, in the drift term $f_\theta(z, t)\,x$, both z and x depend on $x_0$. Additionally, the diffusion coefficient $g_\theta(z, t)$ is multivariate and conditioned on $x_0$ via z. The two parameterizations are different: on one hand, DNF admits more general classes of neural networks because it does not require tractable marginals; on the other hand, MULAN admits a more flexible noise model $g_\theta(z, t)$ and more efficient training (see Table 5 below). MULAN has the advantage of being simulation-free: given a datapoint $x_0$, the noisy sample $x_t$ can be computed in closed form, whereas in Diffusion Normalizing Flow, computing $x_t$ requires simulating the forward SDE, which is resource-intensive and limits scalability to larger denoising models. While MULAN optimizes the ELBO, DNF optimizes an approximation of the ELBO: in particular, the DNF training objective does not involve a term that accounts for the entropy of the encoder, making it closer in that regard to the objective of a standard autoencoder.
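To illustrate the "simulation-free" distinction, here is a sketch under stated assumptions (all callables such as `drift`, `diffusion`, and `gamma_fn` are hypothetical placeholders): a DNF-style forward process must be integrated step by step with Euler-Maruyama, whereas a MULAN-style forward marginal can be sampled in closed form at any t.

```python
import torch

def euler_maruyama_forward(x0, drift, diffusion, n_steps=1000):
    """DNF-style: simulate dx = f_theta(x, t) dt + g(t) dw step by step to reach x_1."""
    x, dt = x0.clone(), 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + drift(x, t) * dt + diffusion(t) * (dt ** 0.5) * torch.randn_like(x)
    return x

def closed_form_forward(x0, z, gamma_fn, t):
    """MULAN-style: x_t ~ q(x_t | x_0, z) is available analytically, no simulation needed."""
    gamma_t = gamma_fn(z, t)
    alpha_t, sigma_t = torch.sigmoid(-gamma_t).sqrt(), torch.sigmoid(gamma_t).sqrt()
    return alpha_t * x0 + sigma_t * torch.randn_like(x0)
```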
In particular, the DNF training objective does not involve a term that accounts for the entropy of the encoder; in that regard, the objective is closer to that of a standard auto-encoder.

Table 5: The key differences between MULAN and DNF are listed below.

Aspect | Property | Diffusion Normalizing Flow | MULAN
Drift Term | Multivariate | Yes | Yes
 | Adaptive | Yes | Yes
 | Learnable | Yes | Yes
Diffusion Term | Multivariate | No | Yes
 | Adaptive | No | Yes
 | Learnable | No | Yes
Simulation-Free Training | | No | Yes
Exact ELBO Optimization | | No | Yes
NLL (↓) | CIFAR-10 | 3.04 | 2.55

Other concurrent work seeks to improve log-likelihood estimation by learning the forward diffusion process in a simulation-free setting. In neural diffusion models (NDMs), the noise schedule is fixed, but the mean of each marginal q(x_t|x_0) is learned. This requires denoising x in a transformed space, which prevents the use of noise parameterization, a design choice that is important for performance. Their denoising family also induces a parameterization that limits the kinds of samplers that can be used. Lastly, NDMs use a model that is 2x larger than a regular diffusion model, while ours only adds 10% more parameters. The DiffEnc framework adds an extra learned correction term to q to adjust the mean of each marginal q(x_t|x_0). This choice also requires certain parameterizations for x that are not compatible with noise parameterization; while their approach supports v-parameterization, it also requires training a mean-parameterization network. As in NDMs, the noise schedule remains fixed, while the mean of each marginal is adjusted by the network. Our approach of learning the noise schedule yields better empirical performance and is, in our opinion, simpler; it can also be combined with this prior work on learning the marginals' means.

A.4 Optimal Transport

In techniques based on optimal transport, the goal is to learn a noise process that minimizes the transport cost from data to noise, which in practice produces smoother diffusion trajectories that facilitate sampling. In Minimizing Trajectory Curvature (MTC) of ODE-based generative models (Lee et al. [27]), the primary goal is to design a forward diffusion process that is optimal for fast sampling, whereas MULAN strives to learn a forward process that optimizes for log-likelihood. In the former, the marginals x_t of the forward process are given as
x_t = (1 − t) x_0 + t z;  z ∼ q_ϕ(z|x_0), (15)
where x_t, x_0, z ∈ R^d. For MULAN the marginals are
x_t = α_ϕ(z, t) ⊙ x_0 + √(1 − α_ϕ²(z, t)) ⊙ ε;  ε ∼ N(0, I_d),  z ∼ q_ϕ(z|x_0),
where α_ϕ(z, t) ∈ R^d_{≥0}, z ∈ {0, 1}^m, and ε ∈ R^d. Notice that in the MTC formula the coefficient of x_0 (the time integral of the drift term) is a scalar, linear function of t and is independent of the input x_0. In MULAN, that coefficient is a multivariate, non-linear function of t, conditioned on x_0 via the auxiliary latent variable z. This implies that the forward diffusion process in MULAN is more expressive than that of MTC. The simplistic forward process in MTC enables faster sampling, whereas the richer, more expressive forward process in MULAN leads to improved likelihood estimates. Table 6 summarizes the key differences.

Table 6: Comparison between Minimizing Trajectory Curvature and MULAN.
Aspect | Property | Minimizing Trajectory Curvature | MULAN
Goal | | Design faster sampler | Improve log-likelihood
Drift Term | Learnable | No | Yes
 | Linearity | Linear in time t, linear in x_0 | Non-linear in time t, non-linear in z (and hence x_0)
 | Dimensionality | Scalar | Multivariate
 | Adaptive | No | Yes
Diffusion Term | Linearity | Linear in time t | Non-linear in time t
 | Dimensionality | Multivariate | Multivariate
 | Learnable | Yes | Yes
 | Adaptive | Yes | Yes

An alternative approach to learning a forward process that performs optimal transport is via the theory of Schrödinger bridges [47, 6, 59, 37]. Similarly to the DNF framework, these methods do not admit analytical marginals and therefore involve computing full trajectories from noisy and clean data. Additionally, they are typically trained using an iterative procedure that generalizes the Sinkhorn algorithm and involves alternately training q and p. As such, these methods are typically more expensive to train, and competitive results on standard benchmarks (e.g., CIFAR-10, ImageNet) are not yet available to our knowledge.

Appendix B Standard Diffusion Models

We have a Gaussian diffusion process that begins with the data x_0 and defines a sequence of increasingly noisy versions of x_0, which we call the latent variables x_t, where t runs from t = 0 (least noisy) to t = 1 (most noisy). Given T, we discretize time uniformly into T timesteps, each of width 1/T, and define t(i) = i/T and s(i) = (i − 1)/T.

B.1 Forward Process

q(x_t|x_s) = N(α_{t|s} x_s, σ²_{t|s} I_n), (16)
where
α_{t|s} = α_t / α_s, (17)
σ²_{t|s} = σ²_t − α²_{t|s} σ²_s. (18)

B.2 Reverse Process

Kingma et al. [20] show that the distribution q(x_s|x_t, x_0) is also Gaussian:
q(x_s|x_t, x_0) = N( μ_q = (α_{t|s} σ²_s / σ²_t) x_t + (σ²_{t|s} α_s / σ²_t) x_0,  Σ_q = (σ²_s σ²_{t|s} / σ²_t) I_n ). (19)
Since we do not have access to x_0 during the reverse process, we approximate it using a neural network x_θ(x_t, t) with parameters θ. Thus,
p_θ(x_s|x_t) = N( μ_p = (α_{t|s} σ²_s / σ²_t) x_t + (σ²_{t|s} α_s / σ²_t) x_θ(x_t, t),  Σ_p = (σ²_s σ²_{t|s} / σ²_t) I_n ). (20)

B.3 Variational Lower Bound

The corruption process q is the Markov chain q(x_{0:1}) = q(x_0) ∏_{i=1}^T q(x_{t(i)}|x_{s(i)}). In the reverse (denoising) process p_θ, a neural network (with parameters θ) is used to denoise the noising process q. The reverse process factorizes as p_θ(x_{0:1}) = p_θ(x_1) ∏_{i=1}^T p_θ(x_{s(i)}|x_{t(i)}). Let x_θ(x_t, t) denote the input reconstructed by a neural network from x_t. Similar to Sohl-Dickstein et al. [51] and Kingma et al. [20], we decompose the negative variational lower bound (VLB) as:
−log p_θ(x_0) ≤ −E_q[ log ( p_θ(x_{t(0):t(T)}) / q(x_{t(1):t(T)}|x_0) ) ] = −ELBO(p_θ(x_{0:1}), q(x_{0:1}))  (cf. Eq. 1)
= E_{x_{t(1)} ∼ q(x_{t(1)}|x_0)}[−log p_θ(x_0|x_{t(1)})] + ∑_{i=2}^{T} E_{x_{t(i)} ∼ q(x_{t(i)}|x_0)} D_KL[p_θ(x_{s(i)}|x_{t(i)}) ‖ q(x_{s(i)}|x_{t(i)}, x_0)] + D_KL[p_θ(x_1) ‖ q(x_1|x_0)]
≜ L_recons + L_diffusion + L_prior,
where the sum defining L_diffusion can equivalently be written as an expectation over ε ∼ N(0, I_n) and i ∼ U{2, T}. The prior loss L_prior and the reconstruction loss L_recons can be (stochastically and differentiably) estimated using standard techniques; see Kingma & Welling [23]. The diffusion loss L_diffusion varies with the formulation of the noise schedule; we provide an exact formulation for it in the subsequent sections.

B.4 Diffusion Loss

For brevity, we use the notation s for s(i) and t for t(i). From Eq. 19 and Eq. 20 we get the following expression:
D_KL(q(x_s|x_t, x_0) ‖ p_θ(x_s|x_t)) = ½ [ (μ_q − μ_p)^T Σ_p^{-1} (μ_q − μ_p) + tr(Σ_q Σ_p^{-1} − I_n) − log|Σ_q Σ_p^{-1}| ] = ½ (μ_q − μ_p)^T Σ_p^{-1} (μ_q − μ_p),
since Σ_q = Σ_p. Substituting μ_q, Σ_q, μ_p, Σ_p from Eq. 19 and Eq. 20; for the exact derivation see Kingma et al.
[20] 2 (ν(s) ν(t)) (x0 xθ(xt, t)) 2 2 (22) Thus Ldiffusion is given by 2 Eϵ N(0,In),i U{2,T }DKL[pθ(xs(i)|xt(i)) qϕ(xs(i)|xt(i), x0)] = lim T 1 2 i=2 Eϵ N(0,In) (ν(s) ν(t)) x0 xθ(xt, t) 2 2 2Eϵ N(0,In) i=2 (ν(s) ν(t)) x0 xθ(xt, t) 2 2 2Eϵ N(0,In) i=2 T (ν(s) ν(t)) x0 xθ(xt, t) 2 2 1 T Substituting lim T T(ν(s) ν(t)) = d dtν(t) ν (t); see Kingma et al. [20] 2Eϵ N(0,In) 0 ν (t) x0 xθ(xt, t) 2 2 In practice instead of computing the integral is computed by MC sampling. 2Eϵ N(0,In),t U[0,1] ν (t) x0 xθ(xt, t) 2 2 (24) Appendix C Multivariate noise schedule For a multivariate noise schedule we have αt, σt Rd + where t [0, 1]. αt, σt are vectors. The timesteps s, t satisfy 0 s < t 1. Furthermore, we use the following notations where arithmetic division represents element wise division between 2 vectors: σ2 t|s = σ2 t α2 t|s σ2s (26) C.1 Forward Process q(xt|xs) = N αt|sxs, σ2 t|s (27) Change of variables. We can write xt explicitly in terms of the signal-to-noise ratio, ν(t), and input x0 in the following manner: νt = α2 t σ2 t We know α2 t = 1 σ2 t for Variance Preserving process; see Sec. 2. = 1 σ2 t σ2 t = νt = σ2 t = 1 1 + νt and α2 t = νt 1 + νt (28) νt = α2 t σ2 t We know α2 t = 1 σ2 t for Variance Preserving process; see Sec. 2. = 1 σ2 t σ2 t = νt = σ2 t = 1 1 + νt and α2 t = νt 1 + νt (29) Thus, we write xt in terms of the signal-to-noise ratio in the following manner: xν(t) = αtx0 + σtϵt; ϵt N(0, In) 1 + ν(t) x0 + 1 p 1 + ν(t) ϵt Using Eq. 28 (30) C.2 Reverse Process The distribution of xt given xs is given by: q(xs|xt, x0) = N µq = αt|sσ2 s σ2 t xt + σ2 t|sαs σ2 t x0, Σq = diag σ2 sσ2 t|s σ2 t Let xθ(xt, t) be the neural network approximation for x0. Then we get the following reverse process: pθ(xs|xt) = N µp = αt|sσ2 s σ2 t xt + σ2 t|sαs σ2 t xθ(xt, t), Σp = diag σ2 sσ2 t|s σ2 t C.3 Diffusion Loss For brevity we use the notation s for s(i) and t for t(i). From Eq. 31 and Eq. 32 we get the following expression for q(xs|xt, x0): DKL(q(xs|xt, x0) pθ(xs|xt)) (µq µp) Σ 1 θ (µq µp) + tr ΣqΣ 1 p In log |Σq| 2(µq µp) Σ 1 θ (µq µp) Substituting µq, µp, Σp from equation 32 and equation 31. σ2 t x0 σ2 t|sαs σ2 t xθ(xt, t) σ2 sσ2 t|s σ2 t ! 1 σ2 t|sαs σ2 t x0 σ2 t|sαs σ2 t xθ(xt, t) 2(x0 xθ(xt, t)) diag σ2 sσ2 t|s σ2 t (x0 xθ(xt, t)) 2(x0 xθ(xt, t)) diag σ2 t σ2 t σ2sσ2 t|s σ2 t|sαs (x0 xθ(xt, t)) 2(x0 xθ(xt, t)) diag σ2 t|sα2 s σ2 t σ2s (x0 xθ(xt, t)) Simplifying the expression using Eq. 25 and Eq. 26 we get, 2(x0 xθ(xt, t)) diag α2 s σ2s α2 t σ2 t (x0 xθ(xt, t)) Using the relation ν(t) = α2 t/σ2 t we get, 2(x0 xθ(xt, t)) diag (ν(s) ν(t)) (x0 xθ(xt, t)) (33) Like Kingma et al. [20] we train the model in the continuous domain with T . Ldiffusion = lim T 1 2 i=2 Eϵ N(0,In)DKL(q(xs(i)|xt(i), x0) pθ(xs(i)|xt(i))) = lim T 1 2 i=2 Eϵ N(0,In)(x0 xθ(xt(i), t(i))) diag νs(i) νt(i) (x0 xθ(xt(i), t)) 2Eϵ N(0,In) i=2 (x0 xθ(xt(i), t(i))) diag νs(i) νt(i) (x0 xθ(xt(i), t)) 2Eϵ N(0,In) i=2 T(x0 xθ(xt(i), t(i))) diag νs(i) νt(i) (x0 xθ(xt(i), t)) 1 Let lim T T(νs(i) νt(i)) = d dtν(t) denote the scalar derivative of the vector ν(t) w.r.t t 2Eϵ N(0,In) 0 (x0 xθ(xt, t)) diag d dtν(t) (x0 xθ(xt, t))dt (34) In practice instead of computing the integral is computed by MC sampling. 2Eϵ N(0,In),t U[0,1] (x0 xθ(xt, t)) diag d dtν(t) (x0 xθ(xt, t)) (35) C.4 Vectorized Representation of the diffusion loss Let ν(t) be the vectorized representation of the diagonal entries of the matrix ν(t). 
We can rewrite the integral in 34 in the following vectorized form where denotes element wise multiplication and , denotes dot product between 2 vectors. 0 (x0 xθ(xt, t)) diag d dtν(t) (x0 xθ(xt, t))dt (x0 xθ(xt, t)) (x0 xθ(xt, t)), d Using change of variables as mentioned in Sec. 3.2 we have (x0 xθ(xν(t), ν(t))) (x0 xθ(xν(t), ν(t))), d Let fθ(x0, ν(t)) = (x0 xθ(xν(t), ν(t))) (x0 xθ(xν(t), ν(t))) fθ(x0, ν(t)), d dtν(t) dt (36) Thus Ldiffusion can be interpreted as the amount of work done along the trajectory ν(0) ν(1) in the presence of a vector field fθ(x0, ν(z, t)). From the perspective of thermodynamics, this is precisely equal to the amount of heat lost into the environment during the process of transition between 2 equilibria via the noise schedule specified by ν(t). C.5 Log likelihood and Noise Schedules: A Thermodynamics perspective A diffusion model characterizes a quasi-static process that occurs between two equilibrium distributions: q(x0) q(x1), via a stochastic trajectory [51]. According to Spinney & Ford [55], it is demonstrated that the diffusion schedule or the noising process plays a pivotal role in determining the "measure of irreversibility" for this stochastic trajectory which is expressed as log PF (x0:1) PB(x1:0). PF (x0:1) represents the probability of observing the forward path x0:1 and PB(x1:0) represents the probability of observing the reverse path x1:0. It s worth noting that log PF (x0:1) PB(x1:0) corresponds precisely to the ELBO Eq. 1 that we optimize when training a diffusion model. Consequently, thermodynamics asserts that the noise schedule indeed has an impact on the log-likelihood of the diffusion model which contradicts Kingma et al. [20]. Appendix D Multivariate noise schedule conditioned on context Let s say we have a context variable c Rm that captures high level information about x0. αt(c), σt(c) Rd + are vectors. The timesteps s, t satisfy 0 s < t 1. Furthermore, we use the following notations: αt|s(c) = αt(c) σ2 t|s(c) = σ2 t (c) α2 t|s(c) σ2s(c) (38) The forward process for such a method is given as: qϕ(xt|xs, c) = N αt|s(c)xs, σ2 t|s(c) (39) The distribution of xt given xs is given by (the derivation is similar to Hoogeboom & Salimans [17]): qϕ(xs|xt, x0, c) µq = αt|s(c)σ2 s(c) σ2 t (c) xt + σ2 t|s(c)αs(c) σ2 t (c) x0, Σq = diag σ2 s(c)σ2 t|s(c) D.1 context is available during the inference time. Even though c represents the input x0, it could be available during during inference. For example c could be class labels [10] or prexisting embeddings from an auto-encoder [40]. D.1.1 Reverse Process: Approximate Let xθ(xt, c, t) be an approximation for x0. Then we get the following reverse process (for brevity we write xθ(xt, c, t) as xθ): pθ(xs|xt, c) = N µp = αt|s(c)σ2 s(c) σ2 t (c) xt + σ2 t|s(c)αs(c) σ2 t (c) xθ, Σp = diag σ2 s(c)σ2 t|s(c) D.1.2 Diffusion Loss Similar to the derivation of multi-variate Ldiffusion in Eq. 33 we can derive Ldiffusion for this case too: Ldiffusion = 1 2Eϵ N(0,In),t U[0,1] (x0 xθ(xt, c, t)) diag d dtν(t) (x0 xθ(xt, c, t)) D.1.3 Limitations of this method This approach is very limited where the diffusion process is only conditioned on class labels. Using pre-existing embeddings like Diff-AE [40] is also not possible in general and is only limited to tasks such as attribute manipulation in datasets. D.2 context isn t available during the inference time. If the context, c is an explicit function of the input x0 things become challenging because x0 isn t available during the inference stage. For this reason, Eq. 
40 can t be used to parameterize µp, Σp in pθ(xs|xt). Let pθ(xs|xt) = N(µp(xt, t), Σp(xt, t)) where µp, Σp are parameterized directly by a neural network. Using Eq. 4 we get the following diffusion loss: Ldiffusion = T Ei U[0,T ]DKL q(xs(i)|xt(i), x0) pθ(xs(i)|xt(i)) 2 (µq µp) Σ 1 θ (µq µp) | {z } term 1 tr ΣqΣ 1 p In log |Σq| | {z } term 2 D.2.1 Reverse Process: Approximate Due to the challenges associated with parameterizing µp, Σp directly using a neural network we parameterize c using a neural network that approximates c in the reverse process. Let xθ(xt, t) be an approximation for x0. Then we get the following reverse Rrocess (for brevity we write xθ(xt, t) as xθ, and cθ denotes an approximation to c in the reverse process.): µp = αt|s(cθ)σ2 s(cθ) σ2 t (cθ) xt + σ2 t|s(cθ)αs(cθ) σ2 t (cθ) xθ, Σp = diag σ2 s(cθ)σ2 t|s(cθ) Consider the limiting case where T . Let s analyze the 2 terms in Eq. 43 separately. Using Eq. 4 and Eq. 6, term 1 in Eq. 43 simplifies in the following manner: 2 (µq µp) Σ 1 θ (µq µp) ((µq)i (µp)i)2 Substituting 1 / T as δ δσi2(xθ, t δ) 1 νi(xθ,t) νi(xθ,t δ) " αi(x, t δ) αi(x, t) νi(x, t) νi(x, t δ)zt + αi(x, t δ) 1 νi(x, t) νi(x, t δ) αi(xθ, t δ) αi(xθ, t) νi(xθ, t) νi(xθ, t δ)zt + αi(xθ, t δ) 1 νi(xθ, t) νi(xθ, t δ) Consider the scalar case: substituting δ = 1/T, δσ2(xθ, t δ) 1 ν(xθ,t) ν(xθ,t δ) " α(x, t δ) α(x, t) ν(x, t) ν(x, t δ)zt + α(x, t δ) 1 ν(x, t) ν(x, t δ) α(xθ, t) ν(xθ, t) ν(xθ, t δ)zt + α(xθ, t δ) 1 ν(xθ, t) ν(xθ, t δ) Notice that this equation is in indeterminate for when we substitute δ = 0. One can apply L Hospital rule twice or break it down into 3 terms below. For this reason let s write it as expression 1: lim δ 0 1 δ " α(x, t δ) α(x, t) ν(x, t) ν(x, t δ)zt + α(x, t δ) 1 ν(x, t) ν(x, t δ) α(xθ, t) ν(xθ, t) ν(xθ, t δ)zt + α(xθ, t δ) 1 ν(xθ, t) ν(xθ, t δ) expression 2: lim δ 0 1 1 ν(xθ,t) ν(xθ,t δ) " α(x, t δ) α(x, t) ν(x, t) ν(x, t δ)zt + α(x, t δ) 1 ν(x, t) ν(x, t δ) α(xθ, t) ν(xθ, t) ν(xθ, t δ)zt + α(xθ, t δ) 1 ν(xθ, t) ν(xθ, t δ) Applying L Hospital rule in expression 1 we get, α(x, t) ν(x, t) ν(x, t δ) δ=0 = ν(x, t) α(x, t) ν(x, t)α (x, t) + α(x, t)ν (x, t) α(x, t) + ν (x, t) ν(x, t) (50) d dδ α(x, t δ) 1 ν(x, t) ν(x, t δ) δ=0 = α(x, t)ν (x, t) ν(x, t) (51) α(x, t) + ν (x, t) ν(x, t) + α (xθ, t) α(xθ, t) ν (xθ, t) α(x, t)ν (x, t) ν(x, t) x + α(xθ, t)ν (xθ, t) ν(xθ, t) xθ ν (x, t) (53) Thus the final result: " αi (x, t) αi(x, t) + νi (x, t) νi(x, t) + αi (xθ, t) αi(xθ, t) νi (xθ, t) αi(x, t)νi (x, t) νi(x, t) x + αi(xθ, t)νi (xθ, t) νi(xθ, t) xθ = Λ diag ν(x, t) α(x, t) + ν (x, t) ν(x, t) + α (xθ, t) α(xθ, t) ν (xθ, t) zt α(x, t) ν (x, t) ν(x, t) x + α(xθ, t) ν (xθ, t) ν(xθ, t) xθ For the second term we have the following: tr ΣqΣ 1 p In log |Σq| tr diag σ2(c, s) 1 ν(c, t) diag σ2(cθ, s) 1 ν(cθ, t) diag σ2(c, s)(1 ν(c,t) ν(c,s)) diag σ2(cθ, s)(1 ν(cθ,t) σi2(c, s) 1 νi(c,t) σi2(cθ, s) 1 νi(cθ,t) νi(cθ,s) 1 log σi2(c, s) 1 νi(c,t) σi2(cθ, s) 1 νi(cθ,t) Let pi = σi 2(c,s) 1 νi(c,t) σi2(cθ,s) 1 νi(cθ,t) νi(cθ,s) The sequence lim T T 2 Pd i=1(pi 1 log pi) converges iff lim T Pd i=1(pi 1 log pi) = 0. Notice that the function f(x) = x 1 log x 0 x R and the equality holds for x = 1. Thus, the condition lim T T 2 Pd i=1(pi 1 log pi) holds iff lim T pi = 0 i {1, . . . , d}. 
Thus, we require lim_{T→∞} p_i = 1, i.e.,
lim_{T→∞} [ σ_i²(c, s) (1 − ν_i(c, t)/ν_i(c, s)) ] / [ σ_i²(c_θ, s) (1 − ν_i(c_θ, t)/ν_i(c_θ, s)) ] = 1.
Substituting δ for 1/T (so that s = t − δ),
lim_{δ→0⁺} [ σ_i²(c, t−δ) (1 − ν_i(c, t)/ν_i(c, t−δ)) ] / [ σ_i²(c_θ, t−δ) (1 − ν_i(c_θ, t)/ν_i(c_θ, t−δ)) ]
= [ σ_i²(c, t) / σ_i²(c_θ, t) ] · lim_{δ→0⁺} [ 1 − ν_i(c, t)/ν_i(c, t−δ) ] / [ 1 − ν_i(c_θ, t)/ν_i(c_θ, t−δ) ].
Applying L'Hôpital's rule,
= [ σ_i²(c, t) / σ_i²(c_θ, t) ] · [ ν_i′(c, t)/ν_i(c, t) ] / [ ν_i′(c_θ, t)/ν_i(c_θ, t) ]
= [ σ_i²(c, t) / σ_i²(c_θ, t) ] · [ ν_i′(c, t) ν_i(c_θ, t) ] / [ ν_i(c, t) ν_i′(c_θ, t) ].
In vector form, the required condition can be written as
( σ²_t(c) ⊙ ν_t(c_θ) ⊙ ∂_t ν(c, t) ) / ( σ²_t(c_θ) ⊙ ν_t(c) ⊙ ∂_t ν(c_θ, t) ) = 1_d. (58)
Eq. 58 holds if:
• x_θ = x_0, i.e. the UNet can perfectly map x_t to x_0 for all t ∈ [0, 1], which is unrealistic; or
• clever parameterizations for σ, α, ν are chosen that ensure Eq. 58 holds.
Because of the aforementioned challenges, we evaluate this method with finite T = 1000. We demonstrate the performance of the model empirically in Fig. 4.

D.2.2 Recovering VDM

If we substitute ν_t(c) and ν_t(c_θ) with ν(t) (since the SNR is no longer conditioned on the context c), σ_t(c_θ) and σ_t(c) with σ_t, and α_t(c_θ) and α_t(c) with α_t, then Eq. 45 reduces to the intermediate loss in VDM, i.e. ½ (x_θ − x_0)^T ∂_t ν(t) (x_θ − x_0), and Eq. 55 reduces to 0.

D.3 Experimental results

In Fig. 4 we demonstrate that the multivariate diffusion processes with c = class labels or c = x_0 perform worse than VDM. Since a continuous-time formulation (T → ∞) is not possible when c = x_0 (unlike for MULAN or VDM), we evaluate these models in the discrete-time setting with T = 1000. We also ablate T = 10k and T = 100k for c = x_0 to show that the VLB degrades with increasing T, whereas for VDM and MULAN it improves with increasing T; see Kingma et al. [20]. This empirical observation is consistent with our earlier mathematical insights. As these models consistently exhibit inferior performance w.r.t. VDM, in line with our initial conjectures, we refrain from training them beyond 300k iterations due to the substantial computational cost involved.

Figure 4: Test loss (bits/dim) versus training iterations for c = class labels (T = 1k), c = x_0 (T = 1k, 10k, 100k), and VDM. For c = class labels or c = x_0 the likelihood estimates are worse than VDM. For c = x_0, the VLB degrades with increasing T, but for VDM and MULAN it improves with increasing T. This empirical observation is consistent with our earlier mathematical insights. As these models consistently exhibit inferior performance w.r.t. VDM, in line with our initial conjectures, we refrain from training them beyond 300k iterations due to the substantial computational cost involved.

Table 7: Likelihood in bits per dimension (BPD) (mean and 95% confidence interval) on the test set of CIFAR-10, computed using the VLB estimate.

Parameterization | Num training steps | CIFAR-10 (↓)
Noise parameterization | 10M | 2.60 ± 10⁻³
v-parameterization | 8M | 2.59 ± 10⁻³

Appendix E MULAN: MUltivariate Latent Auxiliary variable Noise Schedule

E.1 Parameterization in the reverse process

E.1.1 Noise parameterization

Since the forward pass is given by x_t = α_t(z) ⊙ x_0 + σ_t(z) ⊙ ε_t, we can write the noise ε_t in terms of x_0 and x_t as
ε_t = (x_t − α_t(z) ⊙ x_0) / σ_t(z). (59)
Following Dhariwal & Nichol [10] and Kingma et al. [20], instead of parameterizing x_θ(x_t, z, t) directly with a neural network, we use Eq. 59 to parameterize the denoising model in terms of a noise-prediction model ε_θ(x_t, z, t):
ε_θ(x_t, z, t) = (x_t − α_t(z) ⊙ x_θ(x_t, z, t)) / σ_t(z). (60)

E.1.2 Velocity parameterization

Following Salimans & Ho [45], Zheng et al.
[65], we explore another parameterization of the denoising network, given by
v_θ(x_t, z, t) = (α_t(z) ⊙ x_t − x_θ(x_t, z, t)) / σ_t(z).
In practice, v-parameterization leads to better performance than noise parameterization, as illustrated in Table 7.

E.2 Polynomial Noise Schedule

Let f(x; ψ) be a scalar-valued polynomial of degree n with coefficients ψ ∈ R^{n+1}, expressed as f(x; ψ) = ψ_n x^n + ψ_{n−1} x^{n−1} + · · · + ψ_1 x + ψ_0, and denote its derivative with respect to x by f′(x; ψ). We would like to find the least n such that f(x; ψ) satisfies the following properties:
1. f(x; ψ) is monotonically increasing, i.e. f′(x; ψ) ≥ 0 for all x ∈ R and all ψ ∈ R^{n+1}.
2. f′(x; ψ) can vanish at two distinct points, i.e. f′(x_1; ψ) = 0 and f′(x_2; ψ) = 0 for some distinct x_1, x_2 and some ψ ∈ R^{n+1}.
For the first condition to hold, we can design f′(x; ψ) to be a perfect square (with real or imaginary roots), so that f′(x; ψ) ≥ 0 for all x ∈ R and ψ ∈ R^{n+1}. This means that f′(x; ψ) is a polynomial of even degree, i.e. its degree can take the values 2, 4, . . . . Note also that at least half of the roots of f′(x; ψ) are repeated, since f′(x; ψ) is a perfect square: if f′(x; ψ) has degree 2, it has exactly one unique root (real or imaginary); if it has degree 4, it has at most two unique roots (real or imaginary); and so on. For the second condition to hold, f′(x; ψ) needs to have at least two unique roots for some ψ ∈ R^{n+1}. For this reason f′(x; ψ) must be a polynomial of degree 4, and can be written as f′(x; ψ) = (ψ_3 x² + ψ_2 x + ψ_1)². This ensures that there exists ψ ∈ R^5 such that f′(x; ψ) = 0 at two points in x ∈ R, and f′(x; ψ) ≥ 0 for all ψ ∈ R^5. Thus, f(x; ψ) takes the following functional form:
f(x; ψ) = ∫ (ψ_3 x² + ψ_2 x + ψ_1)² dx = (ψ_3²/5) x⁵ + (ψ_3 ψ_2 / 2) x⁴ + ((ψ_2² + 2 ψ_3 ψ_1)/3) x³ + ψ_2 ψ_1 x² + ψ_1² x + constant. (62)
For the above reasons we express γ_ϕ(c, t) : R^m × [0, 1] → R^d as a degree-5 polynomial in t. We define neural networks a_ϕ(c) : R^m → R^d, b_ϕ(c) : R^m → R^d, and d_ϕ(c) : R^m → R^d with parameters ϕ. Let f_ϕ : R^m × [0, 1] → R^d be defined as
f_ϕ(c, t) = (a_ϕ²(c)/5) t⁵ + (a_ϕ(c) b_ϕ(c)/2) t⁴ + ((b_ϕ²(c) + 2 a_ϕ(c) d_ϕ(c))/3) t³ + b_ϕ(c) d_ϕ(c) t² + d_ϕ²(c) t,
where the multiplication and division operations are elementwise. The noise schedule γ_ϕ(c, t) is then given by
γ_ϕ(c, t) = γ_min + (γ_max − γ_min) · f_ϕ(c, t) / f_ϕ(c, t = 1). (63)
Notice that γ_ϕ(c, t) has the following properties:
• It is increasing in t ∈ [0, 1], which is crucial as mentioned in Sec. 3.5.
• Its endpoints at t = 0 and t = 1 can be specified by the user via γ_min and γ_max; specifically, γ_ϕ(c, t = 0) = γ_min 1_d and γ_ϕ(c, t = 1) = γ_max 1_d.
• Its time derivative ∂_t γ_ϕ(c, t) can be zero at two points in t ∈ [0, 1]. This is not a necessary condition, but it is convenient to have a flexible noise schedule whose time derivative can vanish at the beginning and the end of the diffusion process.
(A short numerical sketch of this parameterization is given below.)

E.3 Variational Lower Bound

In this section we derive the VLB. For ease of reading, we use the notation x_t to denote x_{t(i)} and x_{t−1} to denote x_{t(i−1)} ≡ x_{s(i)} in the following derivation.
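As a concrete illustration of Eq. 63 above, the following minimal sketch (assumptions: plain JAX, and the endpoint values γ_min = −13.3, γ_max = 5.0 reported in Sec. G.3) evaluates the degree-5 schedule from hypothetical coefficient vectors a, b, d standing in for a_ϕ(c), b_ϕ(c), d_ϕ(c):

```python
import jax.numpy as jnp

def f_poly(a, b, d, t):
    # Antiderivative of (a t^2 + b t + d)^2, applied elementwise (cf. Eq. 62).
    return (a**2 / 5.0) * t**5 + (a * b / 2.0) * t**4 \
        + ((b**2 + 2.0 * a * d) / 3.0) * t**3 + (b * d) * t**2 + (d**2) * t

def gamma_poly(a, b, d, t, gamma_min=-13.3, gamma_max=5.0):
    # Monotone per-dimension schedule with fixed endpoints (cf. Eq. 63).
    # Assumes (a, b, d) are not all zero, so f_poly(a, b, d, 1.0) is nonzero.
    return gamma_min + (gamma_max - gamma_min) * f_poly(a, b, d, t) / f_poly(a, b, d, 1.0)

# Toy coefficients for a 2-dimensional schedule, evaluated on a grid of times.
a = jnp.array([1.0, 0.5]); b = jnp.array([-1.0, 0.2]); d = jnp.array([0.3, 0.8])
print(gamma_poly(a, b, d, jnp.linspace(0.0, 1.0, 5)[:, None]))
```

Because ∂f_ϕ/∂t = (a t² + b t + d)² ≥ 0 elementwise, γ is non-decreasing in t, and the normalization by f_ϕ(c, 1) pins the endpoints to exactly γ_min and γ_max.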
log pθ(z, x0:T ) qϕ(z, x1:T |x0) log pθ(x0:T 1|z, x T ) qϕ(z, x1:T |x0) log pθ(x T ) log pθ(z) log pθ(x0:T 1|z, x T ) qϕ(x1:T |z, x0) log 1 qϕ(z|x0) log pθ(x T ) log pθ(z) log pθ(x0:T 1|z, x T ) qϕ(x1:T |z, x0) log pθ(x T ) log pθ(z) qϕ(z|x0) t=1 log pθ(xt 1|z, xt) qϕ(xt|xt 1, z, x0) log pθ(x T ) log pθ(z) qϕ(z|x0) log pθ(x0|z, x1) qϕ(x1|x0, z) t=2 log pθ(xt 1|z, xt) qϕ(xt|xt 1, z, x0) log pθ(x T ) log pθ(z) qϕ(z|x0) log pθ(x0|z, x1) qϕ(x1|x0, z) t=2 log pθ(xt 1|z, xt)qϕ(xt 1|z, x0) qϕ(xt 1|xt, z, x0)qϕ(xt|z, x0) log pθ(x T ) log pθ(z) qϕ(z|x0) log pθ(x0|z, x1) qϕ(x1|x0, z) t=2 log pθ(xt 1|z, xt) qϕ(xt 1|xt, z, x0) t=2 log qϕ(xt 1|z, x0) qϕ(xt|z, x0) log pθ(x T ) log pθ(z) qϕ(z|x0) log pθ(x0|z, x1) qϕ(x1|x0, z) t=2 log pθ(xt 1|z, xt) qϕ(xt 1|xt, z, x0) log q(x1|z, x0) qϕ(x T |z, x0) log pθ(x T ) log pθ(z) qϕ(z|x0) log pθ(x0|z, x1) t=2 log pθ(xt 1|z, xt) qϕ(xt 1|xt, z, x0) log 1 qϕ(x T |z, x0) log pθ(x T ) log pθ(z) qϕ(z|x0) log pθ(x0|z, x1) t=2 log pθ(xt 1|z, xt) qϕ(xt 1|xt, z, x0) log pθ(x T ) qϕ(x T |z, x0) log pθ(z) qϕ(z|x0) log pθ(x0|z, x1) | {z } Lrecons t=2 DKL[pθ(xt 1|z, xt) qϕ(xt 1|xt, z, x0)] | {z } Ldiffusion DKL[pθ(x T ) qϕ(x T |z, x0)] | {z } Lprior + DKL[pθ(z) q(z|x0)] | {z } Llatent Switching back to the notation used throughout the paper, the VLB is given as: log pθ(x0|z, x1) | {z } Lrecons i=2 DKL[pθ(xs(i)|z, xt(i)) qϕ(xs(i)|xt(i), z, x0)] | {z } Ldiffusion DKL[pθ(x1) qϕ(x1|z, x0)] | {z } Lprior + DKL[pθ(z) qϕ(z|x0)] | {z } Llatent E.4 Diffusion Loss To derive the diffusion loss, Ldiffusion in Eq. 9, we first derive an expression for DKL(qϕ(xs|z, xt, x0) pθ(xs|z, xt)) using Eq. 4 and Eq. 6 in the following manner (details in Suppl. E): DKL(qϕ(xs|z, xt, x0) pθ(xs|z, xt)) (µqϕ µp) Σ 1 θ (µqϕ µp) + tr ΣqϕΣ 1 p In log |Σqϕ| 2 (x0 xθ) diag(ν(z, s) ν(z, t))(x0 xθ) (66) Let lim T T(νs(z) νt(z)) = tν(z, t) be the partial derivative of the vector ν(z, t) w.r.t scalar t. Then we derive the diffusion loss, Ldiffusion, for the continuous case in the following manner (for brevity we use the notation s for s(i) = (i 1)/T and t for t(i) = i/T): = lim T 1 2 i=2 Eϵ N(0,In)DKL(q(xs|xt, x0, z) pθ(xs|xt, z)) Using Eq. 66 we get, = lim T 1 2 i=2 Eϵ N(0,In)(x0 xθ(xt, t(i))) diag (ν(s(i), z) ν(t(i), z)) (x0 xθ(xt, t(i))) 2Eϵ N(0,In) i=2 T(x0 xθ(xt, t(i))) diag (ν(s(i), z) ν(t(i), z)) (x0 xθ(xt, t(i))) 1 Using the fact that lim T T (ν(s, z) ν(z, t)) = tν(t, z) we get, 2Et {0,...,1} (x0 xθ(xt, t)) ( tνt(z)) (x0 xθ(xt, t)) Substituting x0 = α 1 t (z)(xt σt(z)ϵt) from Eq. 59 and Substituting xθ(xt, z, t) = α 1 t (z)(xt σt(z)ϵθ(xt, t)) from Eq. 60 we get, (ϵt ϵθ(xt, t)) σ2 t (z) α2 t(z) tνt(z) (ϵt ϵθ(xt, t)) Let ν 1(z, t) denote the reciprocal of the values in the vector ν(z, t). 2Et [0,1] (ϵt ϵθ(xt, t)) diag ν 1(t)(z) tνt(z) (ϵt ϵθ(xt, t)) Substituting ν(z, t) = exp( γ(z, t)) from Sec. E.1.1 2Et [0,1] (ϵt ϵθ(xt, t)) diag (exp (γ(z, t)) t exp ( γ(z, t))) (ϵt ϵθ(xt, t)) 2Et [0,1] (ϵt ϵθ(xt, t)) diag (exp (γ(z, t)) exp ( γ(z, t)) tγ(z, t)) (ϵt ϵθ(xt, t)) 2Et [0,1] (ϵt ϵθ(xt, t)) diag ( tγ(z, t)) (ϵt ϵθ(xt, t)) (67) E.5 Recovering VDM from the Vectorized Representation of the diffusion loss Notice that we recover the loss function in VDM when ν(z, t) = ν(t)1d where νt R+ and 1d represents a vector of 1s of size d and the noising schedule isn t conditioned on z. 
∫_0^1 ⟨f_θ(x_0, ν(z, t)), d/dt ν(z, t)⟩ dt = ∫_0^1 ⟨f_θ(x_0, ν(t)), d/dt (ν(t) 1_d)⟩ dt = ∫_0^1 ν′(t) f_θ(x_0, ν(t))^T 1_d dt = ∫_0^1 ν′(t) ‖x_0 − x_θ(x_{ν(t)}, ν(t))‖²₂ dt, (68)
and ∫_0^1 (d/dt ν(t)) ‖x_0 − x_θ(x_{ν(t)}, ν(t))‖²₂ dt is precisely the diffusion loss L_diffusion used in VDM; see Kingma et al. [20].

Figure 5: (a) Imagine piloting a plane across a region with cyclones and strong winds, as shown in Fig. 5. Plotting a direct, straight-line course through these adverse weather conditions requires more fuel and effort due to increased resistance. By navigating around the cyclones and winds, however, the plane reaches its destination with less energy, even if the route is longer. This intuition translates into mathematical and physical terms. The plane's trajectory is denoted by r(t) ∈ R^n_+, while the forces acting on it are represented by f(r(t)) ∈ R^n. The work required to navigate is given by ∫_0^1 ⟨f(r(t)), d/dt r(t)⟩ dt. Here, the work depends on the trajectory because f(r(t)) is not a conservative field. (b) This concept also applies to the diffusion NELBO. From Eq. 12, it is clear that the trajectory r(t) is parameterized by the noise schedule ν(z, t), which is influenced by complex forces f (analogous to weather patterns), represented by the dimension-wise reconstruction error of the denoising model, (x_0 − x_θ(x_t, z, t))². Thus, the diffusion loss L_diffusion can be interpreted as the work done along the trajectory ν(z, t) in the presence of these vector-field forces f. By learning the noise schedule, we can avoid high-resistance paths (those where the loss accumulates rapidly), thereby minimizing the overall energy expended, as measured by the NELBO.

Appendix F Subset Sampling

Sampling a subset of k items from a collection of n items x_1, x_2, . . . , x_n belongs to a category of algorithms called reservoir algorithms. In weighted reservoir sampling, every x_i is associated with a weight w_i ≥ 0. Let S_wrs = [i_1, i_2, . . . , i_k] be a tuple of sampled indices. The probability associated with sampling this sequence is
p(S_wrs|w) = (w_{i_1}/Z) · (w_{i_2}/(Z − w_{i_1})) · · · (w_{i_k}/(Z − ∑_{j=1}^{k−1} w_{i_j})), (69)
where Z = ∑_i w_i. Efraimidis & Spirakis [13] give an algorithm for weighted reservoir sampling in which each item is assigned a random key r_i = u_i^{1/w_i}, where u_i is drawn uniformly from [0, 1] and w_i is the weight of item x_i. Let TopK(r, k) be the operator that takes keys r = [r_1, r_2, . . . , r_n] and returns the sequence of indices [i_1, i_2, . . . , i_k] of the k largest keys. Efraimidis & Spirakis [13] proved that TopK(r, k) is distributed according to p(S_wrs|w). Let S ∈ {0, 1}^n represent a subset with exactly k non-zero elements equal to 1. Then the probability of sampling S is given by
p(S|w) = ∑_{S_wrs ∈ Π(S)} p(S_wrs|w), (70)
where Π(S) denotes all possible permutations (orderings) of the elements of S. By ignoring the ordering of the elements in S_wrs, we can sample subsets using the same algorithm. Xie & Ermon [61] show that this sampling algorithm is equivalent to TopK(r̂, k), where r̂ = [r̂_1, r̂_2, . . . , r̂_n] with r̂_i = −log(−log(r_i)) = log w_i + Gumbel(0, 1). This holds because the monotonic transformation −log(−log(x)) preserves the ordering of the keys, and thus TopK(r, k) ≡ TopK(r̂, k).

Sum of Gamma Distribution. Niepert et al. [34] show that adding Sum-of-Gamma (SoG) noise instead of Gumbel noise leads to better performance.
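A minimal sketch of the perturb-and-top-k construction described above (illustrative only, not the paper's implementation); `log_w` denotes the logits log w, and Gumbel(0, 1) noise is used here, with the Sum-of-Gamma noise discussed next as a possible drop-in replacement:

```python
import jax
import jax.numpy as jnp

def gumbel_top_k_subset(key, log_w, k):
    """Sample a k-hot subset: perturb the logits with Gumbel(0,1) noise and take
    the top-k indices (the Efraimidis & Spirakis / Xie & Ermon construction)."""
    g = jax.random.gumbel(key, log_w.shape)          # Gumbel(0, 1) noise
    _, idx = jax.lax.top_k(log_w + g, k)             # indices of the k largest keys
    return jnp.zeros_like(log_w).at[idx].set(1.0)    # k-hot indicator vector

# Example: sample 3 items out of 8 with uniform weights.
key = jax.random.PRNGKey(0)
z = gumbel_top_k_subset(key, jnp.zeros(8), k=3)
```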
Niepert et al. [34] show that sampling z ∼ p_θ(z; θ) is equivalent to computing z = arg max_{y ∈ Y} ⟨θ + ε_g, y⟩, where ε_g is a sample from the Sum-of-Gamma distribution given by
SoG(k, τ, s) = (τ/k) ( ∑_{i=1}^{s} Gamma(1/k, k/i) − log s ), (71)
where s is a positive integer and Gamma(α, β) is the Gamma distribution with shape α and scale β. Hence, given logits log w, we sample a k-hot vector using TopK(log w + ε_g, k). We choose a categorical prior with a uniform distribution across the n classes; the KL loss term is then the KL divergence between the distribution over subsets and this uniform prior.

Appendix G Experiment Details

G.1 Model Architecture

Denoising network. Our model architecture is very similar to VDM: the UNet of our pixel-space diffusion model has an architecture unchanged from Kingma et al. [20]. This structure is specifically designed for optimal performance in maximum-likelihood training. We employ features from VDM such as the elimination of internal downsampling/upsampling and the integration of Fourier features to enhance fine-scale prediction accuracy. In alignment with the configurations suggested by Kingma et al. [20], our approach varies by dataset: for CIFAR-10 we employ a U-Net with a depth of 32 and 128 channels; for ImageNet-32 the U-Net also has a depth of 32, but the channel count is increased to 256. All of these models use a dropout rate of 0.1 in their intermediate layers.

Encoder network. q_ϕ(z|x) is modeled using a sequence of 4 ResNet blocks with a channel count of 128 for CIFAR-10 and 256 for ImageNet-32, with a dropout of 0.1 in their intermediate layers.

Noise schedule. For the polynomial noise schedule, we use an MLP that maps the latent vector z to a_ϕ(z), b_ϕ(z), d_ϕ(z); see Sec. E.2 for details. The MLP has 2 hidden layers of size 3072 with the swish activation function. The final layer is a linear mapping to 3 × 3072 values corresponding to a_ϕ(z), b_ϕ(z), d_ϕ(z); each of a_ϕ(z), b_ϕ(z), d_ϕ(z) has dimensionality 3072 (a small illustrative sketch of this head is given just before Fig. 6 below).

G.2 Hardware

For the ImageNet experiments, we used a single GPU node with 8 A100s. For the CIFAR-10 experiments, the models were trained on 4 GPUs spanning several GPU types (V100s, A5000s, A6000s, and 3090s) with float32 precision.

G.3 Hyperparameters

We follow the same default training settings as Kingma et al. [20]. For all our experiments, we use the Adam [21] optimizer with learning rate 2 × 10⁻⁴, exponential decay rates β_1 = 0.9 and β_2 = 0.99, and a decoupled weight decay [29] coefficient of 0.01. We also maintain an exponential moving average (EMA) of the model parameters with an EMA rate of 0.9999 for evaluation. For the remaining hyperparameters, we use fixed values at the start and end times, γ_min = −13.3 and γ_max = 5.0, as used in Kingma et al. [20] and Zheng et al. [65].

Appendix H Datasets and Visualizations

In this section we provide a brief description of the datasets used in the paper and visualize the generated samples and the noise schedules.

H.1 CIFAR-10

The CIFAR-10 dataset [25] is a collection of 60,000 32×32 color images in 10 different classes, each class representing a distinct object or scene. The dataset is divided into 50,000 training images and 10,000 test images, with each class equally represented in both sets. The classes in CIFAR-10 are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Randomly generated samples for the CIFAR-10 dataset are provided in Fig. 6a for MULAN and Fig. 6b for VDM. We visualize the noise schedule in Fig. 13.
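Returning to the noise-schedule head described in Sec. G.1 above, a minimal plain-JAX sketch (hypothetical parameter names, not the released code) might look as follows:

```python
import jax
import jax.numpy as jnp

def noise_schedule_head(params, z):
    """Hypothetical sketch of the MLP in Sec. G.1: two hidden layers of width 3072
    with swish activations, then a linear map to 3 * 3072 outputs that are split
    into the polynomial coefficients a_phi(z), b_phi(z), d_phi(z) of Sec. E.2."""
    h = jax.nn.swish(z @ params["W1"] + params["b1"])
    h = jax.nn.swish(h @ params["W2"] + params["b2"])
    out = h @ params["W3"] + params["b3"]        # shape (..., 3 * 3072)
    a, b, d = jnp.split(out, 3, axis=-1)         # each of dimensionality 3072
    return a, b, d
```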
(a) MULAN with velocity reparameterization after 8M training iterations. (b) VDM after 10M training iterations.
Figure 6: CIFAR-10 samples generated by different methods.

H.2 ImageNet-32

ImageNet-32 is a dataset derived from ImageNet [7], where the original images have been resized to a resolution of 32×32. This dataset comprises 1,281,167 training samples and 50,000 test samples, distributed across 1,000 labels. Randomly generated samples for the ImageNet dataset are provided in Fig. 7 for MULAN and Fig. 8 for VDM. We visualize the noise schedule in Fig. 13.

Figure 7: MULAN with noise parameterization after 2M training iterations.
Figure 8: VDM after 2M training iterations.

H.3 Frequency

To see whether MULAN learns different noise schedules for parts of an image with different frequency content, we modify the CIFAR-10 dataset: for each image, we randomly remove either its low-frequency component or its high-frequency component with equal probability. Fig. 9a shows the training samples. MULAN was trained for 500K steps. The samples generated by MULAN are shown in Fig. 9b, and the corresponding noise schedules are shown in Fig. 13. Compared to CIFAR-10, we notice that the spatial variation of the noise schedule increases (the SNRs of the pixels form a wider band), while the variance of the noise schedule across instances decreases slightly.

(a) Training samples. (b) Samples generated by MULAN with noise parameterization after 500K training iterations.
Figure 9: Frequency Split CIFAR-10 dataset.

H.4 Frequency-2

To see whether MULAN learns different noise schedules for images with different frequency content, we again modify the CIFAR-10 dataset by randomly removing either the low-frequency or the high-frequency component of each image with equal probability. Fig. 10a shows the training samples. MULAN was trained for 500K steps. The samples generated by MULAN are shown in Fig. 10b, and the corresponding noise schedules are shown in Fig. 13. Compared to CIFAR-10, we notice that the spatial variation of the noise schedule increases (the SNRs of the pixels form a wider band), and the variance of the noise schedule across instances increases as well.

(a) Training samples. (b) Samples generated by MULAN with noise parameterization after 500K training iterations.
Figure 10: Frequency Split-2 CIFAR-10 dataset.

H.5 CIFAR-10: Intensity

To see whether MULAN learns different noise schedules for images with different intensities, we modify the CIFAR-10 dataset by randomly converting each image into a low-intensity or a high-intensity image with equal probability. The original CIFAR-10 images take values in [0, 255]. To create a low-intensity image we multiply all pixel values by 0.5; to create a high-intensity image we multiply all pixel values by 0.5 and add 127.5. Fig. 11a shows the training samples. MULAN was trained for 500K steps. The samples generated by MULAN are shown in Fig. 11b, and the corresponding noise schedules are shown in Fig. 13. Compared to CIFAR-10, we notice that the spatial variation of the noise schedule increases slightly (the SNRs of the pixels form a wider band), while the variance of the noise schedule across instances decreases slightly.

(a) Training samples. (b) Samples generated by MULAN with noise parameterization after 500K training iterations.
Figure 11: Intensity CIFAR-10 dataset.

H.6 Mask

We modify the CIFAR-10 dataset by randomly masking (i.e., replacing with zeros) either the top half or the bottom half of each image with equal probability.
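A minimal sketch of the intensity and mask modifications described in Secs. H.5 and H.6 (illustrative only; it assumes images are float arrays with values in [0, 255] and shape H×W×C):

```python
import jax
import jax.numpy as jnp

def intensity_variant(key, img):
    """Randomly map an image to a low- or high-intensity version (Sec. H.5)."""
    low = 0.5 * img                    # low-intensity: halve all pixel values
    high = 0.5 * img + 127.5           # high-intensity: halve and shift upward
    return jnp.where(jax.random.bernoulli(key), high, low)

def mask_variant(key, img):
    """Randomly zero out the top or bottom half of an HxWxC image (Sec. H.6)."""
    h = img.shape[0]
    top_masked = img.at[: h // 2].set(0.0)
    bottom_masked = img.at[h // 2 :].set(0.0)
    return jnp.where(jax.random.bernoulli(key), top_masked, bottom_masked)
```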
Fig. 12a shows the training samples. MULAN was trained for 500K steps. The samples generated by MULAN are shown in Fig. 12b, and the corresponding noise schedules are shown in Fig. 13. Compared to CIFAR-10, we notice that the spatial variation of the noise schedule increases slightly (the SNRs of the pixels form a wider band), while the variance of the noise schedule across instances decreases.

(a) Training samples. (b) Samples generated by MULAN with noise parameterization after 500K training iterations.
Figure 12: Mask CIFAR-10 dataset.

Figure 13: Signal-to-noise ratio (noise schedules) for the different datasets: CIFAR-10, ImageNet, Frequency Split CIFAR-10, Intensity Split CIFAR-10, Frequency Split-2 CIFAR-10, and Mask CIFAR-10.

Appendix I Likelihood Estimation

We used both the variational lower bound (VLB) and ODE-based methods to compute BPD.

I.1 VLB Estimate

In the VLB-based approach, we employ Eq. 9. To compute L_diffusion, we use T = 128 in Eq. 10, discretizing the timesteps t ∈ [0, 1] into 128 bins.

I.2 Exact likelihood computation using Probability Flow ODE

A diffusion process whose marginal is given by (the same as in Eq. 2)
q(x_t|x_0) = N(x_t; α_t x_0, diag(σ²_t)),  x_0 ∼ q_0(x_0), (73)
can be modeled as the solution to an Itô stochastic differential equation (SDE):
dx_t = f(t) x_t dt + g(t) dw_t,  x_0 ∼ q_0(x_0), (74)
where f(t) ∈ R^d and g(t) ∈ R^d take the following expressions [53]:
f(t) = d/dt log α_t,  g²(t) = d/dt σ²_t − 2 σ²_t d/dt log α_t.
The corresponding reverse process, Eq. 4, can also be modeled by an equivalent reverse-time SDE:
dx_t = [f(t) x_t − g²(t) ∇_{x_t} log q(x_t|x_0)] dt + g(t) dw̄_t,  x_1 ∼ p_θ(x_1), (75)
where w̄ is a standard Wiener process when time flows backwards from 1 to 0, and dt is an infinitesimal negative timestep. Song et al. [53] show that the marginals of Eq. 75 can be described by the following ordinary differential equation (ODE) in the reverse process:
dx_t = [f(t) x_t − ½ g²(t) ∇_{x_t} log q(x_t)] dt. (76)
This ODE, also called the probability flow ODE, allows us to compute the exact likelihood of any input via the instantaneous change-of-variables formula proposed in Chen et al. [2]. Note that during the reverse process the marginal q(x_t) is unknown and is approximated by the parameterized model p_θ(x_t). For the probability flow defined in Eq. 76, Chen et al. [2] show that the log-likelihood of p_θ(x_0) can be computed as
log p_θ(x_0) = log p_θ(x_1) + ∫_{t=0}^{t=1} tr(∇_{x_t} h_θ(x_t, t)) dt, (77)
where h_θ(x_t, t) ≜ f(t) x_t − ½ g²(t) ∇_{x_t} log p_θ(x_t).

I.2.1 Probability Flow ODE for MULAN

Similarly, for the forward process conditioned on the auxiliary latent variable z,
q_ϕ(x_t|x_0, z) = N(x_t; α_t(z) x_0, diag(σ²_t(z))),  x_0 ∼ q_0(x_0),  z ∼ q_ϕ(z|x_0), (78)
we can extend Eq. 74 as
dx_t = f(z, t) ⊙ x_t dt + g(z, t) ⊙ dw_t,  x_0 ∼ q_0(x_0),  z ∼ q_ϕ(z|x_0), (79)
to obtain the corresponding SDE formulation. Notice that the random variable z in the above equation does not carry a subscript t: z is drawn from q_ϕ(z|x_0) once, and the same z is used as x_0 diffuses to x_1. The expressions for f(z, t) : R^m × [0, 1] → R^d and g(z, t) : R^m × [0, 1] → R^d are as follows:
f(z, t) = d/dt log α_t(z),  g²(z, t) = d/dt σ²_t(z) − 2 σ²_t(z) d/dt log α_t(z).
Recall that α²_t(z) = sigmoid(−γ_ϕ(z, t)) and σ²_t(z) = sigmoid(γ_ϕ(z, t)).
Substituting these in the above equations, the expressions for f(z, t) and g2(z, t) simplify to the following: f(z, t) = 1 2sigmoid(γϕ(z, t)) d dtγϕ(z, t), g2(z, t) = sigmoid(γϕ(z, t)) d The corresponding reverse-time SDE is given as: dxt = [f(t) g(t)2 xt log qϕ(xt|x0, z)]dt + g(t)d wt, x1 pθ(x1), z pθ(z), (80) where w is a standard Wiener process when time flows backwards from 1 0, and dt is an infinitesimal negative timestep. Given, sθ(xt, z), an approximation to the true score function, xt log qϕ(xt|x0, z), Song et al. [53] show that the marginals of Eq. 80 can be described by the following Ordinary Differential Equation (ODE): dxt = f(z, t) 1 2g2(z, t)sθ(xt, z) dt, (81) Zheng et al. [65] show that the score function, sθ(xt, z), for the noise and the v-parameterization is given as follows: sθ(xt, z) = σt(z) Noise parameterization; see Sec. E.1.1 (82a) 2γϕ(z, t) vθ(xt, z, t) v-parameterization; see Sec. E.1.2 (82b) Applying the change of variables formula [2] on Eq. 81, log pθ(x0|z) can be computed in the following manner: log pθ(x0|z) = log pθ(x1) Z t=1 t=0 tr ( xthθ(xt, z, t)) dt, (83) where hθ(xt, z, t) f(z, t) 1 2g2(z, t)sθ(xt, z) The expression for log-likelihood (Eq. 8) is as follows, log pθ(x0) Eqϕ(z|x0)[log pθ(x0|z)] DKL(qϕ(z|x0) pθ(z)) Using Eq. 83, = Eqϕ(z|x0) log pθ(x1) Z t=1 t=0 tr ( xthθ(xt, t, z)) dt DKL(qϕ(z|x0) pθ(z)) Computing tr ( xthθ(xt, t, z)) is expensive and we follow Chen et al. [2], Zheng et al. [65], Grathwohl et al. [15] to estimate it with Skilling-Hutchinson trace estimator [50, 19]. In particular, we have tr ( xthθ(xt, t, z)) = Ep(ϵ) ϵ xthθ(xt, t, z)ϵ , (85) where the random variable ϵ satisfies Ep(ϵ)[ϵ] = 0 and Covp(ϵ)[ϵ] = I. Common choices for p(ϵ) include Rademacher or Gaussian distributions. Notably, the term xthθ(xt, t, z)ϵ can be computed efficiently using Jacobian-vector-product computation in JAX. In our experiments, we follow the exact evaluation procedure for computing likelihood as outlined in Song et al. [53], Grathwohl et al. [15]. Specifically, for the computation of Eq. 85, we employ a Rademacher distribution for p(ϵ). To calculate the integral in Eq. 84, we utilize the RK45 ODE solver [11] provided by scipy.integrate.solve_ivp with atol=1e-5 and rtol=1e-5. I.2.2 Dequantization. Real-world datasets for images or texts often consist of discrete data. Attempting to learn a continuous density model directly on these discrete data points can lead to degenerate outcomes [56] and fail to provide meaningful density estimations. Dequantization [46, 16, 65] is a common solution in such cases. To elaborate, let x0 represent 8-bit discrete data scaled to [-1, 1]. Dequantization methods assume that we have trained a continuous model distribution pθ for x0, and define the discrete model distribution by [ 1 256 , 1 256 )d pθ(x0 + u)du. To train Pθ(x0) by maximum likelihood estimation, variational dequantization [16, 65] introduces a dequantization distribution q(u|x0) and jointly train pmodel and q(u|x0) by a variational lower bound: log Pθ(x0) Eq(u|x0)[pθ(x0 + u) log q(u|x0)]. (86) Truncated Normal Dequantization. Zheng et al. [65] show that truncated Normal distribution, q(u|x0) = T N 0, I, 1 with mean 0, covariance I, and bounds 1 256, 1 256 along each dimension, leads to a better likelihood estimate. Thus, Eq. 86 simplifies to the following (for details please refer to section A. in Zheng et al. [65]): log Pθ(x0) Eˆϵ T N(0,I, τ,τ) 2(1 + log(2πσ2 ϵ )) 0.01522 d (87) αϵ = exp( 1 σϵ = sqrt(sigmoid( 13.3)), and τ = 3. 
log pθ x0 + σϵ αϵˆϵ is evaluated using Eq. 84. Importance Weighted Estimator. Eq. 87 can also be extended to obtain an importance weighted likelihood estimator to get a tighter bound on the likelihood. The variational bound is given by (for details please refer to section A. in Zheng et al. [65]): log Pθ(x0) Eˆϵ(1),...,ˆϵ(K) T N(0,I, τ,τ) + d log σϵ (88) αϵ = exp( 1 2 13.3), log σϵ = 1 2( 13.3 + softplus( 13.3)), q(ˆϵ) = 1 (2πZ)2 exp 1 , Z = 0.9974613, and τ = 3. Note that for K = 1, Eq. 88 is equivalent to Eq. 87; see Zheng et al. [65]. log pθ x0 + σϵ αϵˆϵ is evaluated using Eq. 84. In Table 8, we report BPD values for MULAN on CIFAR10 (8M training steps, v-parameterization) and Image Net (2M training steps, noise parameterization) using both the VLB-based approach, and the ODE-based approach with K = 1 and K = 20 importance samples. Table 8: NLL (mean and 95% Confidence Interval for MULAN) on CIFAR10 (8M training steps, v-parameterization) and Image Net (2M training steps, noise parameterization) using both the VLBbased approach, and the ODE-based approach. K = 1 means that we do not use importance weighted estimator since Eq. 88 is equivalent to Eq. 87 for this case; see Zheng et al. [65]. Likelihood Estimation type CIFAR-10 ( ) Imagenet ( ) VLB-based 2.59 10 3 3.71 10 3 ODE-based (K = 1; Eq. 87) 2.59 3 10 4 3.71 10 3 ODE-based (K = 20; Eq. 88) 2.55 3 10 4 3.67 10 3 Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: See our introduction for a list of claims including getting SOTA results on density estimation. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Yes, our model does not get state of the art FID due to it not having a lower frequency bias. See the paper for more details. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. 
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: Please see our detailed proofs.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: Not only do we show all equations and train on standard datasets, we will open-source the code.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [No]
Justification: We will open source after paper acceptance.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: Yes, we include all hyperparameters in the paper and will open source code.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: We report the deviations for BPD in Table 2.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We provide this in the paper.
9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: Our paper is just a diffusion model useful for compression.
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification:
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: We do not believe our method carries a high risk of misuse: our models are not perceptually state-of-the-art; they only achieve state-of-the-art log-likelihood.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: We are using standard benchmark datasets.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification:
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: No crowdsourcing or research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.