# learning_energybased_models_by_diffusion_recovery_likelihood__6edda373.pdf Published as a conference paper at ICLR 2021 LEARNING ENERGY-BASED MODELS BY DIFFUSION RECOVERY LIKELIHOOD Ruiqi Gao UCLA ruiqigao@ucla.edu Yang Song Stanford University yangsong@cs.stanford.edu Ben Poole Google Brain pooleb@google.com Ying Nian Wu UCLA ywu@stat.ucla.edu Diederik P. Kingma Google Brain durk@google.com While energy-based models (EBMs) exhibit a number of desirable properties, training and sampling on high-dimensional datasets remains challenging. Inspired by recent progress on diffusion probabilistic models, we present a diffusion recovery likelihood method to tractably learn and sample from a sequence of EBMs trained on increasingly noisy versions of a dataset. Each EBM is trained with recovery likelihood, which maximizes the conditional probability of the data at a certain noise level given their noisy versions at a higher noise level. Optimizing recovery likelihood is more tractable than marginal likelihood, as sampling from the conditional distributions is much easier than sampling from the marginal distributions. After training, synthesized images can be generated by the sampling process that initializes from Gaussian white noise distribution and progressively samples the conditional distributions at decreasingly lower noise levels. Our method generates high fidelity samples on various image datasets. On unconditional CIFAR-10 our method achieves FID 9.58 and inception score 8.30, superior to the majority of GANs. Moreover, we demonstrate that unlike previous work on EBMs, our long-run MCMC samples from the conditional distributions do not diverge and still represent realistic images, allowing us to accurately estimate the normalized density of data even for high-dimensional datasets. Our implementation is available at https://github.com/ruiqigao/recovery_likelihood. 1 INTRODUCTION EBMs (Le Cun et al., 2006; Ngiam et al., 2011; Kim & Bengio, 2016; Zhao et al., 2016; Goyal et al., 2017; Xie et al., 2016b; Finn et al., 2016; Gao et al., 2018; Kumar et al., 2019; Nijkamp et al., 2019b; Du & Mordatch, 2019; Grathwohl et al., 2019; Desjardins et al., 2011; Gao et al., 2020; Che et al., 2020; Grathwohl et al., 2020; Qiu et al., 2019; Rhodes et al., 2020) are an appealing class of probabilistic models, which can be viewed as generative versions of discriminators (Jin et al., 2017; Lazarow et al., 2017; Lee et al., 2018; Grathwohl et al., 2020), yet can be learned from unlabeled data. Despite a number of desirable properties, two challenges remain for training EBMs on highdimensional datasets. First, learning EBMs by maximum likelihood requires Markov Chain Monte Carlo (MCMC) to generate samples from the model, which can be extremely expensive. Second, as pointed out in Nijkamp et al. (2019a), the energy potentials learned with non-convergent MCMC do not have a valid steady-state, in the sense that samples from long-run Markov chains can differ greatly from observed samples, making it difficult to evaluate the learned energy potentials. Another line of work, originating from Sohl-Dickstein et al. (2015), is to learn from a diffused version of the data, which are obtained from the original data via a diffusion process that sequentially adds Gaussian white noise. From such diffusion data, one can learn the conditional model of the data at a certain noise level given their noisy versions at the higher noise level of the diffusion process. After learning the sequence of conditional models that invert the diffusion process, one can then generate synthesized images from Gaussian white noise images by ancestral sampling. Building on Published as a conference paper at ICLR 2021 Figure 1: Generated samples on LSUN 1282 church outdoor (left), LSUN 1282 bedroom (center) and Celeb A 642 (right). Sohl-Dickstein et al. (2015), Ho et al. (2020) further developed the method, obtaining strong image synthesis results. Inspired by Sohl-Dickstein et al. (2015) and Ho et al. (2020), we propose a diffusion recovery likelihood method to tackle the challenge of training EBMs directly on a dataset by instead learning a sequence of EBMs for the marginal distributions of the diffusion process. The sequence of marginal EBMs are learned with recovery likelihoods that are defined as the conditional distributions that invert the diffusion process. Compared to standard maximum likelihood estimation (MLE) of EBMs, learning marginal EBMs by diffusion recovery likelihood only requires sampling from the conditional distributions, which is much easier than sampling from the marginal distributions. After learning the marginal EBMs, we can generate synthesized images by a sequence of conditional samples initialized from the Gaussian white noise distribution. Unlike Ho et al. (2020) that approximates the reverse process by normal distributions, in our case the conditional distributions are derived from the marginal EBMs, which are more flexible. The framework of recovery likelihood was originally proposed in Bengio et al. (2013). In our work, we adapt it to learning the sequence of marginal EBMs from the diffusion data. Our work is also related to the denoising score matching method of Vincent (2011), which was further developed by Song & Ermon (2019; 2020) for learning from diffusion data. The training objective used for diffusion probabilisitic models is a weighted version of the denoising score matching objective, as revealed by Ho et al. (2020). These methods learn the score functions (the gradients of the energy functions) directly, instead of using the gradients of learned energy functions as in EBMs. On the other hand, Saremi et al. (2018) parametrizes the score function as the gradient of a MLP energy function, and Saremi & Hyvarinen (2019) further unifies denoising score matching and neural empirical Bayes. We demonstrate the efficacy of diffusion recovery likelihood on CIFAR-10, Celeb A and LSUN datasets. The generated samples are of high fidelity and comparable to GAN-based methods. On CIFAR-10, we achieve FID 9.58 and inception score 8.30, exceeding existing methods of learning explicit EBMs to a large extent. We also demonstrate that diffusion recovery likelihood outperforms denoising score matching from diffusion data if we naively take the gradients of explicit energy functions as the score functions. More interestingly, by using a thousand diffusion time steps, we demonstrate that even very long MCMC chains from the sequence of conditional distributions produce samples that represent realistic images. With the faithful long-run MCMC samples from the conditional distributions, we can accurately estimate the marginal partition function at zero noise level by importance sampling, and thus evaluate the normalized density of data under the EBM. Published as a conference paper at ICLR 2021 Figure 3: Illustration of diffusion recovery likelihood on 2D checkerboard example. Top: progressively generated samples. Bottom: estimated marginal densities. 2 BACKGROUND Let x pdata(x) denote a training example, and pθ(x) denote a model s probability density function that aims to approximates pdata(x). An energy-based model (EBM) is defined as: Zθ exp(fθ(x)), (1) where Zθ = R exp(fθ(x))dx is the partition function, which is analytically intractable for highdimensional x. For images, we parameterize fθ(x) with a convolutional neural network with a scalar output. The energy-based model in equation 1 can, in principle, be learned through MLE. Specifically, suppose we observe samples xi pdata(x) for i = 1, 2, ..., n. The log-likelihood function is i=1 log pθ(xi) .= Ex pdata[log pθ(x)]. (2) In MLE, we seek to maximize the log-likelihood function, where the gradient approximately follows (Xie et al., 2016b) θDKL(pdata pθ) = Ex pdata θfθ(x) Ex pθ θfθ(x) . (3) The expectations can be approximated by averaging over the observed samples and the synthesized samples drawn from the model distribution pθ(x) respectively. Generating synthesized samples from pθ(x) can be done with Markov Chain Monte Carlo (MCMC) such as Langevin dynamics (or Hamiltonian Monte Carlo (Girolami & Calderhead, 2011)), which iterates xτ+1 = xτ + δ2 2 xfθ(xτ) + δϵτ, (4) Figure 2: Comparison of learning EBMs by diffusion recovery likelihood (Ours) versus marginal likelihood (Short-run). where τ indexes the time, δ is the step size, and ϵτ N(0, I). The difficulty lies in the fact that for highdimensional and multi-modal distributions, MCMC sampling can take a long time to converge, and the sampling chains may have difficulty traversing modes. As demonstrated in Figure 2, training EBMs with synthesized samples from non-convergent MCMC results in malformed energy landscapes (Nijkamp et al., 2019b), even if the samples from the model look reasonable. 3 RECOVERY LIKELIHOOD 3.1 FROM MARGINAL TO CONDITIONAL Given the difficulty of sampling from the marginal density pθ(x), following Bengio et al. (2013), we use the recovery likelihood defined by the density of the Published as a conference paper at ICLR 2021 observed sample conditional on a noisy sample perturbed by isotropic Gaussian noise. Specifically, let x = x + σϵ be the noisy observation of x, where ϵ N(0, I). Suppose pθ(x) is defined by the EBM in equation 1, then the conditional EBM can be derived as pθ(x| x) = 1 Zθ( x) exp fθ(x) 1 2σ2 x x 2 , (5) where Zθ( x) = R exp fθ(x) 1 2σ2 x x 2 dx is the partition function of this conditional EBM. See Appendix A.1 for the derivation. Compared to pθ(x) (equation 1), the extra quadratic term 1 2σ2 x x 2 in pθ(x| x) constrains the energy landscape to be localized around x, making the latter less multi-modal and easier to sample from. As we will show later, when σ is small, pθ(x| x) is approximately a single mode Gaussian distribution, which greatly reduces the burden of MCMC. A more general formulation is x = ax + σϵ, where a is a positive constant. In that case, we can let y = ax, and treat y as the observed sample. Assume pθ(y) = 1 Zθ exp(fθ(y)), then by change of variable, the density function of x can be derived as gθ(x) = apθ(ax). 3.2 MAXIMIZING RECOVERY LIKELIHOOD With the conditional EBM, assume we have observed samples xi pdata(x) and the corresponding perturbed samples xi = xi + σϵi for i = 1, ..., n. We define the recovery log-likelihood function as i=1 log pθ(xi| xi). (6) The term recovery indicates that we attempt to recover the clean sample xi from the noisy sample xi. Thus, instead of maximizing L(θ) in equation 2, we can maximize J (θ), whose distributions are easier to sample from. Specifically, we generate synthesized samples by K steps of Langevin dynamics that iterates xτ+1 = xτ + δ2 2 ( xfθ(xτ) + 1 σ2 ( x xτ)) + δϵτ. (7) The model is then updated following the same learning gradients as MLE (equation 3), because the quadratic term 1 2σ2 x x 2 is not related to θ. Following the classical analysis of MLE, we can show that the point estimate given by maximizing recovery likelihood is an unbiased estimator of the true parameters, which means that given enough data, a rich enough model and exact synthesis, maximizing the recovery likelihood learns θ such that pdata(x) = pθ(x). See Appendix A.2 for a theoretical explanation. 3.3 NORMAL APPROXIMATION TO CONDITIONAL When the variance of perturbed noise σ2 is small, pθ(x| x) can be approximated by a normal distribution via a first order Taylor expansion at x. Specifically, the negative conditional energy is Eθ(x| x) = fθ(x) 1 2σ2 x x 2 (8) .= fθ( x) + xfθ( x), x x 1 2σ2 x x 2 (9) 2σ2 x ( x + σ2 xfθ( x)) 2 + c, (10) where c include terms irrelevant of x (see Appendix A.3 for a detailed derivation). In the above approximation, we do not perform second order Taylor expansion because σ2 is small, and x x 2/2σ2 will dominate all the second order terms from Taylor expansion. Thus we can approximate pθ(x| x) by a Gaussian approximation epθ(x| x): epθ(x| x) = N x; x + σ2 xfθ( x), σ2 . (11) We can sample from this distribution using: xgen = x + σ2 xfθ( x) + σϵ, (12) where ϵ N(0, I). This resembles a single step of Langevin dynamics, except that σϵ is replaced by 2σϵ in Langevin dynamics. This normal approximation has two traits: (1) it verifies the fact that the conditional density pθ(x| x) can be generally easier to sample from when σ is small; (2) it provides hints of choosing the step size of Langevin dynamics, as discussed in section 3.5. Published as a conference paper at ICLR 2021 3.4 CONNECTION TO VARIATIONAL INFERENCE AND SCORE MATCHING The normal approximation to the conditional distribution leads to a natural connection to diffusion probabilistic models (Sohl-Dickstein et al., 2015; Ho et al., 2020) and denoising score matching (Vincent, 2011; Song & Ermon, 2019; 2020; Saremi et al., 2018; Saremi & Hyvarinen, 2019). Specifically, in diffusion probabilistic models, instead of modeling pθ(x) as an energy-based model, it recruits variational inference and directly models the conditional density as pθ(x| x) N x + σ2sθ( x), σ2 , (13) which is in agreement with the normal approximation (equation 11), with sθ(x) = xfθ(x). On the other hand, the training objective of denoising score matching is to minimize 1 2σ2 Ep(x, x)[ x ( x + σ2sθ( x)) 2], (14) where sθ(x) is the score of the density of x. This objective is in agreement with the objective of maximizing log-likelihood of the normal approximation (equation 10), except that for normal approximation, xfθ( ) is the score of density of x, instead of x. However, the difference between the scores of density of x and x is of O(σ2), which is negligible when σ is sufficiently small (see Appendix A.4 for details). We can further show that the learning gradient of maximizing log-likelihood of the normal approximation is approximately the same as the learning gradient of maximizing the original recovery log-likelihood with one step of Langevin dynamics (see Appendix A.5). It indicates that the training process of maximizing recovery likelihood agrees with the one of diffusion probabilistic models and denoising score matching when σ is small. As the normal approximation is accurate only when σ is small, it requires many time steps in the diffusion process for this approximation to work well, which is also reported in Ho et al. (2020) and Song & Ermon (2020). In contrast, the diffusion recovery likelihood framework can be more flexible in choosing the number of time steps and the magnitude of σ. 3.5 DIFFUSION RECOVERY LIKELIHOOD As we discuss, sampling from pθ(x| x) becomes simple only when σ is small. In the extreme case when σ , pθ(x| x) converges to the marginal distribution pθ(x), which is again highly multimodal and difficult to sample from. To keep σ small and meanwhile equip the model with the ability to generate new samples initialized from white noise, inspired by Sohl-Dickstein et al. (2015) and Ho et al. (2020), we propose to learn a sequence of recovery likelihoods, on gradually perturbed observed data based on a diffusion process. Specifically, assume a sequence of perturbed observations x0, x1, ..., x T such that x0 pdata(x); xt+1 = q 1 σ2 t+1xt + σt+1ϵt+1, t = 0, 1, ...T 1. (15) The scaling factor q 1 σ2 t+1 ensures that the sequence is a spherical interpolation between the observed sample and Gaussian white noise. Let yt = q 1 σ2 t+1xt, and we assume a sequence of conditional EBMs pθ(yt|xt+1) = 1 Zθ,t(xt+1) exp fθ(yt, t) 1 2σ2 t+1 xt+1 yt 2 , t = 0, 1, ..., T 1, (16) where fθ(yt, t) is defined by a neural network conditioned on t. We follow the learning algorithm in section 3.2. A question is how to determine the step size schedule δt of Langevin dynamics. Inspired by the sampling procedure of the normal approximation (equation 12), we set the step size δt = bσt, where b < 1 is a tuned hyperparameter. This schedule turns out to work well in practice. Thus the K steps of Langevin dynamics iterates yτ+1 t = yτ t + b2σ2 t 2 ( yfθ(yτ t , t) + 1 σ2 t (xt+1 yτ t )) + bσtϵτ. (17) Algorithm 1 summarizes the training procedure. After training, we initialize the MCMC sampling from Gaussian white noise, and the synthesized sample at each time step serves to initialize the MCMC that samples from the model of the previous time step. See algorithm 2. To show the efficacy of our method, Figures 3 and 2 display several 2D toy examples learned by diffusion recovery likelihood. Published as a conference paper at ICLR 2021 Algorithm 1 Training Sample t Unif({0, ..., T 1}). Sample pairs (yt, xt+1). Set synthesized sample y t = xt+1. for τ 1 to K do Update y t according to equation 17. end for Update θ following the gradients θfθ(yt, t) θfθ(y t , t). until converged. Algorithm 2 Progressive sampling Sample x T N(0, I). for t T 1 to 0 do yt = xt+1. for τ 1 to K do Update yt according to equation 17. end for xt = yt/ q 1 σ2 t+1. end for return x0. 4 EXPERIMENTS To show that diffusion recovery likelihood is flexible for diffusion process of various magnitudes of noise, we test the method under two settings: (1) T = 6, with K = 30 steps of Langevin dynamic per time step; (2) T = 1000, with sampling from normal approximation. (2) resembles the noise schedule of Ho et al. (2020) and the magnitude of noise added at each time step is much smaller compared to (1). For both settings, we set σ2 t to increase linearly. The network structure of fθ(x, t) is based on Wide Res Net (Zagoruyko & Komodakis, 2016) and we remove weight normalization. t is encoded by Transformer sinusoidal position embedding as in (Ho et al., 2020). For (1), we find that adding another scaling factor ct to the step size δt helps. Architecture and training details are in Appendix B. Henceforth we simply refer the two settings as T6 and T1k. 4.1 IMAGE GENERATION Figures 1 and 4 display uncurated samples generated from learned models on CIFAR-10, Celeb A 64 64, LSUN 64 64 and 128 128 datasets under T6 setting. The samples are of high fidelity and comparable to GAN-based methods. Appendix C.5 provides more generated samples. Tables 1 and 3 summarize the quantitative evaluations on CIFAR-10 and Celeb A datasets, in terms of Frechet Inception Distance (FID) (Heusel et al., 2017) and inception scores (Salimans et al., 2016). On CIFAR-10, our model achieves FID 9.58 and inception score 8.30, which outperforms existing methods of learning explicit energy-based models to a large extent, and is superior to a majority of GAN-based methods. On Celeb A, our model obtains results comparable with the state-of-the-art GAN-based methods, and outperforms score-based methods (Song & Ermon, 2019; 2020). Note that the score-based methods (Song & Ermon, 2019; 2020) and diffusion probabilistic models (Ho et al., 2020) directly parametrize and learn the score of data distribution, whereas our goal is to learn explicit energy-based models. Figure 4: Generated samples on unconditional CIFAR-10 (left) and LSUN 642 church outdoor (center) and LSUN 642 bedroom (right). Published as a conference paper at ICLR 2021 Table 1: FID and inception scores on CIFAR-10. Model FID Inception WGAN-GP (Gulrajani et al., 2017) 36.4 7.86 .07 SNGAN (Miyato et al., 2018) 21.7 8.22 .05 SNGAN-DDLS (Che et al., 2020) 15.42 9.09 .10 Style GAN2-ADA (Karras et al., 2020) 3.26 9.74 .05 Score-based NCSN (Song & Ermon, 2019) 25.32 8.87 .12 NCSN-v2 (Song & Ermon, 2020) 10.87 8.40 .07 DDPM (Ho et al., 2020) 3.17 9.46 .11 Explicit EBM-conditional Coop Nets (Xie et al., 2019) - 7.30 EBM-IG (Du & Mordatch, 2019) 37.9 8.30 JEM (Grathwohl et al., 2019) 38.4 8.76 Explicit EBM Muli-grid (Gao et al., 2018) 40.01 6.56 Coop Nets (Xie et al., 2016a) 33.61 6.55 EBM-SR (Nijkamp et al., 2019b) - 6.21 EBM-IG (Du & Mordatch, 2019) 38.2 6.78 Ours (T6) 9.58 8.30 .11 Table 2: Ablation of training objectives, time steps T and sampling steps K on CIFAR-10. K = 0 indicates that we sample from the normal approximation. Setting / Objective FID Inception T = 1, K = 180 32.12 6.72 0.12 T = 1000, K = 0 22.58 7.71 0.08 T = 1000, K = 0 (DSM) 21.76 7.76 0.11 T = 6, K = 10 - - T = 6, K = 30 9.58 8.30 0.11 T = 6, K = 50 9.36 8.46 0.13 Table 3: FID scores on Celeb A 642. QA-GAN (Parimala & Channappayya, 2019) 6.42 COCO-GAN (Lin et al., 2019) 4.0 NVAE (Vahdat & Kautz, 2020) 14.74 NCSN (Song & Ermon, 2019) 25.30 NCSN-v2 (Song & Ermon, 2020) 10.23 EBM-SR (Nijkamp et al., 2019b) 23.02 EBM-Triangle (Han et al., 2020) 24.70 Ours (T6) 5.98 Figure 5: Interpolation results between the leftmost and rightmost generated samples. For top to bottom: LSUN church outdoor 1282, LSUN bedroom 1282 and Celeb A 642. Table 4: Test bits per dimension on CIFAR10. indicates that we estimate the bit per dimension with the approximated log partition function instead of analytically computing it. See section 4.2. DDPM (Ho et al., 2020) 3.70 Glow (Kingma & Dhariwal, 2018) 3.35 Flow++ (Ho et al., 2019) 3.08 GPixel CNN (Van den Oord et al., 2016) 3.03 Sparse Transformer (Child et al., 2019) 2.80 Dist Aug (Jun et al., 2020) 2.56 Ours (T1k) 3.18 Figure 6: Image inpainting on LSUN church outdoor 1282 (left) and Celeb A 642 (right). With each block, the top row are mask images while the bottom row are inpainted images. Interpolation. As shown in Figure 5, our model is capable of smooth interpolation between two generated samples. Specifically, for two samples x(0) 0 and x(1) 0 , we do a sphere interpolation between Published as a conference paper at ICLR 2021 the initial white noise images x(0) T and x(1) T and the noise terms of Langevin dynamics ϵ(0) t,τ and ϵ(1) t,τ for every sampling step at every time step. More interpolation results can be found in Appendix C.3. Image inpainting. A promising application of energy-based models is to use the learned model as a prior model for image processing, such as image inpainting, denoising and super-resolution (Gao et al., 2018; Du & Mordatch, 2019; Song & Ermon, 2019). In Figure 6, we demonstrate that the learned models by maximizing recovery likelihoods are capable of realistic and semantically meaningful image inpainting. Specifically, given a masked image and the corresponding mask, we first obtain a sequence of perturbed masked images at different noise levels. The inpainting can be easily achieved by running Langevin dynamics progressively on the masked pixels while keeping the observed pixels fixed at decreasingly lower noise levels. Additional image inpainting results can be found in Appendix C.4. Ablation study. Table 2 summarizes the results of ablation study on CIFAR-10. We investigate the effect of changing the numbers of time steps T and sampling steps K. First, to show that it is beneficial to learn by diffusion recovery likelihood, we compare against a baseline approach (T = 1, K = 180) where we use only one time step, so that the recovery likelihood becomes marginal likelihood. The approach is adopted by Nijkamp et al. (2019b) and Du & Mordatch (2019). For fair comparison, we equip the baseline method the same budget of MCMC sampling as our T6 setting (i.e., 180 sampling steps). Our method outperforms this baseline method by a large margin. Also the models are trained more efficiently as the number of sampling steps per iteration is reduced and amortized by time steps. Next, we report the sample quality of setting T1k. We test two training objectives for this setting: (1) maximizing recovery likelihoods (T = 1000, K = 0) and (2) maximizing the approximated normal distributions (T=1000, K=0 (DSM)). As mentioned in section 3.4, (2) is equivalent to the training objectives of denoising score matching (Song & Ermon, 2019; 2020) and diffusion probabilistic model (Ho et al., 2020), except that the score functions are taken as the gradients of explicit energy functions. In practice, for a direct comparison, (2) follows the same implementation as in Ho et al. (2020), except that the score function is parametrized as the gradients of the explicit energy function used in our method. (1) and (2) achieve similar sample quality in terms of quantitative metrics, where (2) results in a slightly better FID score yet a slightly worse inception score. This verifies the fact that the training objectives of (1) and (2) are consistent. Both (1) and (2) performs worse than setting T6. A possible explanation is that the sampling error may accumulate with many time steps, so that a more flexible schedule of time steps accompanied with certain amount of sampling steps is preferred. Last, we examine the influence of varying the number of sampling steps while fixing the number of time steps. The training becomes unstable when the number of sampling steps are not enough (T = 6, K = 10), and more sampling steps lead to better sample quality. However, since K = 50 does not gain significant improvement versus K = 30, yet of much higher computational cost, we keep K = 30 for image generation on all datasets. See Appendix C.1 for a plot of FID scores over iterations. 4.2 LONG-RUN CHAIN ANALYSIS Besides achieving high quality generation, a perhaps equally important aspect of learning EBMs is to obtain a faithful energy potential. A principle way to check the validity of the learned potential is to perform long-run sampling chains and see if the samples still remain realistic. However, as pointed out in Nijkamp et al. (2019a), almost all existing methods of learning EBMs fail in getting realistic long-run chain samples. In this subsection, we demonstrate that by composing a thousand diffusion time steps (T1k setting), we can form steady long-run MCMC chains for the conditional distributions. First we prepare a faithful sampler for conducting long-run sampling. Specifically, after training the model under T1k setting by maximizing diffusion recovery likelihood, for each time step, we first sample from the normal approximation and count it as one sampling step, and then use Hamiltonian Monte Carlo (HMC) (Neal et al., 2011) with 2 leapfrog steps to perform the consecutive sampling steps. To obtain a reasonable schedule of sampling step size, for each time step we adaptively adjust the step size of HMC to make the average acceptance rate range in [0.6, 0.9], which is computed Published as a conference paper at ICLR 2021 Figure 7: Left: Adjusted step size of HMC over time step. Center: Acceptance rate over time step. Right: Estimated log partition function over number of samples with different number of sampling steps per time step. The x axis is plotted in log scale. over 1000 chains for 100 steps. Figure 7 displays the adjusted step size (left) and acceptance rate (center) over time step. The adjusted step size increases logarithmically. With this step size schedule, we generate long-run chains from the learned sequence of conditional distributions. As shown in Figure 8, images remain realistic for even 100k sampling steps in total (i.e., 100 sampling steps per time step), resulting in FID 24.89. This score is close to the one computed on samples generated by 1k steps (i.e., sampled from normal approximation), which is 25.12. As a further check, we recruit a No-U-Turn Sampler (Hoffman & Gelman, 2014) with the same step size schedule as HMC to perform long-run sampling, where the samples also remain realistic. See Appendix C.2 for details. Figure 8: Long-run chain samples from model-T1k with different total amount of HMC steps. From left to right: 1k steps, 10k steps and 100k steps. More interestingly, given the faithful long-run MCMC samples from the conditional distributions, we can estimate the log ratio of the partition functions of the marginal distributions, and further estimate the partition function of pθ(y0). The strategy is based on annealed importance sampling (Neal, 2001). See Appendix A.6 for the implementation details. The right subfigure of Figure 7 depicts the estimated log partition function of pθ(y0) over the number of MCMC samples used. To verify the estimation strategy and again check the long-run chain samples, we conduct multiple runs using samples generated with different numbers of HMC steps and display the estimation curves. All the curves saturate to values close to each other at the end, indicating the stability of long-run chain samples and the effectiveness of the estimation strategy. With the estimated partition function, by change of variable, we can estimate the normalized density of data as gθ(x0) = p 1 σ2 1pθ( p 1 σ2 1x0). We report test bits per dimension on CIFAR-10 in Table 4. Note that the result should be taken with a grain of salt, because the partition function is estimated by samples and as shown in Appendix A.6, it is a stochastic lower bound of the true value, that will converge to the true value when the number of samples grows large. 5 CONCLUSION We propose to learn EBMs by diffusion recovery likelihood, a variant of MLE applied to diffusion processes. We achieve high quality image synthesis, and with a thousand noise levels, we obtain faithful long-run MCMC samples that indicate the validity of the learned energy potentials. Since this method can learn EBMs efficiently with small budget of MCMC, we are also interested in scaling it up to higher resolution images and investigating this method in other data modalities in the future. ACKNOWLEDGEMENT The work was done while Ruiqi Gao and Yang Song were interns at Google Brain during the summer of 2020. The work of Ying Nian Wu is supported by NSF DMS-2015577. We thank Alexander A. Alemi, Jonathan Ho, Tim Salimans and Kevin Murphy for their insightful discussions during the course of this project. Published as a conference paper at ICLR 2021 Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. In Advances in neural information processing systems, pp. 899 907, 2013. Tong Che, Ruixiang Zhang, Jascha Sohl-Dickstein, Hugo Larochelle, Liam Paull, Yuan Cao, and Yoshua Bengio. Your gan is secretly an energy-based model and you should use discriminator driven latent sampling. ar Xiv preprint ar Xiv:2003.06060, 2020. Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. ar Xiv preprint ar Xiv:1904.10509, 2019. Guillaume Desjardins, Yoshua Bengio, and Aaron C Courville. On tracking the partition function. In Advances in neural information processing systems, pp. 2501 2509, 2011. Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. ar Xiv preprint ar Xiv:1903.08689, 2019. Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. ar Xiv preprint ar Xiv:1611.03852, 2016. Ruiqi Gao, Yang Lu, Junpei Zhou, Song-Chun Zhu, and Ying Nian Wu. Learning generative convnets via multi-grid modeling and sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9155 9164, 2018. Ruiqi Gao, Erik Nijkamp, Diederik P Kingma, Zhen Xu, Andrew M Dai, and Ying Nian Wu. Flow contrastive estimation of energy-based models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7518 7528, 2020. Mark Girolami and Ben Calderhead. Riemann manifold langevin and hamiltonian monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2): 123 214, 2011. Anirudh Goyal Alias Parth Goyal, Nan Rosemary Ke, Surya Ganguli, and Yoshua Bengio. Variational walkback: Learning a transition operator as a stochastic recurrent net. In Advances in Neural Information Processing Systems, pp. 4392 4402, 2017. Will Grathwohl, Kuan-Chieh Wang, J orn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. ar Xiv preprint ar Xiv:1912.03263, 2019. Will Grathwohl, Kuan-Chieh Wang, Jorn-Henrik Jacobsen, David Duvenaud, and Richard Zemel. Cutting out the middle-man: Training and evaluating energy-based models without sampling. ar Xiv preprint ar Xiv:2002.05616, 2020. Roger B Grosse, Siddharth Ancha, and Daniel M Roy. Measuring the reliability of mcmc inference with bidirectional monte carlo. In Advances in Neural Information Processing Systems, pp. 2451 2459, 2016. Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767 5777, 2017. Tian Han, Erik Nijkamp, Linqi Zhou, Bo Pang, Song-Chun Zhu, and Ying Nian Wu. Joint training of variational auto-encoder and latent energy-based model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7978 7987, 2020. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626 6637, 2017. Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flowbased generative models with variational dequantization and architecture design. ar Xiv preprint ar Xiv:1902.00275, 2019. Published as a conference paper at ICLR 2021 Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. ar Xiv preprint ar Xiv:2006.11239, 2020. Matthew D Hoffman and Andrew Gelman. The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo. J. Mach. Learn. Res., 15(1):1593 1623, 2014. Long Jin, Justin Lazarow, and Zhuowen Tu. Introspective classification with convolutional nets. In Advances in Neural Information Processing Systems, pp. 823 833, 2017. Heewoo Jun, Rewon Child, Mark Chen, John Schulman, Aditya Ramesh, Alec Radford, and Ilya Sutskever. Distribution augmentation for generative modeling. In Proceedings of Machine Learning and Systems 2020, pp. 10563 10576. 2020. Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. ar Xiv preprint ar Xiv:2006.06676, 2020. Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. ar Xiv preprint ar Xiv:1606.03439, 2016. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215 10224, 2018. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009. Rithesh Kumar, Anirudh Goyal, Aaron Courville, and Yoshua Bengio. Maximum entropy generators for energy-based models. ar Xiv preprint ar Xiv:1901.08508, 2019. Justin Lazarow, Long Jin, and Zhuowen Tu. Introspective neural networks for generative modeling. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2774 2783, 2017. Yann Le Cun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006. Kwonjoon Lee, Weijian Xu, Fan Fan, and Zhuowen Tu. Wasserstein introspective neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3702 3711, 2018. Chieh Hubert Lin, Chia-Che Chang, Yu-Sheng Chen, Da-Cheng Juan, Wei Wei, and Hwann-Tzong Chen. Coco-gan: generation by parts via conditional coordinating. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4512 4521, 2019. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Large-scale celebfaces attributes (celeba) dataset. Retrieved August, 15:2018, 2018. Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. ar Xiv preprint ar Xiv:1802.05957, 2018. Radford M Neal. Annealed importance sampling. Statistics and computing, 11(2):125 139, 2001. Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo, 2(11):2, 2011. Jiquan Ngiam, Zhenghao Chen, Pang W Koh, and Andrew Y Ng. Learning deep energy models. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 1105 1112, 2011. Erik Nijkamp, Mitch Hill, Tian Han, Song-Chun Zhu, and Ying Nian Wu. On the anatomy of mcmcbased maximum likelihood learning of energy-based models. ar Xiv preprint ar Xiv:1903.12370, 2019a. Published as a conference paper at ICLR 2021 Erik Nijkamp, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. On learning non-convergent shortrun mcmc toward energy-based model. ar Xiv preprint ar Xiv:1904.09770, 2019b. KANCHARLA Parimala and Sumohana and Channappayya. Quality aware generative adversarial networks. In Advances in Neural Information Processing Systems, pp. 2948 2958, 2019. Yixuan Qiu, Lingsong Zhang, and Xiao Wang. Unbiased contrastive divergence algorithm for training energy-based latent variable models. In International Conference on Learning Representations, 2019. Benjamin Rhodes, Kai Xu, and Michael U Gutmann. Telescoping density-ratio estimation. Advances in Neural Information Processing Systems, 33, 2020. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234 2242, 2016. Saeed Saremi and Aapo Hyvarinen. Neural empirical bayes. Journal of Machine Learning Research, 20:1 23, 2019. Saeed Saremi, Arash Mehrjou, Bernhard Sch olkopf, and Aapo Hyv arinen. Deep energy estimator networks. ar Xiv preprint ar Xiv:1805.08306, 2018. Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. ar Xiv preprint ar Xiv:1503.03585, 2015. Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11918 11930, 2019. Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. ar Xiv preprint ar Xiv:2006.09011, 2020. Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical variational autoencoder. Advances in Neural Information Processing Systems, 33, 2020. Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems, pp. 4790 4798, 2016. Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661 1674, 2011. Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, and Ying Nian Wu. Cooperative training of descriptor and generator networks. ar Xiv preprint ar Xiv:1609.09408, 2016a. Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu. A theory of generative convnet. In International Conference on Machine Learning, pp. 2635 2644, 2016b. Jianwen Xie, Zilong Zheng, Xiaolin Fang, Song-Chun Zhu, and Ying Nian Wu. Cooperative training of fast thinking initializer and slow thinking solver for multi-modal conditional learning. ar Xiv preprint ar Xiv:1902.02812, 2019. Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. ar Xiv preprint ar Xiv:1506.03365, 2015. Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. ar Xiv preprint ar Xiv:1605.07146, 2016. Junbo Zhao, Michael Mathieu, and Yann Le Cun. Energy-based generative adversarial network. ar Xiv preprint ar Xiv:1609.03126, 2016. Published as a conference paper at ICLR 2021 A EXTENDED DERIVATIONS A.1 DERIVATION OF EQUATION 5 Let x = x + σϵ, where ϵ N(0, I). Given the marginal distribution of Zθ exp(fθ(x)), (18) We can derive the conditional distribution of x given x as pθ(x| x) = pθ(x)p( x|x)/p( x) (19) Zθ exp(fθ(x)) 1 (2πσ2) n 2 exp( 1 2σ2 x x 2)/p( x) (20) = 1 Zθ( x) exp fθ(x) 1 2σ2 x x 2 , (21) where we absorb all the terms that are irrelevant of x as Zθ( x). A.2 THEORETICAL UNDERSTANDING In this subsection, we analyze the asymptotic behavior of maximizing the recovery log-likelihood. For model class {pθ(x), θ}, suppose there exists θ such that pdata = pθ . According to the classical theory of MLE, let ˆθ0 be the point estimate by MLE. Then we have ˆθ is an unbiased estimator of θ with asymptotic normality: n(ˆθ0 θ ) N(0, I0(θ ) 1), (22) where I0(θ) = Ex pθ[ 2 θ log pθ(x)] is the Fisher information, and n is the number of observed samples. Let ˆθ be the point estimate given by maximizing recovery log-likelihood, we can derive a result in parallel to that of MLE: n(ˆθ θ ) N(0, I(θ ) 1), (23) where I(θ) = Epθ(x, x)[ 2 θ log pθ(x| x)]. The relationship between I0(θ) and I(θ) is that I0(θ) = I(θ) + Epθ(x, x)[ 2 θ log pθ( x)]. (24) Thus there is loss of information, but ˆθ is still an unbiased estimator of θ with asymptotic normality. A.3 DETAILED DERIVATION OF NORMAL APPROXIMATION Eθ(x| x) = fθ(x) 1 2σ2 x x 2 (25) .= fθ( x) + xfθ( x), x x 1 2σ2 x x 2 (26) 2σ2 x 2 2 x, x + x 2 + xfθ( x), x xfθ( x), x + fθ( x) (27) 2σ2 x 2 2 x + σ2 xfθ( x), x 1 2σ2 x 2 xfθ( x), x + fθ( x) (28) 2σ2 x ( x + σ2 xfθ( x)) 2 + c, (29) A.4 DIFFERENCE BETWEEN THE SCORES OF p(x) AND p( x) For notation clarity, with x = x + ϵ, we let ep be the distribution of x, and p be the distribution of x. Then for a smooth testing function with vanishing tails, E[h( x)] = E[h(x + ϵ)] (30) .= E[h(x) + h (x)ϵ + h (x)ϵ2/2] (31) = E[h(x)] + E[h (x)]σ2/2. (32) Published as a conference paper at ICLR 2021 Integral by parts, E[h (x)] = Z h (x)p(x)dx = Z h (x)p (x)dx = Z p (x)h(x)dx. (33) Thus we have the heat equation ep(x) = p(x) + p (x)σ2/2. (34) x log p(x) = x log p(x) + x log(1 + p (x)/p(x)σ2/2) (35) .= x log p(x) + x[p (x)/p(x)]σ2/2. (36) Thus the difference between the score of p and ep is of the order σ2, which is negligible when σ2 is small. A.5 LEARNING GRADIENTS OF NORMAL APPROXIMATION AND ORIGINAL RECOVERY LIKELIHOOD In this subsection we demonstrate that the learning gradient of maximizing likelihood of the normal approximation is approximately the same as the gradient of maximizing the original recovery likelihood with one step of Langevin sampling. Specifically, the gradient of the normal approximation of recovery log-likelihood for an observed xobs is 2σ2 xobs ( x + σ2f θ( x)) 2 = θf θ( x)(xobs ( x + σ2f θ( x)). (37) On the other hand, to maximize the original recovery likelihood, suppose we sample xsyn pθ(x| x), then the gradient ascent of the original recovery log-likelihood is θfθ(xobs) E[ θfθ(xsyn)] = hθ(xobs) E[hθ(xsyn)], (38) where hθ(x) = θfθ(x). Approximately, if we perform one step of Langevin dynamics from x to obtain xsyn, i.e., xsyn = x + σ2f θ( x) + 2σe, and assume fθ(x) is locally linear in x, then θfθ(xobs) E[ θfθ(xinit)] (39) = hθ(xobs) E[hθ( x + σ2f θ( x) + σe)] (40) .= hθ( x) + h θ( x)(xobs x) E[hθ( x) + h θ( x)(σ2f θ( x) + σe)] (41) = h θ( x)(xobs ( x + σ2f θ( x)) (42) = θf θ( x)(xobs ( x + σ2f θ( x)). (43) Comparing equations 37 and 43, we see that the two gradients agree with each other. A.6 ESTIMATING THE PARTITION FUNCTION We can utilize the sequence of learned distributions of yt (= q 1 σ2 t+1xt) to estimate the partition function. Specifically, the marginal distribution of yt is pθ(yt) = 1 Zθ,t exp (fθ(yt, t)) (44) We can estimate the ratio of the partition functions at two consecutive time steps using importance sampling Zθ,t Zθ,t+1 = Epθ(yt+1) [exp(fθ(y, t) fθ(y, t + 1))] (45) i=1 [exp(fθ(yt+1,i, t) fθ(yt+1,i, t + 1))] , (46) Published as a conference paper at ICLR 2021 where yt+1,i are samples generated by progressive sampling. Starting from t = T, where p T (x) follows Gaussian distribution, we can compute log Zθ,t along the reverse path of the diffusion process, until we reach t = 0: Zθ,0 = Zθ,T Zθ,t Zθ,t+1 . (47) In practice, since the ratio given by MCMC samples can vary across many orders of magnitude, it is more meaningful to estimate log Zθ,0 = log Zθ,T + t=0 log Zθ,t Zθ,t+1 . (48) Unfortunately, although equation 46 is an unbiased estimator of Zθ,t/Zθ,t+1, the logarithm of this estimator is generally a stochastic lower bound of log(Zθ,t/Zθ,t+1) (Grosse et al., 2016). However, as we show below, this bound will gradually converge to an unbiased estimator of log(Zθ,t/Zθ,t+1), as the number of samples becomes large. Specifically, let A be the estimator in equation 46, µ be the true value of Zθ,t/Zθ,t+1. We have E[A] = µ, then by second order Taylor expansion, E[log A] .= E log µ + 1 µ(A µ) 1 2µ2 (A µ)2 (49) = log µ 1 2µ2 Var(A). (50) By law of large number, Var(A) 0 as M , and thus E[log A] log µ. This is also consistent with the estimation curves in the right subfigure of Figure 7: since Var(A) 0, the estimation curve increases from below as the number of samples becomes larger. When the curve becomes stable, it indicates the convergence. B EXPERIMENTAL DETAILS Model architecture. Our network structure is based on Wide Res Net (Zagoruyko & Komodakis, 2016). Table 5 lists the detailed network structures of various resolutions. The number of Res Blocks at every level N is a hyperparameter that we sweep over. The values of N for various datasets are listed in Table 6. Each Res Block consists of two Conv2D layers. For the second Conv2D layer, we use zero initialization for the weights, and add a trainable channel-wise scaling parameter to the output. We remove the weight normalization, and use leaky Re LU (slope = 0.2) as the activation function in Res Blocks. Spectral normalization (Miyato et al., 2018) is used to regularize parameters in Conv2D layer, Res Blocks and Dense layer. For encoding time step t, we follow the scheme in (Ho et al., 2020). Specifically, the time step t is first transformed into sinusoidal embedding, and then two Dense layers is added. The time embedding is added after the first Conv2D layer of each Res Block. Training. We use Adam (Kingma & Ba, 2014) optimizer for all the experiments. We find that for high resolution images, using a smaller β1 in Adam help stabilize training. We use learning rate 0.0001 for all the experiments. For the values of β1, batch sizes and the number of training iterations for various datasets, see Table 6. Datasets. We use the following datasets in our experiments: CIFAR-10 (Krizhevsky et al., 2009), Celeb A (Liu et al., 2018) and LSUN (Yu et al., 2015). CIFAR-10 is of resolution 32 32, and contains 50, 000 training images and 10, 000 test images. Celeb A contains 202,599 face images, of which 162,770 are training images and 19,962 are test images. For processing, we first clip each image to 178 178 and then resize it to 64 64. For LSUN, we use church outdoor and bedroom categories, which contains 126,227 and 3,033,042 training images respectively. Both categories contain 300 test images. For processing, we first crop each image to a square image of the smaller size among the height and weight, and then we resize it to 64 64 or 128 128. For resizing, we set antialias to True. We apply horizontal random flip as data augmentation for all datasets during training. Published as a conference paper at ICLR 2021 Evaluation metrics. We use FID and inception scores as quantitative evaluation metrics of sample quality. On all the datasets, we calculate FID and inception scores on 50,000 samples using the original code from Salimans et al. (2016) and Heusel et al. (2017). Table 5: Model architectures of various solutions. N is a hyperparameter that we sweep over. (a) Resolution 32 32 3 3 Conv2D, 128 N Res Blocks, 128 Downsample 2 2 N Res Blocks, 256 Downsample 2 2 N Res Blocks, 256 Downsample 2 2 N Res Blocks, 256 Re LU, global sum Dense 1 (b) Resolution 64 64 3 3 Conv2D, 128 N Res Blocks, 128 Downsample 2 2 N Res Blocks, 256 Downsample 2 2 N Res Blocks, 256 Downsample 2 2 N Res Blocks, 256 Downsample 2 2 N Res Blocks, 512 Re LU, global sum Dense 1 (c) Resolution 128 128 3 3 Conv2D, 128 N Res Blocks, 128 Downsample 2 2 N Res Blocks, 256 Downsample 2 2 N Res Blocks, 256 Downsample 2 2 N Res Blocks, 256 Downsample 2 2 N Res Blocks, 512 Downsample 2 2 N Res Blocks, 512 Re LU, global sum Dense 1 (d) Time embedding (temb) sinusoidal embedding Dense, leaky Re LU (e) Res Block leaky Re LU, 3 3 Conv2D + Dense(leaky Re LU(temb)) leaky Re LU, 3 3 Conv2D Table 6: Hyperparameters of various datasets. Dataset N β1 in Adam Batch size Training iterations CIFAR-10 8 0.9 256 240k Celeb A 6 0.5 128 880k LSUN church outdoor 642 2 0.9 128 960k LSUN bedroom 642 2 0.9 128 760k LSUN church outdoor 1282 2 0.5 64 840k LSUN bedroom 1282 5 0.5 64 580k C ADDITIONAL EXPERIMENTAL RESULTS C.1 FID SCORES OVER ITERATIONS Figure 9 demonstrates FID scores computed on 2,500 samples every 15,000 iterations. C.2 LONG-RUN CHAIN SAMPLING WITH NUTS As a further check, we use a No-U-Turn Sampler (Hoffman & Gelman, 2014) to perform the longrun chain sampling, with the same step size schedule obtained for HMC sampler. Figure 10 displays samples with different number of sampling steps. The samples remain realistic after 100k sampling steps in total and the FID score remains stable. Published as a conference paper at ICLR 2021 Figure 9: FIDs for different number of Langevin steps. (a) 1k steps, FID=24.78 (b) 10k steps, FID=23.89 (c) 100k steps, FID=25.08 Figure 10: Long run chain samples with different total number of NUTS steps. C.3 ADDITIONAL INTERPOLATION RESULTS Figures 11, 12 and 13 display more examples of interpolation between two generated samples on Celeb A 642, LSUN church outdoor 1282 and LSUN bedroom 1282. Figure 11: Interpolation results between the leftmost and rightmost generated samples on Celeb A 64 64. Published as a conference paper at ICLR 2021 Figure 12: Interpolation results between the leftmost and rightmost generated samples on LSUN church outdoor 128 128. Figure 13: Interpolation results between the leftmost and rightmost generated samples on LSUN bedroom 128 128. C.4 ADDITIONAL IMAGE INPAINTING RESULTS Figures 14 and 15 show additional examples of image inpainting on Celeb A 642 and LSUN church outdoor 1282. C.5 ADDITIONAL UNCURATED SAMPLES Figures 16, 17, 18, 19, 20 and 21 show uncurated samples from the learned models under T6 setting on CIFAR-10, Celeb A 642, LSUN church outdoor 1282, LSUN bedroom 1282, LSUN church outdoor 642 and LSUN bedroom 642 datasets. Published as a conference paper at ICLR 2021 Figure 14: Image inpainting results on Celeb A 64 64. Top: masked images, bottom: inpainted images. Figure 15: Image inpainting results on LSUN church outdoor 128 128. Top: masked images, bottom: inpainted images. Published as a conference paper at ICLR 2021 Figure 16: Generated samples on CIFAR-10. Published as a conference paper at ICLR 2021 Figure 17: Generated samples on Celeb A 64 64. Published as a conference paper at ICLR 2021 Figure 18: Generated samples on LSUN church outdoor 128 128. FID=9.76 Published as a conference paper at ICLR 2021 Figure 19: Generated samples on LSUN bedroom 128 128. FID=11.27 Published as a conference paper at ICLR 2021 Figure 20: Generated samples on LSUN church outdoor 64 64. FID=7.02 Published as a conference paper at ICLR 2021 Figure 21: Generated samples on LSUN bedroom 64 64. FID=8.98