# Learning Energy-based Model via Dual-MCMC Teaching

Jiali Cui, Tian Han
Department of Computer Science, Stevens Institute of Technology
{jcui7,than6}@stevens.edu

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

This paper studies the fundamental learning problem of the energy-based model (EBM). Learning the EBM can be achieved using maximum likelihood estimation (MLE), which typically involves Markov Chain Monte Carlo (MCMC) sampling, such as Langevin dynamics. However, noise-initialized Langevin dynamics can be challenging in practice and hard to mix. This motivates the exploration of joint training with a generator model, where the generator serves as a complementary model to bypass MCMC sampling. However, such a method can be less accurate than MCMC and result in biased EBM learning. While the generator can also serve as an initializer model for better MCMC sampling, its learning can be biased since it only matches the EBM and has no access to empirical training examples. Such biased generator learning may limit the potential of learning the EBM. To address this issue, we present a joint learning framework that interweaves the maximum likelihood learning algorithm for both the EBM and the complementary generator model. In particular, the generator model is learned by MLE to match both the EBM and the empirical data distribution, making it a more informative initializer for MCMC sampling of the EBM. Learning the generator with observed examples typically requires inference of the generator posterior. To ensure accurate and efficient inference, we adopt MCMC posterior sampling and introduce a complementary inference model to initialize such latent MCMC sampling. We show that three separate models can be seamlessly integrated into our joint framework through two (dual-) MCMC teaching, enabling effective and efficient EBM learning.

## 1 Introduction

Deep generative models have made significant progress in learning complex data distributions [35, 18, 37, 34, 19, 22, 5] and have found successful applications in a wide range of real-world scenarios [26, 6, 11, 15]. Among these, the energy-based model (EBM) [5, 6, 29, 8, 39, 2, 3] has gained particular interest as a flexible and expressive generative model with an energy function parameterized by a neural network. Learning the EBM can be accomplished via maximum likelihood estimation (MLE), which involves Markov Chain Monte Carlo (MCMC) sampling in the high-dimensional data space. However, such MCMC sampling has been shown to be challenging [30, 7, 33, 13], as it may take a long time to mix between different local modes with a non-informative noise initialization [43, 15].

To address this challenge, recent advances have explored employing complementary models to substitute for MCMC sampling [15, 16, 12, 17, 24]. One notable example is the generator model. The generator model incorporates a top-down generation network that is capable of mapping the low-dimensional latent space to the high-dimensional data space and admits efficient sample generation. The generator is learned to match the EBM so that MCMC sampling can be replaced by generator ancestral sampling. However, such direct generator sampling has been shown to be less accurate and suboptimal [43]. To alleviate this issue, [42, 43] introduced cooperative learning, where samples generated by the generator model serve as initial points and are then refined by a finite-step MCMC revision process.
While this gradient-based MCMC revision process can be more accurate, the learned generator model relies solely on the EBM and has no access to the observed empirical data. As a result, this learning scheme may render biased generator learning, which in turn caps the potential of learning a strong EBM. An effective joint learning scheme for the EBM and its complementary generator model is needed, yet still in its infancy.

In this paper, we present a novel learning scheme that can seamlessly integrate the EBM and complementary models into a joint probabilistic framework. Specifically, both the EBM and the complementary generator model are learned to match the empirical data distribution, while the generator model, at the same time, is also learned to match the EBM. Learning the generator model with empirical training examples can be achieved with MLE, which typically requires access to the generator posterior as an inference process. To ensure effective and efficient inference, we employ MCMC posterior sampling with a complementary inference model learned as an initializer. Together with the MCMC sampling of the EBM being initialized by the generator model, these two MCMC samplings can be further used as two MCMC revision processes that teach the generator and inference model to absorb the MCMC-revised samples; we thus term our framework dual-MCMC teaching. We show that our joint framework is capable of teaching the complementary models and thus learning a strong EBM. Our contributions can be summarized as follows:

- We introduce a novel method that integrates the EBM and its complementary models into a joint learning scheme.
- We propose the use of dual-MCMC teaching for the generator and inference models to facilitate efficient yet accurate sampling and inference, which in turn leads to effective EBM learning.
- We conduct extensive experiments to demonstrate the superior performance of our EBM.

## 2 Preliminary

### 2.1 Energy-based Model

Let $x \in \mathbb{R}^D$ be the high-dimensional observed examples. The energy-based model (EBM) [41, 6, 5, 29] represents data uncertainty with an undirected probability density defined as

$$\pi_\alpha(x) = \frac{1}{Z(\alpha)} \exp\left[f_\alpha(x)\right], \quad (1)$$

where $f_\alpha(x)$ is the energy function parameterized with parameters $\alpha$, and $Z(\alpha) = \int_x \exp[f_\alpha(x)]\,dx$ is the partition function or normalizing constant.

**Maximum likelihood estimation.** The maximum likelihood estimation (MLE) is known for being an asymptotically optimal estimator and can be used for training the EBM. In particular, with observed examples $\{x^{(i)}, i = 1, 2, ..., n\}$, the MLE learning of the EBM maximizes the log-likelihood $L_\pi(\alpha) = \frac{1}{n}\sum_{i=1}^{n} \log \pi_\alpha(x^{(i)})$. If the sample size $n$ is large enough, the maximum likelihood estimator minimizes $KL(p_d(x)\,\|\,\pi_\alpha(x))$, the Kullback-Leibler (KL) divergence between the empirical data distribution $p_d(x)$ and the EBM distribution $\pi_\alpha(x)$. The gradient $\nabla_\alpha L_\pi(\alpha)$ is computed as

$$\max_\alpha L_\pi(\alpha) = \min_\alpha KL(p_d(x)\,\|\,\pi_\alpha(x)), \text{ where } \nabla_\alpha L_\pi(\alpha) = \mathbb{E}_{p_d(x)}[\nabla_\alpha f_\alpha(x)] - \mathbb{E}_{\pi_\alpha(x)}[\nabla_\alpha f_\alpha(x)] \quad (2)$$

Given such a gradient, the EBM can be learned via stochastic gradient ascent.

**Sampling from EBM.** Eqn. 2 requires sampling from the EBM $\pi_\alpha(x)$, which can be achieved via Markov Chain Monte Carlo (MCMC) sampling, such as Langevin dynamics [28]. Specifically, to sample from the EBM, Langevin dynamics iteratively updates

$$x_{\tau+1} = x_\tau + s\,\nabla_{x_\tau} \log \pi_\alpha(x_\tau) + \sqrt{2s}\,U_\tau, \quad (3)$$

where $\tau$ indexes the time step, $s$ is the step size, and $U_\tau \sim \mathcal{N}(0, I_D)$. As $s \to 0$ and $\tau \to \infty$, the distribution of $x_\tau$ will converge to the target distribution $\pi_\alpha(x)$ regardless of the initial distribution of $x_0$ [28].
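To make the Langevin update in Eqn. 3 concrete, below is a minimal PyTorch-style sketch of sampling from the EBM. The function name, step size, and step count are illustrative assumptions rather than the paper's exact settings; note that $\nabla_x \log \pi_\alpha(x) = \nabla_x f_\alpha(x)$, so the intractable partition function never enters the update.

```python
import torch

def langevin_sample_x(f_alpha, x_init, n_steps=100, step_size=0.01):
    """Langevin sampling from pi_alpha(x) proportional to exp(f_alpha(x)), following Eqn. 3:
    x_{t+1} = x_t + s * grad_x f_alpha(x_t) + sqrt(2s) * noise."""
    x = x_init.clone().detach()
    for _ in range(n_steps):
        x.requires_grad_(True)
        # Summing energies over the batch keeps per-sample gradients independent.
        grad = torch.autograd.grad(f_alpha(x).sum(), x)[0]
        x = (x + step_size * grad
             + (2.0 * step_size) ** 0.5 * torch.randn_like(x)).detach()
    return x

# Noise initialization (the practice discussed next); f_alpha maps images to scalar energies.
# x0 = torch.randn(64, 3, 32, 32)
# samples = langevin_sample_x(f_alpha, x0)
```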
The existing practice [29, 30, 5] adopts a non-informative distribution for $x_0$, such as a unit Gaussian or a uniform distribution, to initialize the Langevin transition, but this can be extremely inefficient and ineffective, as such chains usually take a long time to mix between different modes and can be unstable in practice [43]. The ability to generate efficient and effective samples from the model distribution becomes the key step toward training successful EBMs. In this paper, we study the complementary model, i.e., the generator model, as an informative initializer for effective yet efficient MCMC exploration toward better EBM training.

### 2.2 Generator Model

Various works [15, 12, 17, 24] have explored the use of the generator as an amortized sampler to replace the costly noise-initialized MCMC sampling for EBM training. Such a learning approach relies on samples directly drawn from complementary models, which can be less accurate than iterative MCMC sampling as it lacks a fine-grained exploration of the energy landscape. [42, 43] propose a cooperative scheme in which MCMC sampling of the EBM is initialized by generated samples from the generator model. However, the generator model has no access to the observed training examples, and such biased generator learning makes the EBM sampling ineffective and renders limited model training. In this paper, the generator model is learned by MLE to match both the empirical data distribution and the EBM distribution. Such training ensures a stronger generator model, which further facilitates more effective EBM sampling and learning. We present the background of the generator model and its MLE learning algorithm below, which shall serve as the foundation of our proposed method.

**Generator model.** Let $z \in \mathbb{R}^d$ ($d < D$) be the low-dimensional latent variables. The generator model [14, 10, 21] seeks to explain the observation signal $x$ by a latent vector $z$ and can be specified as

$$p_\theta(x, z) = p(z)p_\theta(x|z), \quad (4)$$

where $p(z)$ is a known prior distribution such as a unit Gaussian, i.e., $p(z) = \mathcal{N}(0, I_d)$, and $p_\theta(x|z) = \mathcal{N}(g_\theta(z), \sigma^2 I_D)$ is the generation model specified by a neural network $g_\theta(\cdot)$ that maps from the latent space to the data space.

**Maximum likelihood estimation.** The MLE learning of the generator model computes the log-likelihood over the observed examples as $L_p(\theta) = \frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x^{(i)})$, where $p_\theta(x) = \int_z p_\theta(x, z)\,dz$ is the marginal distribution. If the sample size $n$ is large, it is equivalent to minimizing the KL divergence $KL(p_d(x)\,\|\,p_\theta(x))$. The gradient of the likelihood $L_p(\theta)$ can be obtained via

$$\max_\theta L_p(\theta) = \min_\theta KL(p_d(x)\,\|\,p_\theta(x)), \text{ where } \nabla_\theta L_p(\theta) = \mathbb{E}_{p_d(x)p_\theta(z|x)}[\nabla_\theta \log p_\theta(x, z)] \quad (5)$$

With such a gradient, the generator model can be learned via stochastic gradient ascent.

**Sampling from generator posterior.** Eqn. 5 requires sampling from the generator posterior $p_\theta(z|x)$. One can use MCMC sampling such as Langevin dynamics [28] that iterates

$$z_{\tau+1} = z_\tau + s\,\nabla_{z_\tau} \log p_\theta(z_\tau|x) + \sqrt{2s}\,U_\tau, \quad (6)$$

where $U_\tau \sim \mathcal{N}(0, I_d)$. Such a Langevin process is an explaining-away inference where the latent factors compete with each other to explain each training example. As $s \to 0$ and $\tau \to \infty$, the distribution of $z_\tau$ will converge to the posterior $p_\theta(z|x)$ regardless of the initial distribution of $z_0$ [28]. However, noise-initialized Langevin dynamics [14, 31] can be ineffective in traversing the latent space and hard to mix. In this paper, we introduce a complementary model, i.e., the inference model, as an informative initializer for effective yet efficient latent-space MCMC exploration for better generator and EBM training.
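For completeness, a matching sketch of the latent-space update in Eqn. 6: since $p_\theta(z|x) \propto p(z)p_\theta(x|z)$, the Langevin gradient is simply $\nabla_z \log p_\theta(x, z)$, which autograd computes through the generator network. The generator interface `g_theta` and the value of `sigma` are assumptions for illustration.

```python
import torch

def langevin_sample_z(g_theta, x, z_init, n_steps=10, step_size=0.1, sigma=0.3):
    """Langevin sampling from the generator posterior p_theta(z|x), following Eqn. 6.
    With p(z) = N(0, I) and p_theta(x|z) = N(g_theta(z), sigma^2 I),
    log p_theta(x, z) = -0.5*||z||^2 - ||x - g_theta(z)||^2 / (2*sigma^2) + const."""
    z = z_init.clone().detach()
    for _ in range(n_steps):
        z.requires_grad_(True)
        log_joint = (-0.5 * (z ** 2).sum()
                     - ((x - g_theta(z)) ** 2).sum() / (2.0 * sigma ** 2))
        grad = torch.autograd.grad(log_joint, z)[0]  # grad_z log p_theta(x, z)
        z = (z + step_size * grad
             + (2.0 * step_size) ** 0.5 * torch.randn_like(z)).detach()
    return z
```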
### 2.3 Inference Model

The inference model $q_\phi(z|x)$ is adopted in VAEs [21, 32] as an amortized sampler to bypass the costly noise-initialized latent-space MCMC sampling. In VAEs, $q_\phi(z|x)$ is Gaussian parameterized, i.e., $\mathcal{N}(\mu_\phi(x), V_\phi(x))$, where $\mu_\phi(x)$ is the $d$-dimensional mean vector and $V_\phi(x)$ is the $d$-dimensional diagonal covariance matrix. Such a Gaussian-parameterized inference model is a tractable approximation to the true generator posterior $p_\theta(z|x)$, but can be limited in approximating a multi-modal posterior. We adopt the same Gaussian parameterization of $q_\phi(z|x)$ in this paper, but unlike VAEs, our inference model serves as an initializer network that jump-starts the latent MCMC sampling from an informative initialization. The marginal distribution obtained after the Langevin transition can be more general and multi-modal than the Gaussian distribution.

## 3 Methodology

To effectively learn the EBM, we propose a joint learning framework that interweaves maximum likelihood learning algorithms for both the EBM and its complementary models. For the MLE learning of the EBM, MCMC sampling can be initialized through the complementary generator model, while for the MLE learning of the generator model, the latent MCMC sampling can be initialized by the complementary inference model. Three models are seamlessly integrated into our joint framework and are learned through dual-MCMC teaching.

### 3.1 Dual-MCMC Sampling

The EBM $\pi_\alpha(x)$, the generator $p_\theta(x)$, and the inference model $q_\phi(z|x)$ defined in Sec. 2 naturally specify three densities on the joint space $(x, z)$, i.e.,

$$P_\theta(x, z) = p_\theta(x|z)p(z), \quad \Pi_{\alpha,\phi}(x, z) = \pi_\alpha(x)q_\phi(z|x), \quad Q_\phi(x, z) = p_d(x)q_\phi(z|x)$$

The generator density $P_\theta(x, z)$ specifies the joint density through ancestral generator sampling from prior latent vectors. Both the joint EBM density $\Pi_{\alpha,\phi}(x, z)$ and the data density $Q_\phi(x, z)$ include the inference model $q_\phi(z|x)$ to bridge the marginal distribution to the joint $(x, z)$ space. However, $q_\phi(z|x)$ is modeled and learned from two different perspectives: one on the empirical data distribution $p_d(x)$ for real data inference, and one on the EBM density $\pi_\alpha(x)$ for generated sample inference.

The joint learning schemes [15, 12, 17, 24] based on these joint distributions can be limited, because 1) the generator samples from $P_\theta(x, z)$ are conditionally Gaussian distributed (Sec. 2.2), which can be ineffective in capturing the high-dimensional, multi-modal empirical data distribution, and 2) the inference model $q_\phi(z|x)$ on observed training examples is assumed to be conditionally Gaussian distributed (Sec. 2.3), which is incapable of explaining-away inference [14]. To address the above limitations, we introduce two joint distributions that incorporate MCMC sampling as revision processes,

$$P_{\theta,\alpha}(x, z) = \mathcal{T}^x_\alpha p_\theta(x|z)p(z), \quad Q_{\phi,\theta}(x, z) = p_d(x)\,\mathcal{T}^z_\theta q_\phi(z|x)$$

where $\mathcal{T}^z_\theta(\cdot)$ denotes the Markov transition kernel of finite-step Langevin dynamics that samples $z$ from $p_\theta(z|x)$ (see Eqn. 6), and $\mathcal{T}^x_\alpha(\cdot)$ denotes the transition kernel that samples $x$ from $\pi_\alpha(x)$ as shown in Eqn. 3. Therefore, $\mathcal{T}^x_\alpha p_\theta(x) = \int \mathcal{T}^x_\alpha(x|x')\,p_\theta(x', z)\,dz\,dx'$ indicates the marginal distribution of $x$ obtained by running the MCMC transition $\mathcal{T}^x_\alpha(\cdot)$ initialized from $p_\theta(x)$. Similarly, $\mathcal{T}^z_\theta q_\phi(z|x) = \int \mathcal{T}^z_\theta(z|z')\,q_\phi(z'|x)\,dz'$ represents the marginal distribution of $z$ obtained by running $\mathcal{T}^z_\theta(\cdot)$ initialized from $q_\phi(z|x)$ given the observation $x$.
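A minimal sketch of how samples from the two MCMC-revised joint densities could be drawn in practice, reusing `langevin_sample_x` and `langevin_sample_z` from the sketches above. The interfaces (a generator `g_theta` mapping latents to images, an inference network `q_phi` returning a Gaussian mean and log-variance) and the step counts `k_x`, `k_z` are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def sample_revised_generator_density(f_alpha, g_theta, batch_size, z_dim, k_x=30, sigma=0.3):
    """Draw (x, z) ~ P_{theta,alpha}(x, z) = T^x_alpha p_theta(x|z) p(z)."""
    z = torch.randn(batch_size, z_dim)               # z ~ p(z) = N(0, I_d)
    mean = g_theta(z).detach()
    x0 = mean + sigma * torch.randn_like(mean)       # x0 ~ p_theta(x|z)
    x = langevin_sample_x(f_alpha, x0, n_steps=k_x)  # EBM-guided revision T^x_alpha
    return x, z

def sample_revised_data_density(g_theta, q_phi, x_obs, k_z=10):
    """Draw (x, z) ~ Q_{phi,theta}(x, z) = p_d(x) T^z_theta q_phi(z|x)."""
    mu, log_var = q_phi(x_obs)                              # assumed Gaussian inference network
    z0 = mu + (0.5 * log_var).exp() * torch.randn_like(mu)  # z0 ~ q_phi(z|x)
    z = langevin_sample_z(g_theta, x_obs, z0, n_steps=k_z)  # generator-guided revision T^z_theta
    return x_obs, z
```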
The density $P_{\theta,\alpha}(x, z)$, as a revised generator density, is more expressive on the $x$-space than $P_\theta(x, z)$, as the generated samples from $p_\theta(x)$ are refined via the EBM-guided MCMC sampling. The density $Q_{\phi,\theta}(x, z)$, as a revised data density, can be more expressive on the $z$-space than $Q_\phi(x, z)$, as the latent samples from $q_\phi(z|x)$ are revised via the generator-guided, explaining-away MCMC inference. These MCMC-revised joint densities will be used for better EBM training, while at the same time, they will guide and teach the generator and inference model to better initialize and facilitate the MCMC samplings.

We jointly train the three models within a probabilistic framework based on KL divergences between joint densities. We present below our learning algorithm in an alternating and iterative manner, where the new model parameters are updated based on the current model parameters. We present the full learning algorithm in the Appendix.

### 3.2 Learning Energy-based Model

Learning the EBM is based on the minimization of KL divergences as

$$\min_\alpha \mathcal{D}_\pi(\alpha) = \min_\alpha KL(Q_{\phi_t,\theta_t}(x, z)\,\|\,\Pi_{\alpha,\phi}(x, z)) - KL(P_{\theta_t,\alpha_t}(x, z)\,\|\,\Pi_{\alpha,\phi}(x, z)),$$

$$\text{where } -\nabla_\alpha \mathcal{D}_\pi(\alpha) = \mathbb{E}_{p_d(x)}[\nabla_\alpha f_\alpha(x)] - \mathbb{E}_{\mathcal{T}^x_{\alpha_t} p_{\theta_t}(x)}[\nabla_\alpha f_\alpha(x)] \quad (7)$$

where $\alpha_t, \theta_t, \phi_t$ denote fixed copies of the EBM, generator, and inference model at the $t$-th step of the iterative algorithm, and the joint densities $Q_{\phi_t,\theta_t}$ and $P_{\theta_t,\alpha_t}$ are based on this current iteration. Comparing Eqn. 7 to Eqn. 2, we compute sampling from the EBM through the Langevin transition with the current $p_{\theta_t}(x)$ as an initializer, i.e., $\mathcal{T}^x_{\alpha_t} p_{\theta_t}(x)$. Such a generator-initialized MCMC is more effective and efficient compared to the noise-initialized transition $\mathcal{T}^x_{\alpha_t}(\epsilon_x)$, where $\epsilon_x \sim \mathcal{N}(0, I_D)$, that is used in the recent literature [29, 5, 6].

**MLE perturbation.** The above joint-space KL divergences are equivalent to the marginal version,

$$\min_\alpha \mathcal{D}_\pi(\alpha) = \min_\alpha KL(p_d(x)\,\|\,\pi_\alpha(x)) - KL(\mathcal{T}^x_{\alpha_t} p_{\theta_t}(x)\,\|\,\pi_\alpha(x)) \quad (8)$$

Additionally, $\mathcal{T}^x_{\alpha_t} p_{\theta_t}(x) \to \pi_{\alpha_t}(x)$ if $s \to 0$ and $\tau \to \infty$ (see Eqn. 3); thus Eqn. 8 amounts to an approximation of the MLE objective function with a KL perturbation term, i.e.,

$$KL(p_d(x)\,\|\,\pi_\alpha(x)) - KL(\pi_{\alpha_t}(x)\,\|\,\pi_\alpha(x)) \quad (9)$$

Such a surrogate form is more tractable than the MLE objective function, since the $\log Z(\alpha)$ term is canceled out. The EBM $\pi_\alpha(x)$ seeks to approach the data distribution $p_d(x)$ while escaping from its current version $\pi_{\alpha_t}(x)$, and thus can be treated as its own critic. The learning of the EBM can then be interpreted as self-adversarial learning [15, 40].

**Connection to variational learning.** It is also tempting to learn the EBM without MCMC sampling via the gradient $\mathbb{E}_{p_d(x)}[\nabla_\alpha f_\alpha(x)] - \mathbb{E}_{p_{\theta_t}(x)}[\nabla_\alpha f_\alpha(x)]$ (i.e., $\min_\alpha KL(Q_{\phi_t}(x, z)\,\|\,\Pi_{\alpha,\phi}(x, z)) - KL(P_{\theta_t}(x, z)\,\|\,\Pi_{\alpha,\phi}(x, z))$), which underlies variational joint learning [15, 4, 12, 24]. Compared to Eqn. 7, their generator serves as a direct sampler for the EBM, while we perform the EBM self-guided MCMC revision for more accurate samples.

### 3.3 Learning Generator Model via Dual-MCMC Teaching

As a complementary model for learning the EBM, the generator model becomes a key ingredient toward success. The generator model is learned through the minimization of KL divergences as

$$\min_\theta \mathcal{D}_p(\theta) = \min_\theta KL(Q_{\phi_t,\theta_t}(x, z)\,\|\,P_\theta(x, z)) + KL(P_{\theta_t,\alpha_t}(x, z)\,\|\,P_\theta(x, z)),$$

$$\text{where } -\nabla_\theta \mathcal{D}_p(\theta) = \mathbb{E}_{p_d(x)\mathcal{T}^z_{\theta_t} q_{\phi_t}(z|x)}[\nabla_\theta \log p_\theta(x, z)] + \mathbb{E}_{\mathcal{T}^x_{\alpha_t} p_{\theta_t}(x|z)p(z)}[\nabla_\theta \log p_\theta(x, z)] \quad (10)$$

where both $Q_{\phi_t,\theta_t}$ and $P_{\theta_t,\alpha_t}$ are based on the current iteration. The revised data density $Q_{\phi_t,\theta_t}$ teaches the generator to better match the empirical data observations through the first KL term, and the revised generator density $P_{\theta_t,\alpha_t}$ teaches the generator to better match the generated samples through the second KL term.
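A hedged sketch of the resulting parameter updates for Eqn. 7 and Eqn. 10, written as ordinary gradient-descent losses on the MCMC-revised samples produced above; the optimizers, `sigma`, and tensor shapes are assumptions. Descending $f_\alpha$ on revised samples while ascending it on data realizes the gradient in Eqn. 7, and maximizing $\log p_\theta(x, z)$ reduces, up to constants, to reconstruction terms because $\log p(z)$ does not depend on $\theta$.

```python
import torch

def ebm_update(f_alpha, opt_alpha, x_real, x_revised):
    """One EBM step for Eqn. 7: raise f_alpha on observed data,
    lower it on generator-initialized, MCMC-revised samples."""
    opt_alpha.zero_grad()
    loss = f_alpha(x_revised.detach()).mean() - f_alpha(x_real).mean()
    loss.backward()
    opt_alpha.step()

def generator_update(g_theta, opt_theta, x_real, z_revised, x_revised, z_prior, sigma=0.3):
    """One generator step for Eqn. 10: maximize log p_theta(x, z) on
    (x_real, z_revised) from the revised data density and
    (x_revised, z_prior) from the revised generator density."""
    opt_theta.zero_grad()
    recon_data = ((x_real - g_theta(z_revised.detach())) ** 2).flatten(1).sum(1).mean()
    recon_ebm = ((x_revised.detach() - g_theta(z_prior)) ** 2).flatten(1).sum(1).mean()
    loss = (recon_data + recon_ebm) / (2.0 * sigma ** 2)
    loss.backward()
    opt_theta.step()
```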
As we describe below, such a joint minimization scheme provides a tractable approximation of the generator learning with the marginal distributions, i.e., $\min_\theta KL(p_d(x)\,\|\,p_\theta(x)) + KL(\pi_{\alpha_t}(x)\,\|\,p_\theta(x))$, where the generator model $p_\theta(x)$ learns to match $p_d(x)$ on empirical data observations and to catch up with the current EBM density $\pi_{\alpha_t}(x)$ through the guidance of its generated samples.

**MLE perturbation on $p_d$.** Our generator model matches the empirical data distribution $p_d(x)$ through $KL(Q_{\phi_t,\theta_t}(x, z)\,\|\,P_\theta(x, z))$, which is equivalent to the marginal version that follows,

$$KL(Q_{\phi_t,\theta_t}(x, z)\,\|\,P_\theta(x, z)) = KL(p_d(x)\,\|\,p_\theta(x)) + \mathbb{E}_{p_d(x)}[KL(\mathcal{T}^z_{\theta_t} q_{\phi_t}(z|x)\,\|\,p_\theta(z|x))] \quad (11)$$

Given $s \to 0$ and $\tau \to \infty$, $\mathcal{T}^z_{\theta_t} q_{\phi_t}(z|x) \to p_{\theta_t}(z|x)$ (see Eqn. 6); the first KL term in Eqn. 10 thus approximates the true MLE objective function with an additional KL perturbation term, i.e.,

$$KL(p_d(x)\,\|\,p_\theta(x)) + \mathbb{E}_{p_d(x)}[KL(p_{\theta_t}(z|x)\,\|\,p_\theta(z|x))] \quad (12)$$

Such a surrogate form in the joint density upper-bounds (i.e., majorizes) the true MLE objective $KL(p_d(x)\,\|\,p_\theta(x))$ and can be more tractable, as it involves the complete-data model whose latent vector $z$ has been inferred at the current learning step. Minimizing the surrogate form in the iterative algorithm moves the generator $p_\theta(x)$ closer to the empirical $p_d(x)$ due to its majorization property [15].

**MLE perturbation on $\pi_{\alpha_t}$.** Our generator model is learned to catch up with the EBM $\pi_\alpha(x)$ through the second term $KL(P_{\theta_t,\alpha_t}(x, z)\,\|\,P_\theta(x, z))$. It is equivalent to the marginal version as

$$KL(P_{\theta_t,\alpha_t}(x, z)\,\|\,P_\theta(x, z)) = KL(\mathcal{T}^x_{\alpha_t} p_{\theta_t}(x)\,\|\,p_\theta(x)) + \mathbb{E}_{\mathcal{T}^x_{\alpha_t} p_{\theta_t}(x)}[KL(p_{\theta_t}(z|x)\,\|\,p_\theta(z|x))] \quad (13)$$

With $s \to 0$ and $\tau \to \infty$, $\mathcal{T}^x_{\alpha_t} p_{\theta_t}(x) \to \pi_{\alpha_t}(x)$ (see Eqn. 3), so the second KL term in Eqn. 10 approximates the MLE objective on $\pi_{\alpha_t}$ for the generator,

$$KL(\pi_{\alpha_t}(x)\,\|\,p_\theta(x)) + \mathbb{E}_{\pi_{\alpha_t}(x)}[KL(p_{\theta_t}(z|x)\,\|\,p_\theta(z|x))] \quad (14)$$

Such a surrogate in the joint density again upper-bounds (i.e., majorizes) the true MLE objective on generated samples, i.e., $KL(\pi_{\alpha_t}(x)\,\|\,p_\theta(x))$, and thus the generator $p_\theta(x)$ is updated to move closer to the EBM $\pi_{\alpha_t}(x)$ at the current iteration.

**Connection to variational learning.** Without MCMC inference, the generator model can be learned with the inference model to match the empirical data distribution, i.e., $\min_\theta KL(Q_{\phi_t}(x, z)\,\|\,P_\theta(x, z))$, which underlies VAEs [21, 32, 26, 37]. Compared to Eqn. 11, VAEs seek to minimize $KL(p_d(x)\,\|\,p_\theta(x)) + \mathbb{E}_{p_d(x)}[KL(q_{\phi_t}(z|x)\,\|\,p_\theta(z|x))]$, where $q_\phi(z|x)$ is assumed to be Gaussian distributed, which has limited capacity for generator learning.

**Connection to cooperative learning.** Cooperative learning schemes [42, 43] share a similar EBM training procedure but can be fundamentally different in generator learning. Their generators are learned through $\min_\theta KL(\Pi_{\alpha_t,\phi_t}(x, z)\,\|\,P_\theta(x, z))$ [43] or $\min_\theta KL(P_{\theta_t,\alpha_t}(x, z)\,\|\,P_\theta(x, z))$ [42]; however, the generators have no access to the empirical observations, which leads to biased and sub-optimal generator models.

### 3.4 Learning Inference Model via Dual-MCMC Teaching

The inference model $q_\phi(z|x)$ serves as a key component for generator learning, which will in turn facilitate the EBM training. In this paper, the inference model is learned by minimizing the KL divergences $\mathcal{D}_q(\phi)$ as

$$\min_\phi \mathcal{D}_q(\phi) = \min_\phi KL(Q_{\phi_t,\theta_t}(x, z)\,\|\,Q_\phi(x, z)) + KL(P_{\theta_t,\alpha_t}(x, z)\,\|\,\Pi_{\alpha,\phi}(x, z)),$$

$$\text{where } -\nabla_\phi \mathcal{D}_q(\phi) = \mathbb{E}_{p_d(x)\mathcal{T}^z_{\theta_t} q_{\phi_t}(z|x)}[\nabla_\phi \log q_\phi(z|x)] + \mathbb{E}_{\mathcal{T}^x_{\alpha_t} p_{\theta_t}(x,z)}[\nabla_\phi \log q_\phi(z|x)] \quad (15)$$

The revised data density $Q_{\phi_t,\theta_t}$ teaches the inference model on empirical data observations for better real data inference through the first KL term, and the revised generator density $P_{\theta_t,\alpha_t}$ teaches the inference model for better generated sample inference through the second KL term.
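Finally, a hedged sketch of the inference-model update for Eqn. 15, assuming (as in Sec. 2.3) that $q_\phi(z|x)$ is a diagonal Gaussian whose network returns a mean and log-variance; both expectation terms then become Gaussian log-likelihoods of the MCMC-revised latents (real-data side) and of the prior latents paired with the revised samples (generated side). All names are illustrative assumptions.

```python
import math
import torch

def gaussian_log_prob(z, mu, log_var):
    """log N(z; mu, diag(exp(log_var))), summed over latent dimensions."""
    return -0.5 * (log_var + (z - mu) ** 2 / log_var.exp()
                   + math.log(2.0 * math.pi)).sum(dim=1)

def inference_update(q_phi, opt_phi, x_real, z_revised, x_revised, z_prior):
    """One inference-model step for Eqn. 15: maximize log q_phi(z|x) on
    (x_real, z_revised) from Q_{phi_t,theta_t} and (x_revised, z_prior) from P_{theta_t,alpha_t}.
    The MCMC-revised samples are treated as fixed targets (no gradient through MCMC)."""
    opt_phi.zero_grad()
    mu_r, log_var_r = q_phi(x_real)
    mu_g, log_var_g = q_phi(x_revised.detach())
    loss = -(gaussian_log_prob(z_revised.detach(), mu_r, log_var_r).mean()
             + gaussian_log_prob(z_prior, mu_g, log_var_g).mean())
    loss.backward()
    opt_phi.step()
```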
**Real data inference.** Optimizing the first term $KL(Q_{\phi_t,\theta_t}(x, z)\,\|\,Q_\phi(x, z))$ in Eqn. 15 is equivalent to $\min_\phi \mathbb{E}_{p_d(x)}[KL(\mathcal{T}^z_{\theta_t} q_{\phi_t}(z|x)\,\|\,q_\phi(z|x))]$. Given the long-run optimality condition of the MCMC transition, $\mathcal{T}^z_{\theta_t} q_{\phi_t}(z|x) \to p_{\theta_t}(z|x)$, our first term $KL(Q_{\phi_t,\theta_t}(x, z)\,\|\,Q_\phi(x, z))$ tends to learn $q_\phi(z|x)$ by minimizing $\mathbb{E}_{p_d(x)}[KL(p_{\theta_t}(z|x)\,\|\,q_\phi(z|x))]$. The inference model is thus learned to match the true generator posterior $p_\theta(z|x)$ on real observations at the current learning step. Specifically, latent samples are initialized from the current $q_{\phi_t}(z|x)$, and the generator-guided MCMC revision $\mathcal{T}^z_{\theta_t} q_{\phi_t}(z|x)$ is then performed to obtain the revised latent samples. The inference model is updated to amortize the MCMC and to absorb such sample revision. The MCMC revision not only drives the evolution of the latent samples, but also drives the evolution of the inference model.

**Generated sample inference.** Optimizing the second term $KL(P_{\theta_t,\alpha_t}(x, z)\,\|\,\Pi_{\alpha,\phi}(x, z))$ in Eqn. 15 is equivalent to $\min_\phi \mathbb{E}_{\mathcal{T}^x_{\alpha_t} p_{\theta_t}(x)}[KL(p_{\theta_t}(z|x)\,\|\,q_\phi(z|x))]$, which tends to minimize $\mathbb{E}_{\pi_{\alpha_t}(x)}[KL(p_{\theta_t}(z|x)\,\|\,q_\phi(z|x))]$ given the long-run optimality condition (i.e., $\mathcal{T}^x_{\alpha_t} p_{\theta_t}(x) \to \pi_{\alpha_t}(x)$). The inference model is learned to match the true generator posterior $p_\theta(z|x)$ on generated samples from the EBM at the current learning step. Note that both the generated sample and its latent factor are readily available: the latent factor $z$ is drawn from the prior distribution $p(z)$, which is assumed to be a unit Gaussian, and the generated sample is obtained directly from the generator.

## 4 Related Work

**Energy-based model.** The EBM is flexible and theoretically appealing, with various approaches to learning, such as noise-contrastive estimation (NCE) [1, 38] and the diffusion approach [8]. Most existing works learn the EBM via MLE [29, 30, 5, 6, 39], which typically involves MCMC sampling, while some advances [15, 16, 12, 17, 24] propose to amortize MCMC sampling with the generator model and learn the EBM in a closed-form, variational learning scheme. Instead, [42, 43] recruit ancestral Langevin dynamics with the generator model as the initializer model. In this paper, we propose a joint framework where the generator model matches both the EBM and the empirical data distribution through dual-MCMC teaching to better benefit EBM sampling and learning.

**Generator model.** In recent years, the success of generator models has given rise to various learning methods. The generative adversarial network (GAN) [10, 19, 27, 20] jointly trains the generator model with a discriminator, while the VAE [21, 32, 36, 9] is trained with an inference model (or encoder) approximating the generator posterior. Without the inference model, [14, 31] instead utilize MCMC sampling to sample from the generator posterior. Our work differs from theirs by employing MCMC inference with an informative initialization from the inference model, and we aim to learn the generator model to facilitate effective learning of the EBM.

## 5 Experiment

In this section, we address the following questions: (1) Can our method learn an EBM with high-quality synthesis? (2) Can both the complementary generator and inference model successfully match their MCMC-revised samples? (3) What is the influence of the inference model and the generator model? We refer to the Appendix for implementation details and additional experiments.

### 5.1 Image Modelling

We first evaluate the EBM in image data modelling.
Both the generator model and the EBM are learned to match the empirical data distribution, and if the generator model is well-trained, it can serve as an informative initializer model, making the EBM sampling easier. As a result, the EBM should be capable of generating realistic image synthesis. For evaluation, we generate images from the EBM by obtaining $x_0$ from the generator and running Langevin dynamics with $x_0$ as the initial points.

Table 1: FID and IS on CIFAR-10 and CelebA-64.

| Methods | CIFAR-10 IS (↑) | CIFAR-10 FID (↓) | CelebA-64 FID (↓) |
|---|---|---|---|
| Ours | 8.55 | 9.26 | 5.15 |
| Cooperative EBM [42] | 6.55 | 33.61 | 16.65 |
| Amortized EBM [43] | 6.65 | - | - |
| Divergence Triangle [15] | 7.23 | 30.10 | 18.21 |
| No MCMC EBM [12] | - | 27.5 | - |
| Short-run EBM [29] | 6.21 | - | 23.02 |
| IGEBM [5] | 6.78 | 38.2 | - |
| Improved CD EBM [6] | 7.85 | 25.1 | - |
| Diffusion EBM [8] | 8.30 | 9.58 | 5.98 |
| VAEBM [39] | 8.43 | 12.19 | 5.31 |
| NCP-VAE [1] | - | 24.08 | 5.25 |
| SNGAN [27] | 8.22 | 21.7 | 6.1 |
| StyleGANv2 w/o ADA [20] | 8.99 | 9.9 | 2.32 |
| NCSN [34] | 8.87 | 25.32 | 25.30 |
| DDPM [18] | 9.46 | 3.17 | 3.93 |

Table 2: FID on CelebA-HQ-256 and LSUN-Church-64.

| Methods | CelebA-HQ-256 | LSUN-Church-64 |
|---|---|---|
| Ours | 15.89 | 4.56 |
| Diffusion EBM [8] | - | 7.02 |
| VAEBM [39] | 20.38 | 13.51 |
| NCP-VAE [1] | 27.79 | - |
| GLOW [22] | 68.93 | 59.35 |
| PGGAN [19] | 21.7 | 6.1 |

Figure 1: Image synthesis on CIFAR-10.

We benchmark our method on standard datasets such as CIFAR-10 [23] and CelebA-64 [25], as well as the challenging high-resolution CelebA-HQ-256 [19] and large-scale LSUN-Church-64 [44]. We consider the baseline models, including Divergence Triangle [15], No MCMC EBM [12], Cooperative EBM [42], and Amortized EBM [43], as well as modern advanced generative models, including other EBMs [29, 6, 8, 39, 1], GANs [27, 20], and score-based models [34, 18]. We adopt the Fréchet Inception Distance (FID) and Inception Score (IS) metrics to evaluate the quality of image synthesis. Results are reported in Tab. 1 and Tab. 2, where our EBM shows the capability of generating realistic image synthesis and renders competitive performance even compared to GANs and score-based models.

### 5.2 MCMC Revision

The complementary generator and inference models are learned to match their MCMC-revised samples and thus can serve as informative initializers. We demonstrate that both the generator and inference model can successfully catch up with the MCMC revision. We train our model on CelebA-64 using $k_x = 30$ Langevin steps for the MCMC revision on $x$ and $k_z = 10$ steps for the MCMC revision on $z$.

**Generator model.** If the generator model captures different modes of the EBM, the MCMC revision on $x$ should only need to search around the local mode and correct pixel-level details. To examine the generator model, we visualize the Langevin transition by drawing $x_i$ every three steps, from the generated samples $x_0$ to the MCMC-revised samples $x_k$. As shown in Fig. 2, only minor changes can be observed during the transition, suggesting that the generator has matched the EBM-guided MCMC revision. Measuring the FID of $x_0$ and $x_k$, it still improves from 5.94 to 5.15, which indicates pixel-level refinements.

**Inference model.** We then show the Langevin transition on $z$. For visualization, latent codes are mapped to the data space via the generation network. We draw $z_i$ for each step and show the corresponding images in Fig. 2, where the inference model also catches up with the generator-guided, explaining-away MCMC inference, leading to faithful reconstruction as a result.

Figure 2: Left top: MCMC revision on $x$. The leftmost images are sampled from the generator model, and the rightmost images are at the final step of the EBM-guided MCMC sampling.
Left bottom: Energy profile over steps. Right top: MCMC revision on $z$. The leftmost images are reconstructed from latent codes inferred by the inference model, and the rightmost images are reconstructed from latent codes at the final step of the generator-guided MCMC inference. Right bottom: Mean Squared Error (MSE) over steps.

### 5.3 Analysis of Inference Model

The inference model serves as an initializer model for generator learning, which in turn facilitates the EBM sampling and learning. To demonstrate the benefit of the inference model, we adopt noise-initialized Langevin dynamics for generator posterior sampling and compare it with the Langevin dynamics initialized by the inference model.

Figure 3: MSE (↓), SSIM (↑), and PSNR (↑).

Table 3: Comparison of MSE. Inf+L=10 denotes Langevin dynamics initialized by the inference model for $k_z = 10$ steps.

| Methods | CIFAR-10 | CelebA-64 |
|---|---|---|
| VAE [21] | 0.0341 | 0.0438 |
| WAE [36] | 0.0291 | 0.0237 |
| RAE [9] | 0.0231 | 0.0246 |
| ABP [14] | 0.0183 | 0.0277 |
| SR-ABP [31] | 0.0262 | 0.0330 |
| Cooperative EBM [42] | 0.0271 | 0.0387 |
| Divergence Triangle [15] | 0.0237 | 0.0281 |
| Ours (Inf) | 0.0214 | 0.0227 |
| Ours (Inf+L=10) | 0.0072 | 0.0164 |

Specifically, we run noise-initialized Langevin dynamics with increasing steps from $k_z = 10$ to $k_z = 30$, and compare with Langevin dynamics using only $k_z = 10$ steps but initialized by the inference model. We use MSE, Peak Signal-to-Noise Ratio (PSNR), and Structural SIMilarity (SSIM) to measure the inference accuracy of the reconstruction and present the results in Fig. 3. As the Langevin steps increase, the inference becomes more accurate (lower MSE, higher PSNR and SSIM); however, it is still less accurate than the proposed method (L=30 vs. Inf+L=10). This result highlights the benefit of the inference model in our framework. We then compare with other models that also characterize an inferential mechanism, such as the VAE [21], Wasserstein auto-encoders (WAE) [36], RAE [9], Alternating Back-propagation (ABP) [14], and Short-run ABP (SR-ABP) [31]. As shown in Tab. 3, our model renders superior performance with faithful reconstruction.

### 5.4 Analysis of Generator Model

With the generator model being the initializer for EBM sampling, exploring the energy landscape should become easier by first traversing the low-dimensional latent space. We intend to examine whether our generator model can deliver smooth interpolation in the latent space, thus making a smooth transition in the data space. We employ linear interpolation in the latent space, i.e., $z = (1 - \alpha)\,z_1 + \alpha\,z_2$, and consider two scenarios: image synthesis and image reconstruction. As shown in Fig. 4, our generator model is capable of smooth interpolation in both scenarios, which suggests its effectiveness in exploring the energy landscape.

Figure 4: Linear interpolation in latent space. The top and bottom three rows indicate image generation and reconstruction, respectively.

## 6 Ablation Studies

**MCMC steps of $\mathcal{T}^z_\theta$.** We analyze the impact of the inference accuracy in our framework by increasing the Langevin steps of $\mathcal{T}^z_\theta q_\phi(z|x)$. With an inference model initializing the MCMC posterior sampling, further increasing the MCMC steps should deliver more accurate inference and thus benefit the generator and EBM for better performance. We therefore compute the FID, MSE, and wall-clock training time (seconds per iteration) in Tab. 4. Increasing the MCMC steps from 10 to 30 indeed slightly improves the generation quality and inference accuracy but requires more training time. We thus report the result of Inf+L=10 in Tab. 1 and Tab. 3.
**MCMC steps of $\mathcal{T}^x_\alpha$.** Next, we discuss the impact of the Langevin steps of $\mathcal{T}^x_\alpha$. Increasing the MCMC steps of $\mathcal{T}^x_\alpha$ should explore the energy landscape more effectively and render better generation performance. In Tab. 5, increasing the MCMC steps from 10 to 30 largely improves the generation quality, while further increasing to L=50 steps yields only minor improvement. Thus, we report L=30 steps in Tab. 1.

Table 4: Increasing MCMC steps of $\mathcal{T}^z_\theta$.

| | L=10 | L=30 | Inf+L=10 | Inf+L=30 |
|---|---|---|---|---|
| FID | 17.32 | 14.51 | 9.26 | 9.18 |
| MSE | 0.0214 | 0.0164 | 0.0072 | 0.0068 |
| Time (s) | 1.576 | 2.034 | 1.594 | 2.112 |

Table 5: Increasing MCMC steps of $\mathcal{T}^x_\alpha$.

| | L=10 | L=20 | L=30 | L=50 |
|---|---|---|---|---|
| FID | 14.78 | 11.51 | 9.26 | 9.07 |
| Time (s) | 0.861 | 1.241 | 1.594 | 2.454 |

## 7 Conclusion

We present a joint learning scheme that can effectively learn the EBM by interweaving the maximum likelihood learning of the EBM, generator, and inference model through dual-MCMC teaching. The generator and inference model are learned to initialize the MCMC sampling of the EBM and the generator posterior, respectively, while the EBM-guided MCMC sampling and the generator-guided MCMC inference, in turn, serve as two MCMC revision processes that are capable of teaching the generator and inference model. This work may share the limitation of other MCMC-based methods in terms of computational cost, but we expect it to contribute to the active research on learning EBMs.

## References

[1] Jyoti Aneja, Alex Schwing, Jan Kautz, and Arash Vahdat. A contrastive learning approach for training variational autoencoder priors. Advances in Neural Information Processing Systems, 34:480–493, 2021.

[2] Jiali Cui, Ying Nian Wu, and Tian Han. Learning joint latent space EBM prior model for multi-layer generator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3603–3612, June 2023.

[3] Jiali Cui, Ying Nian Wu, and Tian Han. Learning hierarchical features with joint latent space energy-based prior. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2218–2227, October 2023.

[4] Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard H. Hovy, and Aaron C. Courville. Calibrating energy-based generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=SyxeqhP9ll.

[5] Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.

[6] Yilun Du, Shuang Li, Joshua Tenenbaum, and Igor Mordatch. Improved contrastive divergence training of energy based models. arXiv preprint arXiv:2012.01316, 2020.

[7] Marylou Gabrié, Grant M Rotskoff, and Eric Vanden-Eijnden. Adaptive Monte Carlo augmented with normalizing flows. Proceedings of the National Academy of Sciences, 119(10):e2109420119, 2022.

[8] R Gao, Y Song, B Poole, YN Wu, and DP Kingma. Learning energy-based models by diffusion recovery likelihood. In International Conference on Learning Representations (ICLR 2021), 2021.

[9] Partha Ghosh, Mehdi SM Sajjadi, Antonio Vergari, Michael Black, and Bernhard Schölkopf. From variational to deterministic autoencoders. arXiv preprint arXiv:1903.12436, 2019.

[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
[11] Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263, 2019.

[12] Will Sussman Grathwohl, Jacob Jin Kelly, Milad Hashemi, Mohammad Norouzi, Kevin Swersky, and David Duvenaud. No MCMC for me: Amortized sampling for fast and stable training of energy-based models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=ixpSxO9flk3.

[13] Louis Grenioux, Éric Moulines, and Marylou Gabrié. Balanced training of energy-based models with adaptive flow sampling. arXiv preprint arXiv:2306.00684, 2023.

[14] Tian Han, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. Alternating back-propagation for generator network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.

[15] Tian Han, Erik Nijkamp, Xiaolin Fang, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Divergence triangle for joint training of generator model, energy-based model, and inferential model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8670–8679, 2019.

[16] Tian Han, Erik Nijkamp, Linqi Zhou, Bo Pang, Song-Chun Zhu, and Ying Nian Wu. Joint training of variational auto-encoder and latent energy-based model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[17] Mitch Hill, Erik Nijkamp, Jonathan Craig Mitchell, Bo Pang, and Song-Chun Zhu. Learning probabilistic models from generator latent spaces with hat EBM. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=AluQNIIb_Zy.

[18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

[19] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.

[20] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems, 33:12104–12114, 2020.

[21] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[22] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. Advances in Neural Information Processing Systems, 31, 2018.

[23] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[24] Rithesh Kumar, Sherjil Ozair, Anirudh Goyal, Aaron Courville, and Yoshua Bengio. Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508, 2019.

[25] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. CoRR, abs/1411.7766, 2014. URL http://arxiv.org/abs/1411.7766.

[26] Lars Maaløe, Marco Fraccaro, Valentin Liévin, and Ole Winther. BIVA: A very deep hierarchy of latent variables for generative modeling. Advances in Neural Information Processing Systems, 32, 2019.
[27] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

[28] Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

[29] Erik Nijkamp, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Learning non-convergent non-persistent short-run MCMC toward energy-based model. Advances in Neural Information Processing Systems, 32, 2019.

[30] Erik Nijkamp, Mitch Hill, Tian Han, Song-Chun Zhu, and Ying Nian Wu. On the anatomy of MCMC-based maximum likelihood learning of energy-based models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5272–5280, 2020.

[31] Erik Nijkamp, Bo Pang, Tian Han, Linqi Zhou, Song-Chun Zhu, and Ying Nian Wu. Learning multilayer latent variable model via variational optimization of short run MCMC for approximate inference. In European Conference on Computer Vision, pages 361–378. Springer, 2020.

[32] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286. PMLR, 2014.

[33] Sergey Samsonov, Evgeny Lagutin, Marylou Gabrié, Alain Durmus, Alexey Naumov, and Eric Moulines. Local-global MCMC kernels: the best of both worlds. Advances in Neural Information Processing Systems, 35:5178–5193, 2022.

[34] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.

[35] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

[36] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.

[37] Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. Advances in Neural Information Processing Systems, 33:19667–19679, 2020.

[38] Zhisheng Xiao and Tian Han. Adaptive multi-stage density ratio estimation for learning latent space energy-based model. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/874a4d89f2d04b4bcf9a2c19545cf040-Abstract-Conference.html.

[39] Zhisheng Xiao, Karsten Kreis, Jan Kautz, and Arash Vahdat. VAEBM: A symbiosis between variational autoencoders and energy-based models. In International Conference on Learning Representations, 2020.

[40] Zhisheng Xiao, Qing Yan, and Yali Amit. EBMs trained with maximum likelihood are generator models trained with a self-adverserial loss. arXiv preprint arXiv:2102.11757, 2021.

[41] Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu. A theory of generative ConvNet. In International Conference on Machine Learning, pages 2635–2644. PMLR, 2016.

[42] Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, and Ying Nian Wu. Cooperative training of descriptor and generator networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(1):27–45, 2018.

[43] Jianwen Xie, Zilong Zheng, and Ping Li. Learning energy-based model with variational auto-encoder as amortized sampler. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10441–10451, 2021.
[44] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.