Published as a conference paper at ICLR 2022

MULTIMEASUREMENT GENERATIVE MODELS

Saeed Saremi¹,² & Rupesh Kumar Srivastava¹
¹NNAISENSE Inc. ²Redwood Center, UC Berkeley
{saeed,rupesh}@nnaisense.com

We formally map the problem of sampling from an unknown distribution with a density in ℝ^d to the problem of learning and sampling a smoother density in ℝ^{Md} obtained by convolution with a fixed factorial kernel: the new density is referred to as M-density and the kernel as multimeasurement noise model (MNM). The M-density in ℝ^{Md} is smoother than the original density in ℝ^d, easier to learn and sample from, yet for large M the two problems are mathematically equivalent since clean data can be estimated exactly given a multimeasurement noisy observation using the Bayes estimator. To formulate the problem, we derive the Bayes estimator for Poisson and Gaussian MNMs in closed form in terms of the unnormalized M-density. This leads to a simple least-squares objective for learning parametric energy and score functions. We present various parametrization schemes of interest, including one in which studying Gaussian M-densities directly leads to multidenoising autoencoders; this is the first theoretical connection made between denoising autoencoders and empirical Bayes in the literature. Samples in ℝ^d are obtained by walk-jump sampling (Saremi & Hyvärinen, 2019) via underdamped Langevin MCMC (walk) to sample from the M-density and the multimeasurement Bayes estimation (jump). We study permutation invariant Gaussian M-densities on MNIST, CIFAR-10, and FFHQ-256 datasets, and demonstrate the effectiveness of this framework for realizing fast-mixing stable Markov chains in high dimensions.

1 INTRODUCTION

Consider a collection of i.i.d. samples {x_i}_{i=1}^n, assumed to have been drawn from an unknown distribution with density p_X in ℝ^d.
An important problem in probabilistic modeling is the task of drawing independent samples from p_X, which has numerous potential applications. This problem is typically approached in two phases: approximating p_X, and drawing samples from the approximated density. In unnormalized models the first phase is approached by learning the energy function f_X associated with the Gibbs distribution p_X ∝ exp(−f_X), and for the second phase one must resort to Markov chain Monte Carlo methods, such as Langevin MCMC, which are typically very slow to mix in high dimensions. MCMC sampling is considered an art and we do not have black-box samplers that converge fast and are stable for complex (natural) distributions. The source of the problem is mainly attributed to the fact that the energy functions of interest are typically highly nonconvex.

A broad sketch of our solution to this problem is to model a smoother density in an M-fold expanded space. The new density, denoted by p(y) and called M-density, is defined in ℝ^{Md}, where the boldfaced y is a shorthand for (y_1, …, y_M). The M-density is smoother in the sense that its marginals p_m(y_m) are obtained by the convolution p_m(y_m) = ∫ p_m(y_m|x) p(x) dx with a smoothing kernel p_m(y_m|x), which for most of the paper we take to be the isotropic Gaussian: Y_m = X + N(0, σ_m² I_d). Although we bypass learning p(x), the new formalism allows for generating samples from p(x) since X can be estimated exactly given Y = y (for large M). To give a physical picture, our approach here is based on taking apart the complex manifold where the random variable X is concentrated in ℝ^d and mapping it to a smoother manifold in ℝ^{Md} where Y = (Y_1, …, Y_M) is now concentrated.

Smoothing a density with a kernel is a technique in nonparametric density estimation that goes back to Parzen (1962). In kernel density estimation, the estimator of p(x) is obtained by convolving the empirical measure with a kernel.
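The smoothing effect of a fixed-bandwidth convolution is easy to see numerically. The following sketch (a hypothetical 1d example in numpy, not taken from the paper's code) discretizes the single-measurement marginal p_m(y_m) = ∫ p_m(y_m|x) p(x) dx for a sharp bimodal p(x) and a fixed Gaussian kernel:

```python
import numpy as np

# Hypothetical 1d example: p(x) is a sharp two-component Gaussian mixture.
# Convolving it with a fixed Gaussian kernel (bandwidth sigma) yields the
# smoother single-measurement marginal p_m(y_m) = \int N(y_m; x, sigma^2) p(x) dx.
xs = np.linspace(-8, 8, 2001)
dx = xs[1] - xs[0]

def gauss(u, mu, s):
    return np.exp(-(u - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

p_x = 0.5 * gauss(xs, -2.0, 0.1) + 0.5 * gauss(xs, 2.0, 0.1)  # sharp, bimodal
sigma = 2.0
# p_y[i] = sum_j N(xs[i]; xs[j], sigma^2) p_x[j] dx  (discretized convolution)
kernel = gauss(xs[:, None], xs[None, :], sigma)
p_y = kernel @ p_x * dx

# The smoothed density is much flatter: its peak is far below that of p(x).
print(p_x.max(), p_y.max())
```

The bimodal structure of p(x) survives the convolution, but the landscape is an order of magnitude flatter, which is the sense in which p(y) is easier to learn and to sample.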
In that methodology, the kernel bandwidth (σ, for Gaussian kernels) is adjusted to estimate p(x) in ℝ^d given a collection of independent samples {x_i}_{i=1}^n. This estimator, like most nonparametric estimators, suffers from a severe curse of dimensionality (Wainwright, 2019). But what if the kernel bandwidth is fixed: how much easier is the problem of estimating p(y)? This question is answered in (Goldfeld et al., 2020), where they obtained the rate of convergence e^{O(d)} n^{−1/2} (measured using various distances), in remarkable contrast to the well-known n^{−1/d} rate for estimating p(x). This nonparametric estimation result is not directly relevant here, but it formalizes the intuition that learning p(y) = ∫ p(y|x) p(x) dx is a lot simpler than learning p(x).

With this motivation, we start with an introduction to the problem of learning unnormalized p(y), based on independent samples from p(x). This problem was formulated by Vincent (2011) using score matching (Hyvärinen, 2005). It was approached recently with the more fundamental methodology of empirical Bayes (Saremi & Hyvärinen, 2019). The idea is to use the Bayes estimator of X given Y = y, the study of which is at the root of the empirical Bayes approach to statistics (Robbins, 1956), in a least-squares objective. This machinery builds on the fact that the estimator x̂(y) = E[X|Y = y] can be expressed in closed form in terms of unnormalized p(y) (Sec. 3.1). For Gaussian kernels, the learning objective arrived at in (Saremi & Hyvärinen, 2019) is identical (up to a multiplicative constant) to the denoising score matching formulation (Vincent, 2011), but with new insights rooted in empirical Bayes, which is the statistical framework for denoising. The main problem with the empirical Bayes methodology is that p(x|y) remains unknown and cannot be sampled from. The estimator x̂(y) = E[X|Y = y] can be computed, but the concentration of the posterior p(x|y) around the mean is not in our control.
Our solution to this problem starts with an observation that is very intuitive from a Bayesian perspective: one can sharpen the posterior by simply taking more independent noisy measurements. This scheme is formalized by replacing p(y|x) with the factorial kernel

p(y|x) = ∏_{m=1}^M p_m(y_m|x), where y = (y_1, …, y_M), (1)

which we name multimeasurement noise model (MNM). Now, the object of interest is a different density, which we call M-density, obtained by convolving p(x) with the factorial kernel:

p(y) = ∫ p(y|x) p(x) dx. (2)

This formally maps the original problem of drawing samples from p(x) to drawing samples from p(y) for any fixed noise level, since the estimator of X given Y = y is asymptotically exact. We quantify this for Gaussian MNMs using the plug-in estimator (the empirical mean of measurements).

Smooth & Symmetric! Consider Gaussian MNMs with equal noise level σ in the regime of large σ, large M such that σ√(d/M) is small.¹ In that regime, the complex manifold associated with the data distribution is mapped to a very smooth symmetric manifold in a much higher dimensional space. The original manifold can be reconstructed via a single step by computing x̂(y). Due to equal noise levels, the manifold associated with the M-density is symmetric under the permutation group:

p(y_1, …, y_M) = p(y_{π(1)}, …, y_{π(M)}), (3)

where π is a permutation of indices (Fig. 1). Although we develop a general methodology for studying M-densities, in the later part of the paper we focus on permutation invariant Gaussian M-densities.

The paper is organized as follows. In Sec. 2, we derive Bayes estimators for Poisson and Gaussian MNMs. In Sec. 3, we present the least-squares objective for learning Gaussian M-densities. We also give a weaker formulation of the learning objective based on score matching. Sec. 4 is devoted to the important topic of parametrization, where we introduce the multidenoising autoencoder (MDAE), in which we formally connect M-densities to the DAE literature.
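Both claims above, the sharpening of the posterior with M and the permutation symmetry of Eq. 3, can be checked in a toy case. The sketch below assumes a standard normal prior X ~ N(0, 1) with d = 1 (our choice for illustration, not from the paper), for which the posterior variance σ²/(σ² + M) and the M-density N(0, 11ᵀ + σ²I) are available in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy case (our construction): X ~ N(0, 1) in d = 1 observed through M
# equal-noise Gaussian measurements Y_m = X + N(0, sigma^2).  The posterior
# p(x|y) is then Gaussian with Var(X|y) = sigma^2 / (sigma^2 + M), so more
# measurements sharpen the posterior even at large sigma.
sigma = 4.0
for M in (1, 64, 1024):
    print(M, sigma**2 / (sigma**2 + M))   # shrinks toward 0 as M grows

# Permutation invariance (Eq. 3): for equal noise levels the M-density is
# p(y) = N(y; 0, 11^T + sigma^2 I), which is symmetric under permuting
# the measurement channels.
M = 8
cov = np.ones((M, M)) + sigma**2 * np.eye(M)
y = rng.standard_normal(M)

def log_p(y):
    return -0.5 * y @ np.linalg.solve(cov, y)   # up to an additive constant

perm = rng.permutation(M)
print(log_p(y), log_p(y[perm]))   # equal (Eq. 3)
```

The posterior-variance line quantifies the regime in the text: at σ = 4 a single measurement barely improves on the prior, while M = 1024 pins X down almost exactly.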
DAEs have never been studied for factorial kernels and the emergence of MDAE as a generative model should be of wide interest. In addition, we introduce the metaencoder, formulated in an unnormalized latent variable model, which is mainly left as a side contribution. In Sec. 5, we present the sampling algorithm used in the paper. In Sec. 6, we present our experiments on MNIST, CIFAR-10, and FFHQ-256 datasets, which were focused on permutation invariant M-densities. The experiments are mainly of a qualitative nature, demonstrating the effectiveness of this method in generating fast mixing and very long Markov chains in high dimensions. Related works are discussed in Sec. 7, and we finish with concluding remarks.

¹The regime σ√(d/M) ≪ 1 is obtained in our analysis of the highly suboptimal plug-in estimator (Sec. 2.3).

Figure 1: (M-density) (a) A mixture of Gaussians in 1d. (b,c,d) The M-density (M = 2, σ_1 = σ_2), the corresponding log-density and score function are visualized (based on calculations in Appendix A).

Notation. The subscripts are dropped from densities and energy functions when it is clear from their arguments: p(y) = p_Y(y), p(y|x) = p_{Y|X=x}(y), f(y) = f_Y(y), etc. Bold fonts are reserved for multimeasurement random variables: Y = (Y_1, …, Y_M). The following are shorthand notations: [M] = {1, …, M} and ∇_m = ∇_{y_m}. Throughout, ∇ is the gradient with respect to inputs (in ℝ^{Md}), not parameters. The following convention is used regarding parametric functions: f_θ(·) = f(·; θ). Different parametrization schemes come with a different set of parameters, the collection of which we denote by θ. For all the datasets used in the paper, X takes values in the hypercube [0, 1]^d.

2 FORMALISM: MULTIMEASUREMENT BAYES ESTIMATORS

This work is based on generalizing the empirical Bayes methodology to MNMs.
It is well known that the least-squares estimator of X given Y = y (for any noise model) is the Bayes estimator:

x̂(y) = ∫ x p(y|x) p(x) dx / ∫ p(y|x) p(x) dx. (4)

Next we study this estimator à la Robbins (1956) for Poisson (the Poisson kernel was the first example studied in 1956) and Gaussian MNMs. In both cases the estimator x̂(y) is derived to be a functional of the joint density p(y). In addition, x̂(y) is invariant to scaling p(y) by a constant, therefore one can ignore the partition function in this estimation problem. This is the main appeal of this formalism. Analytical results for Poisson MNMs are included to demonstrate the generality of the new formalism, but we will not pursue it as a generative model in our experiments for technical reasons due to the challenges regarding sampling discrete distributions in high dimensions (see Remark 7).

2.1 POISSON MNM

Let X be a random variable taking values in ℝ⁺. The Poisson MNM is defined by:

p(y|x) = e^{−Mx} ∏_{l=1}^M x^{y_l}/y_l!, y_l ∈ ℕ.

The numerator in the r.h.s. of Eq. 4 is computed next. The measurement index m below is an arbitrary index in [M] used for absorbing x such that x p(y|x) has the same functional form as p(y|x):

∫ x p(y|x) p(x) dx = ∫ e^{−Mx} (y_m + 1) (x^{y_m+1}/(y_m+1)!) ∏_{l≠m} (x^{y_l}/y_l!) p(x) dx = (y_m + 1) p(y + 1_m),

where 1_m is defined as a vector whose component l is δ_{ml}. Using Eq. 4, it immediately follows

x̂(y) = (y_m + 1) p(y + 1_m)/p(y), m ∈ [M]. (5)

We emphasize that the dependency of x̂(y) on the noise channel (measurement index) m that appears on the right hand side of the expression above is only an artifact of the calculation (we observe this again for Gaussian MNMs). The result above holds for any measurement index m ∈ [M], therefore (y_m + 1) p(y + 1_m) = (y_{m′} + 1) p(y + 1_{m′}) for all m, m′ ∈ [M].

Example. We can derive the estimator x̂(y) analytically for p(x) = e^{−x}. We first derive p(y):²

p(y) = (Σ_l y_l)!/(∏_l y_l!) · (M + 1)^{−1−Σ_l y_l},

where the sums/products are over the measurement indices l ∈ [M]. Using Eq.
5, it follows

x̂(y) = (y_m + 1) p(y + 1_m)/p(y) = (y_m + 1) · (Σ_l y_l + 1)/(y_m + 1) · 1/(M + 1) = (Σ_l y_l + 1)/(M + 1).

As expected, one arrives at the same result by computing Eq. 5 for any measurement index m.

2.2 GAUSSIAN MNM

Let X be a random variable in ℝ^d. The Gaussian MNM is defined by:

p(y|x) = ∏_{m=1}^M p_m(y_m|x), where p_m(y_m|x) = N(y_m; x, σ_m² I_d). (6)

It follows (as in the Poisson MNM, m in the equation below is an arbitrary index in [M]):

σ_m² ∇_m p(y|x) = (x − y_m) p(y|x).

We multiply both sides of the above expression by p(x) and integrate over x. The derivative ∇_m (with respect to y_m) and the integration over x commute, and using Eq. 4 it follows

σ_m² ∇_m p(y) = x̂(y) p(y) − y_m p(y),

which we simplify by dividing both sides by p(y):

x̂(y) = y_m + σ_m² ∇_m log p(y), m ∈ [M]. (7)

This expression is the generalization of the known result due to Miyasawa (1961). As in the Poisson MNM, the result above holds for any m ∈ [M], therefore:

y_m + σ_m² ∇_m log p(y) = y_{m′} + σ_{m′}² ∇_{m′} log p(y) for all m, m′ ∈ [M]. (8)

Example. We also studied the M-density for Gaussian MNMs analytically. The calculations are insightful and give more intuitions on the new formalism. We refer to Appendix A for the results.

Remark 1 (x̂_θ^{(m)}(y) notation). For parametric M-densities, Eq. 8 is in general an approximation. We use the superscript m to emphasize that there are M ways to compute the parametric estimator: x̂_θ^{(m)}(y) = y_m + σ_m² ∇_m log p_θ(y).

Remark 2 (σ⊗M notation). Gaussian MNMs with the constant noise level σ are denoted by σ⊗M. For σ⊗M models, the M-density p(y) (resp. the score function ∇log p(y)) is invariant (resp. equivariant) with respect to the permutation group π : [M] → [M]. See Fig. 1 for an illustration.

2.3 CONCENTRATION OF THE PLUG-IN ESTIMATOR

The plug-in estimator of X is the empirical mean of the multiple noisy measurements, which we denote by x̂_mean(y) = M^{−1} Σ_m y_m. This estimator is highly suboptimal but its analysis in high dimensions for Gaussian MNMs is useful.
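Eq. 7 and the channel-independence of Eq. 8 can be verified numerically. The sketch below uses a hypothetical two-point prior (mass 1/2 on x = ±a, d = 1) with unequal noise levels, computing ∇_m log p(y) by finite differences and comparing each channel's estimate against the exact posterior mean:

```python
import numpy as np

rng = np.random.default_rng(2)

# Numerical check of Eq. 7 on a toy case (our construction, d = 1): p(x)
# puts mass 1/2 on x = -a and x = +a, and the M noise levels sigma_m differ.
a = 2.0
sig = np.array([0.5, 1.0, 2.0, 4.0])          # sigma_m, so M = 4
M = len(sig)
y = a + sig * rng.standard_normal(M)          # measurements of x = +a

def log_p(y):
    # log p(y) for the two-point prior, up to an additive constant
    comp = lambda mu: -np.sum((y - mu) ** 2 / (2 * sig ** 2))
    return np.logaddexp(comp(-a), comp(a))

# Exact posterior mean E[X | Y = y] for the two-point prior:
w_neg = np.exp(-np.sum((y + a) ** 2 / (2 * sig ** 2)))
w_pos = np.exp(-np.sum((y - a) ** 2 / (2 * sig ** 2)))
xhat_exact = a * (w_pos - w_neg) / (w_pos + w_neg)

# Eq. 7, one estimate per channel, with grad_m log p by central difference:
eps = 1e-5
xhats = []
for m in range(M):
    e = np.zeros(M); e[m] = eps
    grad_m = (log_p(y + e) - log_p(y - e)) / (2 * eps)
    xhats.append(y[m] + sig[m] ** 2 * grad_m)
print(xhats, xhat_exact)   # all M channel estimates agree (Eq. 8)
```

Note that log p enters only up to an additive constant, which is the point made above: the partition function never appears in the estimator.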
Due to the concentration of measure phenomenon (Tao, 2012) we have

‖x − x̂_mean(y)‖₂ ≈ σ_eff √d, (9)

where σ_eff ("eff" is for effective) is given by

σ_eff² = M^{−2} Σ_{m=1}^M σ_m². (10)

The calculation is straightforward since y_m = x + ε_m, where ε_m ~ N(0, σ_m² I_d). It follows: x − x̂_mean(y) = −M^{−1} Σ_m ε_m, which has the same law as N(0, σ_eff² I_d). This calculation shows that the estimator of X in ℝ^d concentrates at the true value at a worst-case rate O(√(d/M)) (consider replacing the sum in Eq. 10 by M σ_max²). This analysis is very conservative, as it ignores the correlations between the components of y in ℝ^{Md} (within and across noise channels), and one expects a (much) tighter concentration for the optimal estimator. In the next section, we present an algorithm to learn the multimeasurement Bayes estimator based on independent samples from p(x).

²The derivation is straightforward using ∫₀^∞ e^{−αx} x^β dx = α^{−1−β} β! for α > 0, β ∈ ℕ.

3 LEARNING GAUSSIAN M-DENSITIES

3.1 NEURAL EMPIRICAL BAYES

In this section, we focus on learning Gaussian M-densities using the empirical Bayes formalism. We closely follow the approach taken by Saremi & Hyvärinen (2019) and extend it to M-densities. The power of empirical Bayes lies in the fact that it is formulated in the absence of any clean data. This is reflected by the fact that x̂(y) is expressed in terms of p(y), which can in principle be estimated without observing samples from p(x) (Robbins, 1956); we generalized that to factorial kernels in Sec. 2. What if we start with independent samples {x_i}_{i=1}^n from p(x) and our goal is to learn p(y)? A key insight in neural empirical Bayes was that the empirical Bayes machinery can be turned on its head in the form of a Gedankenexperiment (Saremi & Hyvärinen, 2019): we can draw samples (indexed by j) from the factorial kernel y_ij ~ p(y|x_i) and feed the noisy data to the empirical Bayes experimenter (the word used in 1956). The experimenter's task (our task!)
is to estimate X, but since we observe X = x_i, the squared ℓ₂ norm ‖x_i − x̂_θ(y_ij)‖₂² serves as a signal to learn p(y), and also x̂(y) for unseen noisy data (as a reminder, the Bayes estimator is the least-squares estimator). Next, we present the least-squares objective to learn p(y) ∝ e^{−f(y)} for Gaussian M-densities. It is important to note that x̂(y) is expressed in terms of unnormalized p(y), without which we must estimate the partition function (or its gradient) during learning. Here, we can choose any expressive family of functions to parametrize x̂(y), which is key to the success of this framework. Using our formalism (Sec. 2.2), the Bayes estimator takes the following parametric form (see Remark 1):

x̂_θ^{(m)}(y) = y_m − σ_m² ∇_m f_θ(y), m ∈ [M]. (11)

There are therefore M least-squares learning objectives L^{(m)}(θ) = E_{(x,y)} L^{(m)}(x, y; θ), where

L^{(m)}(x, y; θ) = ‖x − x̂_θ^{(m)}(y)‖₂² (12)

that in principle need to be minimized simultaneously, since as a corollary of Eq. 8 we have:

L^{(m)}(x, y; θ*) = L^{(m′)}(x, y; θ*) for all m, m′ ∈ [M], x ∈ ℝ^d, y ∈ ℝ^{Md}. (13)

The balance between the M losses can be enforced during learning by using a softmax-weighted sum of them in the learning objective, effectively weighing the higher losses more in each update. However, in our parametrization schemes, coming up, simply taking the mean of the M losses as the learning objective proved to be sufficient for a balanced learning across all the noise channels:

L(θ) = M^{−1} Σ_{m=1}^M L^{(m)}(θ). (14)

The above learning objective is the one we use in the remainder of the paper.

3.2 DENOISING SCORE MATCHING

One can also study M-densities using score matching with the following objective (Hyvärinen, 2005):

J(θ) = E_y ‖∇_y f_θ(y) + ∇_y log p(y)‖₂².
(15)

In Appendix B, we show that the score matching learning objective is equal (up to an additive constant independent of θ) to the following multimeasurement denoising score matching (MDSM) objective:

Σ_{m=1}^M J_m(θ), where J_m(θ) = E_{(x,y)} ‖∇_m f_θ(y) + (y_m − x)/σ_m²‖₂². (16)

This is a simple extension of the result by Vincent (2011) to Gaussian MNMs. The MDSM objective and the empirical Bayes objective (Eq. 14) are identical (up to a multiplicative constant) for σ⊗M models.

Remark 3. Compared to neural empirical Bayes (NEB), denoising score matching (DSM) takes a very different approach regarding learning M-densities. DSM starts with score matching (Eq. 15). NEB starts with deriving the Bayes estimator of X given Y = y for a known kernel p(y|x) (Eq. 7). NEB is a stronger formulation in the sense that two goals are achieved at once: learning the M-density and learning x̂(y). What remains unknown in DSM is the latter, and knowing the estimator is key here. Without it, we cannot draw a formal equivalence between p_X and p_Y (see Sec. 1). We return to this discussion at the end of the paper from a different angle with regards to denoising autoencoders.

4 PARAMETRIZATION SCHEMES

We present three parametrization schemes for modeling Gaussian M-densities. Due to our interest in σ⊗M models we switch to a lighter notation. Here, the learning objective (Eq. 14) takes the form

L(θ) = M^{−1} E_{(x,y)~p(y|x)p(x)} ‖x^{⊗M} − y + σ² ∇f_θ(y)‖₂², (17)

where x^{⊗M} denotes (x, …, x) repeated M times. Parametrization schemes fall under two general groups: multimeasurement energy model (MEM) and multimeasurement score model (MSM). In what follows, MDAE is an instance of MSM; MEM2 & MUVB are (closely related) instances of MEM.

4.1 MDAE (MULTIDENOISING AUTOENCODER)

The rigorous approach to learning Gaussian M-densities is to parametrize the energy function f. In that parametrization, ∇f_θ is computed with automatic differentiation and used in the objective (Eq. 17).
That is a computational burden, but it comes with a major advantage, as the learned score function is guaranteed to be a gradient field. The direct parametrization of f is problematic due to this requirement, analyzed by Saremi (2019); see also (Salimans & Ho, 2021) for a recent discussion. Putting that debate aside, in MDAE we parametrize the score function explicitly, g_θ : ℝ^{Md} → ℝ^{Md}. Then, g_θ replaces −∇f_θ in Eq. 17. In particular, we consider the following reparametrization:

g_θ(y) := (ν_θ(y) − y)/σ², (18)

simply motivated by the fact that we would like to cancel y in Eq. 17, otherwise the loss starts at very high values at initialization. This is especially so in the regime of large σ, M, and d. It follows:

L(θ) = M^{−1} E_{(x,y)~p(y|x)p(x)} ‖x^{⊗M} − ν_θ(y)‖₂². (19)

This is very intriguing and it is worth taking a moment to examine the result: modeling M-densities is now formally mapped to denoising multimeasurement noisy data explicitly. It is a DAE loss with a multimeasurement twist. Crucially, due to our empirical Bayes formulation, the MDAE output ν_θ(y) becomes a parametrization of the Bayes estimator(s) (combine Eq. 18 and Eq. 11):

x̂^{(m)}(y; θ) = ν_m(y; θ). (20)

This makes a strong theoretical connection between our generalization of empirical Bayes to factorial kernels and a new learning paradigm under multidenoising autoencoders, valid for any noise level σ.

4.2 MEM2 (MEM WITH A QUADRATIC FORM WITH AN OPTIONAL METAENCODER)

Can we write down an energy function associated with the score function in Eq. 18? The answer is no, since g_θ is not a gradient field in general, but we can try the following (θ = (η, ζ)):

f_θ(y) := (1/2σ²) ‖y − ν_η(y)‖₂² + h_ζ(y, ν_η(y)). (21)

Ignoring h_ζ for now, by calculating −∇f_θ we do get both terms in Eq. 18, plus other terms. The function h_ζ, which we call the metaencoder, is optional here. Intuitively, it captures the higher order interactions between y and ν, beyond the quadratic term (see below for another motivation).
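The MDAE relations (Eqs. 18–20) can be traced end to end in a toy case where the optimal denoiser is known in closed form. The sketch below assumes X ~ N(0, 1) with d = 1 and a one-parameter linear "network" ν_θ(y)_m = θ Σ_l y_l (our construction standing in for the paper's U2-Net); the Bayes-optimal coefficient is 1/(M + σ²), and least squares on the MDAE loss recovers it from samples:

```python
import numpy as np

rng = np.random.default_rng(4)

# Sketch of the MDAE loss (Eq. 19) in a toy case with a known optimum (our
# construction: X ~ N(0, 1), d = 1, equal noise levels).  The "network" is
# a single scalar theta with nu_theta(y)_m = theta * sum_l y_l, so every
# output channel is the same estimate (Eq. 20).  The Bayes-optimal
# coefficient is 1 / (M + sigma^2).
M, sigma, n = 8, 2.0, 200_000
x = rng.standard_normal(n)
y = x[:, None] + sigma * rng.standard_normal((n, M))   # y_ij ~ p(y | x_i)
s = y.sum(axis=1)

# Least squares on the MDAE objective: all M channels share theta, so
# minimizing M^-1 sum_m E (x - theta * s)^2 is one scalar regression.
theta = (s @ x) / (s @ s)
print(theta, 1 / (M + sigma**2))             # both ~0.0833

nu = theta * s                               # any output channel of the MDAE
mse_bayes = np.mean((x - nu) ** 2)           # ~ sigma^2 / (M + sigma^2)
mse_plugin = np.mean((x - y.mean(1)) ** 2)   # ~ sigma^2 / M  (Sec. 2.3)
print(mse_bayes, mse_plugin)
```

The last two lines also illustrate the remark in Sec. 2.3: the learned Bayes estimator is strictly tighter than the plug-in mean of the measurements.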
The metaencoder is implicitly parametrized by η (via ν_η), while having its own set of parameters ζ.

4.3 MUVB (MULTIMEASUREMENT UNNORMALIZED VARIATIONAL BAYES)

The expression above (Eq. 21) is a simplification of an unnormalized latent variable model that one can set up, where we take the variational free energy to be the energy function. This builds on recent studies towards bringing together empirical Bayes and variational Bayes (Saremi, 2020a;b). We outline the big picture here and refer to Appendix C for details. Latent variable models operate on the principle of maximum likelihood, where one is obliged to have normalized models. Since our model is unnormalized, we consider setting up a latent variable model with an unnormalized conditional density. Essentially both terms in Eq. 21 arise by considering (z is the vector of latent variables)

p_{(η,ζ)}(y|z) ∝ exp(−(1/2σ²) ‖y − ν_η(z)‖₂² − h_ζ(y, ν_η(z))), (22)

named the metalikelihood, which further underscores the fact that it is unnormalized. The full expression for the energy function also involves the posterior q_φ(z|y). As a remark, note that what is shared between all three parametrization schemes is ν, although they are vastly different in how x̂(y) is computed.

5 SAMPLING ALGORITHM

Our sampling algorithm is an adaptation of walk-jump sampling (WJS) (Saremi & Hyvärinen, 2019). We run an MCMC algorithm to sample the M-density by generating a Markov chain of multimeasurement noisy samples. This is the walk phase of WJS, schematized by the dashed arrows in Fig. 2. At an arbitrary discrete time k, clean samples are generated by simply computing x̂(y_k) (θ is dropped for a clean notation). This is the jump phase of WJS, schematized by the solid arrow in Fig. 2.
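How exact is the jump? For a hypothetical two-point prior (mass 1/2 on x = ±a, d = 1, our construction with a closed-form estimator), even at very large noise the jump x̂(y) lands essentially on a clean data point once M is large:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy check of jump exactness (our construction): p(x) puts mass 1/2 on
# x = +-a (d = 1) and the noise is large (sigma = 8).  For this prior the
# Bayes estimator has the closed form
#   xhat(y) = a * tanh(a * sum_m y_m / sigma^2),
# and its error vanishes once sigma * sqrt(1/M) is small.
a, sigma = 1.0, 8.0
rmses = []
for M in (1, 64, 4096):
    x = a * rng.choice([-1.0, 1.0], size=1000)
    y = x[:, None] + sigma * rng.standard_normal((1000, M))
    xhat = a * np.tanh(a * y.sum(axis=1) / sigma**2)
    rmses.append(np.sqrt(np.mean((x - xhat) ** 2)))
print(rmses)   # RMSE shrinks with M
```

At M = 1 the estimator is essentially uninformative at this noise level, while at M = 4096 (where σ√(1/M) = 1/8) the jump recovers x almost exactly, which is the sense in which the scheme becomes an exact sampler for large M.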
What is appealing about multimeasurement generative models is the fact that for large M, p(x|y_k) is highly concentrated around its mean x̂(y_k), therefore this scheme is an exact sampling scheme; this was in fact our original motivation to study M-densities. For sampling the M-density (the walk phase), we consider Langevin MCMC algorithms that are based on discretizing the underdamped Langevin diffusion:

dv_t = −γ v_t dt − u ∇f(y_t) dt + √(2γu) dB_t,
dy_t = v_t dt. (23)

Here γ is the friction, u the inverse mass, and B_t the standard Brownian motion in ℝ^{Md}. Discretizing the Langevin diffusion and its analysis are challenging problems due to the non-smooth nature of the Brownian motion (Mörters & Peres, 2010). Significant progress has been made, however, in devising and analyzing Langevin MCMC algorithms; e.g., Cheng et al. (2018) introduced an algorithm with a mixing time that scales as O(√d), in notable contrast to the best known O(d) for MCMC algorithms based on the overdamped Langevin diffusion. The dimension dependence of the mixing time is of great interest here since we expand the dimension M-fold. To give an idea regarding the dimension, for the 4⊗8 model on FFHQ-256, Md ≈ 10⁶. We implemented (Cheng et al., 2018, Algorithm 1) in this paper, which to our knowledge is its first use for generative modeling. In addition, we used a Langevin MCMC algorithm due to Sachs et al. (2017), also used by Arbel et al. (2020). Note, in addition to γ and u, Langevin MCMC requires δ, the step size used for discretizing Eq. 23.

Figure 2: (WJS schematic) In this schematic, the Langevin walk is denoted by the dashed arrow. The jump is denoted by the solid arrow, which is deterministic. The jumps in WJS are asynchronous (Remark 5). In presenting long chains in the paper we show jumps at various frequencies denoted by Δk (Remark 6). We use the same MCMC parameters for all noise channels due to the permutation symmetry in σ⊗M models. See Appendix D for more details.
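A minimal walk-jump sampler can be written down for an analytically known M-density. The sketch below assumes X ~ N(0, 1) with d = 1 (so the score of p(y) is available in closed form via the estimator) and uses a BAOAB-type splitting of Eq. 23, a common discretization in the spirit of Sachs et al. (2017); it is a stand-in for the paper's Algorithm 1, not its implementation:

```python
import numpy as np

rng = np.random.default_rng(5)

# Walk-jump sampling sketch on an analytically known M-density (toy case:
# X ~ N(0, 1), d = 1, equal noise sigma, so p(y) = N(0, 11^T + sigma^2 I)).
# Walk: BAOAB discretization of Eq. 23.  Jump: the deterministic xhat(y_k).
M, sigma = 4, 1.0
delta, gamma, u = 0.1, 1.0, 1.0

def xhat(y):                    # Bayes estimator for this Gaussian toy case
    return y.sum() / (sigma**2 + M)

def score(y):                   # grad_m log p(y) = (xhat(y) - y_m)/sigma^2, Eq. 7
    return (xhat(y) - y) / sigma**2

c1 = np.exp(-gamma * delta)
c2 = np.sqrt((1 - c1**2) * u)   # velocities are N(0, u I) in equilibrium
y, v = rng.standard_normal(M), np.zeros(M)
ys, xs = [], []
for k in range(80_000):
    v += 0.5 * delta * u * score(y)            # B (force = score = -grad f)
    y += 0.5 * delta * v                       # A
    v = c1 * v + c2 * rng.standard_normal(M)   # O (exact OU step)
    y += 0.5 * delta * v                       # A
    v += 0.5 * delta * u * score(y)            # B
    if k >= 5_000:                             # after warmup
        ys.append(y.copy())
        xs.append(xhat(y))                     # the jump, at any desired k

ys, xs = np.array(ys), np.array(xs)
print(ys.mean(), ys.var())   # ~0 and ~1 + sigma^2
```

The chain only ever visits noisy data y; clean samples are read off by the single deterministic jump, exactly as in the schematic of Fig. 2.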
Figure 3: (WJS chains on FFHQ-256) (a) 4⊗4, MDAE. (b) 4⊗8, MDAE. The chains are shown skipping Δk = 10 steps (no warmup). We used Algorithm 1 with δ = 2, γ = 1/2, u = 1. Transitions are best viewed electronically.

Figure 4: (4⊗8 gallery) Samples from our FFHQ-256, MDAE, 4⊗8 model.

6 EXPERIMENTAL RESULTS

For all models with the MDAE and MEM2 parametrizations, we used the U2-Net (Qin et al., 2020) network architecture, a recent variant of UNet (Ronneberger et al., 2015), with a few simple changes (see Appendix E). For the MUVB parametrization for MNIST (evaluated in Sec. G.3), we used residual networks for the encoder, decoder and metaencoder. Despite the inherent complexity of the task of learning high-dimensional energy models, it is notable that our framework results in a single least-squares objective function, which we optimized using Adam (Kingma & Ba, 2014) without additional learning rate schedules. Additional details of the experimental setup are in Appendix E.

Arguably, the ultimate test for an energy model is whether one can generate realistic samples from it with a single fast-mixing MCMC chain that explores all modes of the distribution indefinitely, starting from random noise. We informally refer to them as lifelong Markov chains. To give a physical picture, a gas of molecules that has come in contact with a thermal reservoir does not stop mid-air after thermalizing (arriving at its first sample); it continues generating samples from the Boltzmann distribution as long as the physical system exists. To meet this challenge, the energy landscape must not have any pathologies that cause the sampling chain to break or get stuck in certain modes. In addition, we need fast-mixing (Langevin) MCMC algorithms, a very active area of research by itself.
To put this goal in context, in recent energy models for high-dimensional data (Xie et al., 2016; 2018; Nijkamp et al., 2019; Du & Mordatch, 2019; Zhao et al., 2020; Du et al., 2020; Xie et al., 2021), sampling using MCMC quickly breaks or collapses to a mode, and chains longer than a few hundred steps were not reported. Thus, evaluation in prior work relies on samples from independent MCMC chains, in addition to using heuristics like a replay buffer (Du & Mordatch, 2019). In this work, we report FID scores obtained by single MCMC chains, the first result of its kind, which we consider as a benchmark for future works on long-run MCMC chains (see Table 4 for numerical comparisons). For MNIST, we obtain fast-mixing chains for over 1M steps using MDAE and MUVB. On CIFAR-10 and FFHQ-256, we obtain stable chains up to 1M and 10K steps, respectively, using MDAE. The results are remarkable since energy models that learn a single energy/score function have never been scaled to 256×256 resolution images. The closest results are by Nijkamp et al. (2019) and Du et al. (2020) on CelebA (128×128); in particular, our results in Figs. 3, 4 can be compared to (Nijkamp et al., 2019, Figs. 1, 2). For CIFAR-10, we report the FID score of 43.95 for the 1⊗8 model, which we note is achieved by a single MCMC chain (Fig. 13); the closest FID score obtained for long-run MCMC is 78.12 by Nijkamp et al. (2022), which is not on single chains, but on several parallel chains (in that paper the long-run chains are 2K steps vs. 1M steps in this work).

Our detailed experimental results are organized in appendices as follows. Appendix F is devoted to an introduction to Langevin MCMC with demonstrations using our FFHQ-256 4⊗8 model. Appendix G is devoted to MNIST experiments. At first we compare 1/4⊗1 and 1⊗16 (Sec. G.1). These models are statistically identical regarding the plug-in estimators (Eq. 10) and they arrive at similar training losses, but are very different as generative models.
This sheds light on the question of why we considered such high noise levels in designing experiments (Fig. 5). We then present the effect of increasing M for a fixed σ, the issue of time complexity, and the mixing time vs. image quality trade-off (Sec. G.2). The discussions in Sec. G.1 and Sec. G.2 are closely related. We then compare MDAE and MUVB for 1⊗4, with the message that MDAE generates sharper samples while MUVB has better mixing properties (Sec. G.3). We then present one example of a lifelong Markov chain (Sec. G.4). Our MDAE model struggles on the CIFAR-10 challenge due to optimization problems. The results and further discussion are presented in Appendix H. Finally, in Appendix I we show chains obtained for FFHQ-256, including the ones that some of the images in Fig. 4 were taken from.

Figure 5: (single-step multidenoising) Clean data (x), a sample from the MNM (y) for the 4⊗4 model (Fig. 3a), the outputs of MDAE (ν), and x̂_mean(y) are visualized from left to right. The outputs ν_θ(y) correspond to the Bayes estimator(s), which as expected from theory (Eq. 8) are indistinguishable.

7 RELATED WORK

Despite being a fundamental framework for denoising, empirical Bayes has been surprisingly absent in the DAE/DSM literature. Raphan & Simoncelli (2011) first mentioned the connection between empirical Bayes and score matching, but this work is practically unknown in the literature; for much of its development, DSM was tied to DAEs (Vincent, 2011; Alain & Bengio, 2014); see (Bengio et al., 2013) for an extensive survey. Abstracting DSM away from DAEs is due to Saremi et al. (2018), where they directly parametrized the energy function. In this work, we closed the circle, from empirical Bayes to DAEs. One can indeed use the MDAE objective and learn a generative model: a highly generalized DAE naturally arises from empirical Bayes. The important subtlety here is, as we remarked in Sec.
3.2, we can only arrive at this connection starting from our generalization of empirical Bayes, not the other way around. Of course, this core multimeasurement aspect of our approach, without which we do not have a generative model, does not exist in the DAE literature.³

To address the problem of choosing a noise level in DSM (Saremi et al., 2018), Song & Ermon (2019) studied it with multiple noise levels by summing up the losses using a weighting scheme. See (Li et al., 2019; Chen et al., 2020; Song & Ermon, 2020; Kadkhodaie & Simoncelli, 2020; Jolicoeur-Martineau et al., 2020) in that direction. The learning objectives in these models are based on heuristics in how different noise levels are weighted. Denoising diffusion models (Ho et al., 2020; Song et al., 2020; Gao et al., 2020) follow the same philosophy while being theoretically sound. Sampling in these models is based on annealing or reversing a diffusion process. Our philosophy here is fundamentally different. Even noise levels have a very different meaning here, associated with the M noise channels of the factorial kernel. All noise levels (which we took to be equal in later parts, a meaningless choice in other methods) are encapsulated in a single energy/score function. Using Langevin MCMC we only sample highly noisy data y. Clean data in ℝ^d is generated via a single step by computing x̂(y).

8 CONCLUSION

We approached the problem of generative modeling by mapping a complex density in ℝ^d to a smoother dual density, named M-density, in ℝ^{Md}. Permutation invariance is a design choice which we found particularly appealing. Using factorial kernels for density estimation and generative modeling has never been explored before and this work should open up many new avenues. We believe the topic of parametrization will play an especially important role in future developments.
There is a unity in the parametrization schemes we proposed and studied in the paper, but much remains to be explored in understanding and extending the relationship between MEMs and MSMs. Our MDAE parametrization scheme (an instance of MSM) is especially appealing for its simplicity and its connections to the past literature. DAEs have a very rich history in the field of representation learning, but research on them as generative models stopped around 2015. Our hypothesis is that one must use very large noise to make them work as generative models for complex datasets in high dimensions. Our permutation invariant multimeasurement approach is an elegant solution since it allows for that choice at the cost of computation (large M), with only one parameter left to tune: σ! Crucially, the probabilistic interpretation is ingrained here, given by the algebraic relations between the MDAE output ν(y), the Bayes estimator x̂(y), and the score function ∇ log p(y).

3There are similarities between Eq. 5 in (Alain & Bengio, 2014) and Eq. 18 here. However, one is on barely noisy data (σ → 0) in Rd, the other on multimeasurements in RMd for any (high) noise level σ. There, they arrived at Eq. 5 (with some effort) starting from a DAE objective. Here, we start with Eq. 18 (it defines the score function) and arrive at the MDAE objective (Eq. 19) in one line of algebra.

ACKNOWLEDGEMENT

We would like to thank Aapo Hyvärinen, Francis Bach, and Faustino Gomez for their valuable comments on the manuscript, and Vojtech Micka for help in running experiments.

ETHICAL CONSIDERATIONS

By their very nature, generative models assign a high (implicit or explicit) likelihood to points in their training set. When samples are generated from a trained generative model, it is quite possible that some samples are very similar or identical to certain data points in the training set.
This is important to keep in mind when the data used to train a generative model is confidential or contains personally identifiable information, such as the pictures of faces in the FFHQ-256 dataset (Karras et al., 2019). Care should be taken to abide by license terms and ethical principles when using samples from generative models. For FFHQ-256 in particular, please see the terms of use at https://github.com/NVlabs/ffhq-dataset/blob/master/README.md. Additionally, we recommend considering the in-depth discussion on this subject by Prabhu & Birhane (2020) before deploying applications based on generative models.

REPRODUCIBILITY STATEMENT

All important details of the experimental setup are provided in Appendix E. Our code is publicly available at https://github.com/nnaisense/mems.

REFERENCES

Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data-generating distribution. The Journal of Machine Learning Research, 15(1):3563–3593, 2014. Philip W Anderson. More is different. Science, 177(4047):393–396, 1972. Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43, 2003. Michael Arbel, Liang Zhou, and Arthur Gretton. Generalized energy based models. arXiv preprint arXiv:2003.05033, 2020. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013. Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713, 2020. Xiang Cheng, Niladri S Chatterji, Peter L Bartlett, and Michael I Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. In Conference on Learning Theory, pp. 300–323. PMLR, 2018. Rewon Child. Very deep VAEs generalize autoregressive models and can outperform them on images.
arXiv preprint arXiv:2011.10650, 2020. Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019. Yilun Du, Shuang Li, Joshua Tenenbaum, and Igor Mordatch. Improved contrastive divergence training of energy based models. arXiv preprint arXiv:2012.01316, 2020. Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. arXiv preprint arXiv:1702.03118, 2017. William Falcon et al. PyTorch Lightning. URL https://github.com/PyTorchLightning. Ruiqi Gao, Yang Song, Ben Poole, Ying Nian Wu, and Diederik P Kingma. Learning energy-based models by diffusion recovery likelihood. arXiv preprint arXiv:2012.08125, 2020. Ziv Goldfeld, Kristjan Greenewald, Jonathan Niles-Weed, and Yury Polyanskiy. Convergence of smoothed empirical measures with applications to entropy estimation. IEEE Transactions on Information Theory, 66(7):4368–4391, 2020. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, 2020. Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005. Alexia Jolicoeur-Martineau, Rémi Piché-Taillefer, Rémi Tachet des Combes, and Ioannis Mitliagkas. Adversarial score matching and improved sampling for image generation. arXiv preprint arXiv:2009.05475, 2020.
Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999. Zahra Kadkhodaie and Eero P Simoncelli. Solving linear inverse problems using the prior implicit in a denoiser. arXiv preprint arXiv:2007.13640, 2020. Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410, 2019. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013. Alex Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009. Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. Zengyi Li, Yubei Chen, and Friedrich T Sommer. Learning energy-based models in high-dimensional spaces with multi-scale denoising score matching. arXiv preprint arXiv:1910.07762, 2019. Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013. Koichi Miyasawa. An empirical Bayes estimator of the mean of a normal population. Bulletin of the International Statistical Institute, 38(4):181–188, 1961. Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018. Peter Mörters and Yuval Peres. Brownian Motion, volume 30. Cambridge University Press, 2010. Erik Nijkamp, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Learning non-convergent non-persistent short-run MCMC toward energy-based model. arXiv preprint arXiv:1904.09770, 2019.
Erik Nijkamp, Ruiqi Gao, Pavel Sountsov, Srinivas Vasudevan, Bo Pang, Song-Chun Zhu, and Ying Nian Wu. MCMC should mix: Learning energy-based model with flow-based backbone. In International Conference on Learning Representations, 2022. Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017. Vinay Uday Prabhu and Abeba Birhane. Large image datasets: A pyrrhic win for computer vision? arXiv preprint arXiv:2006.16923, 2020. Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar Zaiane, and Martin Jagersand. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognition, 106:107404, 2020. Prajit Ramachandran, Barret Zoph, and Quoc V Le. Swish: A self-gated activation function. arXiv preprint arXiv:1710.05941, 2017. Martin Raphan and Eero P Simoncelli. Least squares estimation without priors or supervision. Neural Computation, 23(2):374–420, 2011. Herbert Robbins. An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pp. 157–163, 1956. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer, 2015. Matthias Sachs, Benedict Leimkuhler, and Vincent Danos. Langevin dynamics with variable coefficients and nonconservative forces: from stationary states to numerical methods. Entropy, 19(12):647, 2017. Tim Salimans and Jonathan Ho. Should EBMs model the energy or the score? In Energy Based Models Workshop, ICLR 2021, 2021. Saeed Saremi. On approximating ∇f with neural networks. arXiv preprint arXiv:1910.12744, 2019.
Saeed Saremi. Learning and inference in imaginary noise models. arXiv preprint arXiv:2005.09047, 2020a. Saeed Saremi. Unnormalized variational Bayes. arXiv preprint arXiv:2007.15130, 2020b. Saeed Saremi and Aapo Hyvärinen. Neural empirical Bayes. Journal of Machine Learning Research, 20(181):1–23, 2019. Saeed Saremi, Arash Mehrjou, Bernhard Schölkopf, and Aapo Hyvärinen. Deep energy estimator networks. arXiv preprint arXiv:1805.08306, 2018. Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. arXiv preprint arXiv:1907.05600, 2019. Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. arXiv preprint arXiv:2006.09011, 2020. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020. Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 2, pp. 2377–2385, 2015. Terence Tao. Topics in Random Matrix Theory. American Mathematical Society, 2012. Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011. Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019. Norbert Wiener. Extrapolation, Interpolation, and Smoothing of Stationary Time Series: With Engineering Applications, volume 8. MIT Press, Cambridge, MA, 1964. Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu. A theory of generative ConvNet. In International Conference on Machine Learning, pp. 2635–2644. PMLR, 2016. Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, and Ying Nian Wu. Cooperative training of descriptor and generator networks.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(1):27–45, 2018. Jianwen Xie, Zilong Zheng, and Ping Li. Learning energy-based model with variational auto-encoder as amortized sampler. In The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI), volume 2, 2021. Yang Zhao, Jianwen Xie, and Ping Li. Learning energy-based generative models via coarse-to-fine expanding and sampling. In International Conference on Learning Representations, 2020.

A GAUSSIAN M-DENSITIES

We study the Bayes estimator for Gaussian MNMs analytically for X ~ N(µ, σ0² Id), where µ ∈ Rd. As in the example for the Poisson MNM in the paper, the first step is to derive an expression for the joint density

p(y) = ∫ ∏_{m=1}^{M} p_m(y_m|x) p(x) dx. (24)

We start with M = 2 and perform the integral above:

log p(y1, y2) = −[(σ0² + σ2²)‖y1‖² + (σ0² + σ1²)‖y2‖² + (σ1² + σ2²)‖µ‖²] / [2(σ1²σ2² + σ2²σ0² + σ0²σ1²)] + [2σ0²⟨y1, y2⟩ + 2σ1²⟨µ, y2⟩ + 2σ2²⟨µ, y1⟩] / [2(σ1²σ2² + σ2²σ0² + σ0²σ1²)] − log Z(σ0, σ1, σ2), (25)

where Z is the partition function:

Z(σ0, σ1, σ2) = [(2π)²(σ0σ1σ2)²(σ0⁻² + σ1⁻² + σ2⁻²)]^{d/2}.

The multimeasurement Bayes estimator of X is computed next via Eq. 7:

x̂(y1, y2) = (µσ1²σ2² + y1σ2²σ0² + y2σ0²σ1²) / (σ0²σ1² + σ0²σ2² + σ1²σ2²). (26)

As a sanity check, x̂(y) is the same whether Eq. 7 is computed using m = 1 or m = 2. Another sanity check is the limit σ1 → 0, in which we recover x̂(y1, y2) = y1 = x. The linear relation (linear in y for µ = 0) observed in Eq. 26 is a generalization of the well-known result for M = 1, known as the Wiener filter (Wiener, 1964). A visualization of this result for a mixture of Gaussians is given in Fig. 1. Next we state the results for general M, and after that we derive the expression for the Bayes estimator x̂(y). It is convenient to define the following notations:

Z_M := ∑_{m=0}^{M} σ_m⁻², Σ_M := Z_M (∏_{m=0}^{M} σ_m²)²,
ȳ := ∑_{m=0}^{M} y_{M∖m}, where y_{M∖m} := y_m ∏_{l ∈ {0,…,M}∖{m}} σ_l², and y_0 := µ.
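Before turning to general M, the M = 2 estimator in Eq. 26 can be checked numerically against the classical precision-weighted posterior mean of a Gaussian prior under two independent Gaussian measurements; the two expressions agree identically (multiply the precision-weighted form through by σ0²σ1²σ2²). A minimal NumPy sketch, with all constants illustrative:

```python
import numpy as np

def bayes_estimator_m2(y1, y2, mu, s0, s1, s2):
    """Multimeasurement Bayes estimator of Eq. 26 (sigma_i written as s_i)."""
    num = mu * s1**2 * s2**2 + y1 * s2**2 * s0**2 + y2 * s0**2 * s1**2
    den = s0**2 * s1**2 + s0**2 * s2**2 + s1**2 * s2**2
    return num / den

def posterior_mean(y1, y2, mu, s0, s1, s2):
    """Textbook posterior mean: precision-weighted average of prior and measurements."""
    prec = 1 / s0**2 + 1 / s1**2 + 1 / s2**2
    return (mu / s0**2 + y1 / s1**2 + y2 / s2**2) / prec

rng = np.random.default_rng(0)
y1, y2 = rng.normal(size=3), rng.normal(size=3)
xhat = bayes_estimator_m2(y1, y2, mu=0.5, s0=1.0, s1=0.7, s2=1.3)
assert np.allclose(xhat, posterior_mean(y1, y2, 0.5, 1.0, 0.7, 1.3))
# Sanity check from the text: as sigma_1 -> 0 the estimator returns y1.
assert np.allclose(bayes_estimator_m2(y1, y2, 0.5, 1.0, 1e-9, 1.3), y1)
```

This is the sense in which Eq. 26 generalizes the Wiener filter: for a Gaussian prior the estimator is exactly the linear posterior mean.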
With these notations, the expression for the M-density is given by:

log p(y) = −∑_{m=0}^{M} ‖y_m‖₂² / (2σ_m²) + ‖ȳ‖₂² / (2Σ_M) − (d/2) log((2π)^M Z_M ∏_{m=0}^{M} σ_m²). (27)

We can now derive x̂(y) using Eq. 7. Given the expression above, the algebra is straightforward:

x̂(y) = y_m + σ_m² ∇_m log p(y) = y_m − y_m + σ_m² (2Σ_M)⁻¹ ∇_m ‖ȳ‖₂² = σ_m² Σ_M⁻¹ (∏_{l ∈ {0,…,M}∖{m}} σ_l²) ȳ = (Z_M ∏_{l=0}^{M} σ_l²)⁻¹ ȳ. (28)

Note that the final expression is the same when carrying out the algebra for any measurement index m.

B M-DENSITY SCORE MATCHING ≡ MULTIMEASUREMENT DSM

Here we provide the score matching formalism (Hyvärinen, 2005) for M-densities and arrive at a multimeasurement denoising score matching (MDSM) learning objective. The derivation closely follows (Vincent, 2011) using our notation. In what follows, equalities hold modulo additive constants (terms that are not functions of the parameters θ):

J(θ) = E_y ‖∇_y f_θ(y) + ∇_y log p(y)‖₂²
     = E_y ‖∇_y f_θ(y)‖₂² + 2 E_y ⟨∇_y f_θ(y), ∇_y log p(y)⟩ + E_y ‖∇_y log p(y)‖₂²
     = E_y ‖∇_y f_θ(y)‖₂² + 2 ∫ p(y) ⟨∇_y f_θ(y), ∇_y log p(y)⟩ dy
     = E_y ‖∇_y f_θ(y)‖₂² + 2 ∫ ⟨∇_y f_θ(y), ∇_y p(y)⟩ dy
     = E_y ‖∇_y f_θ(y)‖₂² + 2 ∫∫ ⟨∇_y f_θ(y), ∇_y p(y|x)⟩ p(x) dx dy
     = E_y ‖∇_y f_θ(y)‖₂² + 2 ∫∫ ⟨∇_y f_θ(y), p(y|x)⁻¹ ∇_y p(y|x)⟩ p(y|x) p(x) dx dy
     = E_y ‖∇_y f_θ(y)‖₂² + 2 E_{(x,y)} ⟨∇_y f_θ(y), ∇_y log p(y|x)⟩
     = E_{(x,y)} ‖∇_y f_θ(y)‖₂² + 2 E_{(x,y)} ⟨∇_y f_θ(y), ∇_y log p(y|x)⟩ + E_{(x,y)} ‖∇_y log p(y|x)‖₂²
     = E_{(x,y)} ‖∇_y f_θ(y) + ∇_y log p(y|x)‖₂² (29)

The M-density score matching learning objective is given by the first identity above, and the last equality is the MDSM objective:

J(θ) = E_{(x,y)} ‖∇_y f_θ(y) + ∇_y log p(y|x)‖₂² (30)

For Gaussian MNMs, the MDSM learning objective simplifies to J(θ) = ∑_{m=1}^{M} J_m(θ), where

J_m(θ) = E_{(x,y)} ‖∇_m f_θ(y) − σ_m⁻²(y_m − x)‖₂². (31)

C MUVB: AN UNNORMALIZED LATENT VARIABLE MODEL

In this section we give a formulation of the energy function parametrization scheme named multimeasurement unnormalized variational Bayes (MUVB). Conceptually, MUVB is motivated by the goal of connecting the empirical Bayes denoising framework with the variational Bayes inference machinery.
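As a numerical aside, the least-squares structure of the MDSM objective can be illustrated in one dimension with a single measurement: regressing the denoising target on y recovers the score of the smoothed density, whose slope for a Gaussian X ~ N(0, σ0²) is −1/(σ0² + σ²). A minimal NumPy sketch, with all constants illustrative and a linear model standing in for the parametric score:

```python
import numpy as np

rng = np.random.default_rng(1)
s0, sigma, n = 1.0, 0.5, 200_000
x = rng.normal(0.0, s0, size=n)            # clean data
y = x + rng.normal(0.0, sigma, size=n)     # single noisy measurement (M = 1)
target = -(y - x) / sigma**2               # grad_y log p(y|x): the DSM regression target

# Least-squares fit of a linear score s(y) = a*y + b to the target.
A = np.stack([y, np.ones_like(y)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, target, rcond=None)

# The minimizer is the true score of the smoothed density: -y / (s0^2 + sigma^2).
assert abs(a - (-1.0 / (s0**2 + sigma**2))) < 0.02
assert abs(b) < 0.02
```

The fitted slope approaches −1/(σ0² + σ²) = −0.8, i.e. the least-squares minimizer of the denoising objective is the score of the noisy marginal, which is the identity the chain of equalities above establishes for general M.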
There appear to be fundamental challenges here, since one model is based on least-squares estimation and the other on the principle of maximum likelihood. In what follows, this fundamental conflict takes the shape of a latent variable model that is unnormalized. There have been recent studies along these lines (Saremi, 2020a;b) that we expand on; in particular, we introduce the novel metaencoder. At a high level, the idea is to set up a latent variable model for M-densities and use the variational free energy as the energy function. We refer to (Jordan et al., 1999) for an introduction to variational methods. In what follows, we use the lighter notation for σ × M models (Remark 2) that we also used in Sec. 4. In its simplest form, the latent variable model for M-densities is set up by the following choices. The prior over latent variables we take to be Gaussian: p(z) = N(z; 0, I_{dz}). The approximate posterior over latent variables q_φ(z|y) is taken to be a factorized Gaussian, a standard choice in the literature. For the conditional density p_η(y|z) we can consider

p_η(y|z) = N(y; ν_η(z), σ² I_{Md}),

which may seem an especially natural choice since samples from the M-density are obtained by adding multimeasurement noise to clean samples. In the variational autoencoder parlance, φ are the encoder's parameters, η the decoder's parameters, and ν = (ν_1, …, ν_M) the decoder's outputs. With this setup, the variational free energy, which we take to be the energy function, is easily derived (θ = (φ, η)):

f_θ(y) = (1/2σ²) E_{q_φ(z|y)} ‖y − ν_η(z)‖₂² + KL(q_φ(z|y) ‖ p(z)),

where the KL divergence is computed in closed form, and the expectation is computed by sampling z ~ q_φ(z|y) via the reparametrization trick (Kingma & Welling, 2013). In this machinery, in the inference step we are forced to have normalized (and tractable) posteriors since we must take samples from q_φ(z|y).
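The pieces introduced so far (reconstruction term, closed-form KL, reparametrized sampling) can be sketched in a few lines. The linear encoder/decoder below are hypothetical stand-ins for the networks; only the structure of the free energy is taken from the text:

```python
import numpy as np

def kl_factorized_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)

def free_energy(y, encode, decode, sigma, n_samples=64, rng=None):
    """Monte-Carlo estimate of f(y) = E_q ||y - nu(z)||^2 / (2 sigma^2) + KL(q || p)."""
    rng = rng or np.random.default_rng(0)
    mu, log_var = encode(y)
    recon = 0.0
    for _ in range(n_samples):
        z = mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)  # reparametrization
        recon += np.sum((y - decode(z))**2) / (2.0 * sigma**2)
    return recon / n_samples + kl_factorized_gaussian(mu, log_var)

# Toy check: if q equals the prior and the decoder reproduces y exactly,
# both terms vanish and the free energy is zero.
y = np.ones(4)
f = free_energy(y, encode=lambda y_: (np.zeros(2), np.zeros(2)),
                decode=lambda z: y, sigma=1.0)
assert abs(f) < 1e-12
assert kl_factorized_gaussian(np.zeros(2), np.zeros(2)) == 0.0
```

In the MUVB energy of Eq. 32 below the reconstruction term is additionally modified by the metaencoder h_ζ inside the expectation; the skeleton is otherwise the same.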
What about the conditional density p(y|z)? If we were to formalize a VAE for modeling M-densities, we would have to have a normalized conditional density p(y|z), since that framework is based on the principle of maximum likelihood. What is intriguing about our setup is that, since our goal is to learn the unnormalized p(y), we can consider conditional densities p(y|z) that are unnormalized. An intuitive choice is the following, which we name the metalikelihood:

p_{(η,ζ)}(y|z) ∝ exp(−(1/2σ²) ‖y − ν_η(z)‖₂² − h_ζ(y, ν_η(z))).

Now the energy function takes the following form (θ = (φ, η, ζ)):

f_θ(y) = E_{q_φ(z|y)} [(1/2σ²) ‖y − ν_η(z)‖₂² + h_ζ(y, ν_η(z))] + KL(q_φ(z|y) ‖ p(z)). (32)

This is the final form of the MUVB energy function. We refer to h_ζ as the metaencoder. Its input is (y, ν_η(z)) (note that z itself is a function of y via inference), and one can simply use any encoder architecture to parametrize it. In our implementation, we used the standard VAE pipeline and reused the encoder architecture to parametrize the metaencoder.

D TWO WALK-JUMP SAMPLING ALGORITHMS

In this section we provide walk-jump sampling (WJS) algorithms based on different discretizations of the underdamped Langevin diffusion (Eq. 23). The first is due to Sachs et al. (2017), which has also been used by Arbel et al. (2020) for generative modeling. The second is based on a recent algorithm analyzed by Cheng et al. (2018), which we implemented and for which we give the detailed algorithm here. In addition, we extended their calculation to general friction (they set γ = 2 in the paper) and provide an extension of (Cheng et al., 2018, Lemma 11) for completeness (the proof closely follows the calculation in the reference). In both cases we provide the algorithm for σ × M models.

D.1 WALK-JUMP SAMPLING ALGORITHM I

Algorithm 1: Walk-jump sampling (Saremi & Hyvärinen, 2019) using the discretization of Langevin diffusion by Sachs et al. (2017). The for loop corresponds to the dashed arrows (walk) in Fig.
2, and line 14 (where ⟨·⟩ is computed over the measurement indices m) to the solid arrow (jump).

1: Input: δ (step size), u (inverse mass), γ (friction), K (steps taken)
2: Input: learned energy function f(y) or score function g(y) ≈ ∇ log p(y)
3: Output: X̂_K
4: Y_0 ~ Unif([0, 1]^{Md})
5: V_0 ← 0
6: for k = 0, …, K − 1 do
7:   Y_{k+1} ← Y_k + δ V_k / 2
8:   Ψ_{k+1} ← −∇_y f(Y_{k+1}) or Ψ_{k+1} ← g(Y_{k+1})
9:   V_{k+1} ← V_k + u δ Ψ_{k+1} / 2
10:  B_{k+1} ~ N(0, I_{Md})
11:  V_{k+1} ← exp(−γδ) V_{k+1} + u δ Ψ_{k+1} / 2 + √(u(1 − exp(−2γδ))) B_{k+1}
12:  Y_{k+1} ← Y_{k+1} + δ V_{k+1} / 2
13: end for
14: X̂_K ← ⟨Y_{K,m} − σ² ∇_m f(Y_K)⟩ or X̂_K ← ⟨Y_{K,m} + σ² g_m(Y_K)⟩

D.2 WALK-JUMP SAMPLING ALGORITHM II

Before stating the algorithm, we extend the calculation in Cheng et al. (2018, Lemma 11) to general friction, which we used in our implementation of their Langevin MCMC algorithm. The algorithm is stated following the proof of the lemma, which closely follows Cheng et al. (2018). The starting point is solving the diffusion of Eq. 23. With the initial condition (y_0, v_0), the solution (y_t, v_t) of the discrete underdamped Langevin diffusion is (t is considered small here):

v_t = v_0 e^{−γt} − u ∫_0^t e^{−γ(t−s)} ∇f(y_0) ds + √(2γu) ∫_0^t e^{−γ(t−s)} dB_s,
y_t = y_0 + ∫_0^t v_s ds. (33)

These equations are then used in the proof of the following lemma.

Lemma 1. Conditioned on (y_0, v_0), the solution of the underdamped Langevin diffusion (Eq. 23) integrated up to time t is a Gaussian with conditional means

E[v_t] = v_0 e^{−γt} − γ⁻¹u(1 − e^{−γt}) ∇f(y_0),
E[y_t] = y_0 + γ⁻¹(1 − e^{−γt}) v_0 − γ⁻¹u (t − γ⁻¹(1 − e^{−γt})) ∇f(y_0), (34)

and with conditional covariances

E[(y_t − E[y_t])(y_t − E[y_t])ᵀ] = γ⁻¹u [2t − 4γ⁻¹(1 − e^{−γt}) + γ⁻¹(1 − e^{−2γt})] I_{Md},
E[(v_t − E[v_t])(v_t − E[v_t])ᵀ] = u(1 − e^{−2γt}) I_{Md},
E[(y_t − E[y_t])(v_t − E[v_t])ᵀ] = γ⁻¹u (1 − 2e^{−γt} + e^{−2γt}) I_{Md}. (35)

Proof. Computation of the conditional means is straightforward since the Brownian motion has zero mean. All the conditional covariances involve only the Brownian motion terms. We start with the simplest:

E[(v_t − E[v_t])(v_t − E[v_t])ᵀ] = 2γu E[(∫_0^t e^{−γ(t−s)} dB_s)(∫_0^t e^{−γ(t−s)} dB_s)ᵀ] = 2γu ∫_0^t e^{−2γ(t−s)} ds · I_{Md} = u(1 − e^{−2γt}) I_{Md}. (36)

From Eq. 33, the Brownian motion term for y_t is given by

√(2γu) ∫_0^t (∫_s^t e^{−γ(r−s)} dr) dB_s = √(2γu) γ⁻¹ ∫_0^t (1 − e^{−γ(t−s)}) dB_s.

The conditional covariance for y_t then follows similarly to Eq. 36:

E[(y_t − E[y_t])(y_t − E[y_t])ᵀ] = 2γ⁻¹u ∫_0^t (1 − e^{−γ(t−s)})² ds · I_{Md} = γ⁻¹u [2t − 4γ⁻¹(1 − e^{−γt}) + γ⁻¹(1 − e^{−2γt})] I_{Md}.

Finally, the conditional covariance between y_t and v_t is given by

E[(y_t − E[y_t])(v_t − E[v_t])ᵀ] = 2u ∫_0^t (1 − e^{−γ(t−s)}) e^{−γ(t−s)} ds · I_{Md} = γ⁻¹u (1 − 2e^{−γt} + e^{−2γt}) I_{Md}.

Algorithm 2: Walk-jump sampling (WJS) using the discretization of Langevin diffusion from Lemma 1. Below, we provide the Cholesky decomposition of the conditional covariance matrix.

1: Input: δ (step size), u (inverse mass), γ (friction), K (steps taken)
2: Input: learned energy function f(y) or score function g(y) ≈ ∇ log p(y)
3: Output: X̂_K
4: Y_0 ~ Unif([0, 1]^{Md})
5: V_0 ← 0
6: for k = 0, …, K − 1 do
7:   Ψ_k ← −∇_y f(Y_k) or Ψ_k ← g(Y_k)
8:   V_{k+1} ← V_k e^{−γδ} + γ⁻¹u(1 − e^{−γδ}) Ψ_k
9:   Y_{k+1} ← Y_k + γ⁻¹(1 − e^{−γδ}) V_k + γ⁻¹u (δ − γ⁻¹(1 − e^{−γδ})) Ψ_k
10:  B_{k+1} ~ N(0, I_{2Md})
11:  (Y_{k+1}, V_{k+1}) ← (Y_{k+1}, V_{k+1}) + L B_{k+1}  (see Eq. 38)
12: end for
13: X̂_K ← ⟨Y_{K,m} − σ² ∇_m f(Y_K)⟩ or X̂_K ← ⟨Y_{K,m} + σ² g_m(Y_K)⟩

The Cholesky decomposition Σ = LLᵀ for the conditional covariance in Lemma 1 is given by:

L = [ Σ_yy^{1/2} I_{Md}, 0 ; Σ_yy^{−1/2} Σ_yv I_{Md}, (Σ_vv − Σ_yv²/Σ_yy)^{1/2} I_{Md} ], (38)
Σ_yy = γ⁻¹u [2δ − 4γ⁻¹(1 − e^{−γδ}) + γ⁻¹(1 − e^{−2γδ})], (39)
Σ_vv = u(1 − e^{−2γδ}), (40)
Σ_yv = γ⁻¹u (1 − 2e^{−γδ} + e^{−2γδ}). (41)

Algorithm 1 vs. Algorithm 2. The two algorithms presented in this section have very different properties in our high-dimensional experiments. In general, we found Algorithm 1 (A1) to behave similarly to the overdamped version (see Appendix F for some illustrations), albeit with much faster mixing. We found the friction γ to play a very different role in the two algorithms, which is also apparent in comparing their mathematical expressions: in A1, the friction only appears in the form exp(−γδ); it is more complex in A2.
A similar discrepancy is found regarding u. Remark 4 (Initialization scheme). The initialization Y_0 ~ Unif([0, 1]^{Md}) used in the algorithms is the conventional choice, but for very large σ we found it effective to initialize the chain by further adding Gaussian noise in R^{Md}: Y_0 = ϵ_0 + ε_0, where ϵ_0 ~ Unif([0, 1]^{Md}) and ε_0 ~ N(0, σ² I_{Md}). Otherwise, with large step sizes δ = O(σ), chains starting from the uniform distribution could break early on. We found this scheme reliable, and it is well motivated, since the M-density p(y) is obtained by convolving p(x) with the Gaussian MNM. With this initialization one starts relatively close to the M-density manifold, and this is more pronounced for larger σ. However, this initialization scheme is based on heuristics (as is the conventional one). Remark 5 (The jump in WJS is asynchronous). The jump in WJS is asynchronous (Fig. 2), and it can also be taken inside the for loop in the algorithms provided without affecting the Langevin chain. Remark 6 (The k̄ notation). In running long chains we set K to be large, and we report the jumps (taken inside the for loops in the algorithms provided) at a fixed frequency denoted by k̄. Remark 7 (WJS for Poisson M-densities). WJS is a general sampling algorithm, schematized in Fig. 2. However, for Poisson MNMs it would involve sampling a discrete distribution in N^{Md}, which we do not know how to do efficiently in high dimensions (note that Langevin MCMC cannot be used, since the score function is not well defined for discrete distributions). We can use the Metropolis-Hastings algorithm and its popular variant Gibbs sampling, but these algorithms are very slow in high dimensions: at each iteration in principle only one dimension out of Md is updated, so in the general case the mixing time is at best O(Md); one can use blocking techniques, but they are problem dependent and typically complex (Andrieu et al., 2003).
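To make the walk of Algorithm 1 concrete, the sketch below runs it on a toy quadratic energy f(y) = ‖y‖²/2, for which the stationary density is N(0, I) and the asserted statistics are known; all settings are illustrative, and a learned energy would replace grad_f:

```python
import numpy as np

def walk_algorithm1(grad_f, dim, delta, u, gamma, K, rng):
    """Walk phase of Algorithm 1: discretized underdamped Langevin (Sachs et al., 2017)."""
    y = rng.uniform(0.0, 1.0, size=dim)   # Y_0 ~ Unif([0,1]^{Md})
    v = np.zeros(dim)                     # V_0 = 0
    c = np.exp(-gamma * delta)
    ys = []
    for _ in range(K):
        y = y + delta * v / 2.0
        psi = -grad_f(y)                  # Psi = -grad f, i.e. the score
        v = v + u * delta * psi / 2.0
        b = rng.normal(size=dim)
        v = c * v + u * delta * psi / 2.0 + np.sqrt(u * (1.0 - c**2)) * b
        y = y + delta * v / 2.0
        ys.append(y.copy())
    return np.array(ys)

rng = np.random.default_rng(2)
grad_f = lambda y: y                      # f(y) = ||y||^2 / 2, stationary density N(0, I)
ys = walk_algorithm1(grad_f, dim=10, delta=0.2, u=1.0, gamma=1.0, K=20_000, rng=rng)
samples = ys[2_000:].ravel()              # discard burn-in, pool dimensions
assert abs(samples.mean()) < 0.1
assert abs(samples.var() - 1.0) < 0.15
```

For this toy energy the jump (line 14) is trivial, x̂ = y − σ²∇f(y); with a learned M-density energy it is the step that maps the noisy walk back to R^d.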
E EXPERIMENTAL SETUP

Datasets. Our experiments were conducted on the MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky, 2009), and FFHQ-256 (Karras et al., 2019; pre-processed version by Child, 2020) datasets. The CIFAR-10 and FFHQ-256 training sets were augmented with random horizontal flips during training.

Network Architectures. For all experiments with the MDAE and MEM2 parametrizations, the U2-Net architecture (Qin et al., 2020) was chosen, since it is a recent and effective variant of the U-Net (Ronneberger et al., 2015) that has been successfully applied to various pixel-level prediction problems. However, we have not experimented with alternatives. Minor adjustments were made to the U2-Net from the original paper: we removed all normalization layers and switched the activation function to x ↦ x · sigmoid(x) (Elfwing et al., 2017; Ramachandran et al., 2017). To approximately control the variance of inputs to the network in each dimension, even when the noise level σ is large, the noisy input values were divided by the scaling factor √(0.225² + σ²). We also adjusted the number of stages in the network, the stage heights, and the layer widths (see Table 1 for details), in order to fit reasonable batch sizes on available GPU hardware and keep experiment durations limited to a few days (see Table 2). For experiments with the MUVB parametrization (the MNIST model in Sec. G.3), we used residual convolutional networks with a bottleneck block design, skip connections, and average pooling layers (Lin et al., 2013; Srivastava et al., 2015; He et al., 2016). The architectures of the encoder and metaencoder were kept the same, while the decoder used the encoder architecture in reverse, with the pooling layers substituted by nearest-neighbour upsampling. See Table 3 for a detailed description of the encoder architecture. Remark 8.
Strictly speaking, for σ × M models the energy function is permutation invariant and y should be treated as {y_1, …, y_M}, a set consisting of M Euclidean vectors. This is, however, difficult to enforce while keeping the model expressive. We note that in our model the permutation symmetry is a fragile symmetry, which one can easily break by making the σ_m just slightly different from each other. Therefore, we designed the architecture for the general setting of distinct σ_m, and in the experiments we simply set them to be equal.

Training. All models were trained using the Adam (Kingma & Ba, 2014) optimizer with the default setting of β1 = 0.9, β2 = 0.999, and ϵ = 10⁻⁸. Table 2 lists the main hyperparameters and hardware requirements for training the MDAE models for each dataset. The resulting training curves are shown in Fig. 6, demonstrating the stability of the optimization with a relatively simple setup.

Software. All models were implemented in CPython (v3.8) using PyTorch v1.9.0 (Paszke et al., 2017) and the PyTorch Lightning framework v1.4.6 (Falcon et al.).

Table 1: U2-Net architecture used for the MDAE and MEM2 parametrizations for all datasets. All layer widths (number of channels) were expanded by a width factor for CIFAR-10 and FFHQ-256.

Stage     | Height | In channels | Mid channels | Out channels | Side
Encoder 1 | 6      | 3 or 1     | 64           | 64           | -
Encoder 2 | 5      | 64         | 128          | 128          | -
Encoder 3 | 4      | 128        | 128          | 256          | -
Encoder 4 | 3      | 256        | 256          | 512          | -
Encoder 5 | 3      | 512        | 256          | 512          | 512
Decoder 4 | 3      | 1024       | 128          | 256          | 256
Decoder 3 | 3      | 512        | 128          | 128          | 128
Decoder 2 | 5      | 256        | 128          | 64           | 64
Decoder 1 | 6      | 128        | 64           | 64           | 64

Figure 6: MDAE training plots for (a) MNIST 1 × 16, (b) CIFAR-10 1 × 8, and (c) FFHQ-256 4 × 8 demonstrate that optimization is stable for all datasets. Loss values on the y-axis in these plots represent the value of our objective function (Eq. 14).
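The input-scaling trick described above can be made concrete: assuming (as the constant suggests) pixel data whose per-dimension standard deviation is roughly 0.225, dividing the noisy input by √(0.225² + σ²) yields network inputs of roughly unit variance at any noise level. A sketch with synthetic data standing in for images:

```python
import numpy as np

def scale_noisy_input(y, sigma, data_std=0.225):
    """Normalize noisy inputs to ~unit variance regardless of the noise level sigma."""
    return y / np.sqrt(data_std**2 + sigma**2)

rng = np.random.default_rng(3)
sigma = 4.0
x = rng.normal(0.0, 0.225, size=1_000_000)     # stand-in for pixel data with std ~0.225
y = x + rng.normal(0.0, sigma, size=x.shape)   # noisy measurement
assert abs(scale_noisy_input(y, sigma).std() - 1.0) < 0.01
```

Without such a rescaling, the first-layer activations would grow linearly with σ, which matters precisely in the large-noise regime this paper operates in.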
Table 2: Main hyperparameters and computational resources used for training the MDAE models.

                 | MNIST 1 × 16           | CIFAR-10 1 × 8       | FFHQ-256 4 × 8
Width factor     | 1                      | 2                    | 2
Learning rate    | 0.00002                | 0.00002              | 0.00002
Total batch size | 128                    | 256                  | 16
GPUs             | 1 GTX Titan X          | 4 GTX Titan X        | 4 V100
Run time         | 2.3 days (1000 epochs) | 2 days (1000 epochs) | 2.75 days (250 epochs)

Table 3: MUVB encoder architecture for MNIST. Each row indicates a sequence of transformations in the first column and the spatial resolution of the resulting output in the second column. Block{N} × {D} denotes a sequence of D blocks, each consisting of four convolutional layers with a skip connection using a bottleneck design: the layers have widths [N, 0.25N, 0.25N, N] and kernel sizes [1, 3, 3, 1], except when the input resolution becomes smaller than 3, in which case all the kernel sizes are 1.

Module        | Output resolution
Input         | 28 × 28
Block128 × 2  | 28 × 28
Average Pool  | 14 × 14
Block256 × 2  | 14 × 14
Average Pool  | 7 × 7
Block512 × 2  | 7 × 7
Average Pool  | 4 × 4
Block1024 × 2 | 4 × 4
Average Pool  | 2 × 2
Block1024 × 2 | 2 × 2
Average Pool  | 1 × 1
Block1024 × 1 | 1 × 1

F LANGEVIN MCMC LABORATORY

In this section, we give a light tutorial on the Langevin MCMC used in the walk phase of WJS. Perhaps the most interesting consequence of multimeasurement generative models is that Langevin MCMC in high dimensions becomes stable enough for us to examine and dissect the performance of various Langevin MCMC variants in practice. The reader might ask why we used underdamped Langevin MCMC instead of its well-known overdamped version. The reason is the recent theoretical developments discussed in Sec. 5, which show much faster dimension dependence for the convergence (mixing time), O(√d) vs. O(d), for underdamped Langevin MCMC. To give an idea regarding the dimension, for the 4 × 8 model on FFHQ-256, Md ≈ 10⁶. Our trained 4 × 8 model is used throughout this section. In addition, the MCMC is initialized with a fixed random seed throughout this section.
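The step-size effects dissected in this section can be previewed on a toy quadratic energy. Assuming the overdamped update y_{k+1} = y_k − (δ²/2)∇f(y_k) + δε_k (the iteration studied in Sec. F.1) with f(y) = y²/2 in one dimension, the chain is a linear AR(1) process whose stationary variance, 1/(1 − δ²/4), inflates as δ grows: a small-scale version of the over-dispersion seen at large step sizes. A minimal sketch with illustrative constants:

```python
import numpy as np

rng = np.random.default_rng(4)
delta, K = 0.5, 400_000
y = 0.0
samples = np.empty(K)
for k in range(K):
    # Overdamped update with f(y) = y^2 / 2, so grad f(y) = y.
    y = y - 0.5 * delta**2 * y + delta * rng.normal()
    samples[k] = y

predicted_var = 1.0 / (1.0 - delta**2 / 4.0)   # AR(1) stationary variance
assert abs(samples[10_000:].var() - predicted_var) < 0.05
```

Here the true variance is 1; the discretized chain equilibrates to roughly 1.07 at δ = 0.5, and the bias grows without bound as δ approaches 2, mirroring the divergence of the chains at overly large δ discussed below.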
F.1 OVERDAMPED LANGEVIN MCMC

We start with overdamped Langevin MCMC, which is simpler and gives us some intuition on the behavior of Langevin MCMC in general. The algorithm is based on the following stochastic iterations (setting the inverse mass u = 1):

y_{k+1} = y_k − (δ²/2) ∇f(y_k) + δ ε_k, where ε_k ~ N(0, I_{Md}), (42)

with drift term A_k = (δ²/2) ∇f(y_k) and noise term B_k = δ ε_k. Comparing the above with Eq. 7, it is intuitive to consider setting δ = O(σ). We start with δ = σ: in this case A_k pulls noisy data toward the clean manifold (note, however, the important extra factor of 1/2), and the B_k term corresponds to adding multimeasurement noise sampled from N(0, σ² I_{Md}). The overall effect is that one starts walking on the manifold of the M-density. And if that is the case, the jumps computed by x̂(y_k) should give samples close to the manifold of the (clean) data distribution, i.e., samples from p(x). We start with that experiment by setting δ = σ. Note that in walk-jump sampling we sample the M-density in R^{Md}, generating highly noisy data for σ = 4; we only show the clean (jump) samples in R^d, reported by skipping every 30 steps, i.e. k̄ = 30. Our intuition was partly correct in that the sampler manages to capture modes, but the step size is clearly too high. We next consider δ = σ/2 with k̄ = 100, and then continue the chain, shown for k̄ = 1000. Theoretically, to converge to the true distribution (associated with the M-density) the step size should be small, where "small" is dictated by how close one wishes to be to the true distribution, e.g. see (Cheng et al., 2018, Theorem 1). As we see above, even for δ = 2 the sampler does not converge after 1000 steps, and a fixed δ = σ/2 appears to be too high for this model: by continuing the chain with the step size δ = σ/2 one starts to gradually move away from the manifold.

F.2 UNDERDAMPED LANGEVIN MCMC

In this section we focus on underdamped Langevin MCMC. The algorithms are based on discretizing the underdamped Langevin diffusion (Eq. 23).
In this paper we considered two such discretization schemes: the one by Sachs et al. (2017) that we used in Algorithm 1, and the one due to Cheng et al. (2018) that we implemented and used in Algorithm 2. The first algorithm has not been analyzed theoretically, but it is easy to show that in the limit γ → ∞ it reduces to the overdamped Langevin MCMC we just discussed. The second algorithm is more complex in how the friction γ and inverse mass u enter. For comparison, we present results obtained by both algorithms, showing Algorithm 1 first. As in the previous section, we start with δ = σ; we set γ = 1 (u = 1 unless stated otherwise) with ∆k = 30: As in the previous section, δ = σ is too high for the FFHQ-256 4 × 8 model. (However, note the relative stability of Algorithm 2 for this random seed.) Next, we consider δ = σ/2, γ = 1, shown with ∆k = 100: Note the remarkable contrast between the results above and the corresponding figure (for ∆k = 100) in the previous section. We continue the chain with the same ∆k = 100: Both algorithms start to diverge for this choice of parameters (δ = σ/2, γ = 1). To add more stability we can consider increasing the mass (lowering u). In Cheng et al. (2018), u is replaced by 1/L, where L is the Lipschitz constant of the score function. By lowering u we are telling the algorithm that the score function is less smooth, and the MCMC makes more conservative updates. This can also be seen in the equations of both algorithms, where the step size δ is multiplied by u in several places. However, this effect is more subtle than simply scaling δ, especially for Algorithm 2, as can be seen by inspection. Next, we consider u = 1/2. This affects the mixing time, and the chains are now shown with ∆k = 1000 (instead of ∆k = 100 earlier): This experiment concludes this section. We come back to more FFHQ-256 chains in Appendix I.

G.1 MORE IS DIFFERENT

There is a notion of congruence which arises in our analysis in Sec.
2.3, which we denote by ≅: two different models, σ × M and σ′ × M′, agree with each other in terms of the plug-in estimator, σ × M ≅ σ′ × M′, if σ/√M = σ′/√M′. However, these models are vastly different: one is in R^{Md}, the other in R^{M′d}. And if σ′ > σ, we expect the σ′ × M′ model to have better properties as a generative model. This is clear in the regime (σ ≪ 1, M = 1). In that regime almost no learning occurs: at test time, MCMC either gets stuck in a mode for a long time (the best-case scenario) or simply breaks down. We refer to (Saremi & Hyvärinen, 2019) for a geometric intuition on why large noise is essential in these models for walk-jump sampling. We should point out that this is a general phenomenon in all denoising objectives (Alain & Bengio, 2014; Song & Ermon, 2020). In a nutshell, when the noise is small the Langevin sampler finds it difficult to navigate in high dimensions, and the forces of the Brownian motion eventually take it to a part of space unknown to our model, at which point MCMC breaks down (the first experiment below). If we are lucky we can find the manifold initially (the second experiment below) and MCMC may remain stable for a while, but a model with σ′ > σ will have better mixing properties. We should highlight that in the analysis of Cheng et al. (2018) the mixing time scales as κ², where κ is the condition number (see their Sec. 1.4.1), proportional to the Lipschitz constant; this quantifies the intuition that MCMC on smoother densities (smaller Lipschitz constant) mixes faster. To validate these intuitions we compare 1 × 16 and 1/4 × 1. In addition to statistically identical plug-in estimators, these models also arrive at similar training losses with the same training schedules (1.03 for 1 × 16 and 1.19 for 1/4 × 1). We start with 1/4 × 1. Below we show WJS using Algorithm 2 with the setting (δ = 1/8, γ = 1/2, u = 1) (the setting δ = σ/2, γ = 1/2, u = 1 is a simple choice we used in testing all our σ × M models).
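Before examining the chains, the congruence claim itself is easy to check numerically: averaging M measurements at noise level σ is statistically equivalent to a single measurement at noise level σ/√M, so the Bayes estimators of congruent models achieve the same error. The sketch below uses a hypothetical two-point prior x ∈ {−1, +1} (not from the paper), chosen only because the posterior mean is then available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

def bayes_mse(sigma, M, n=200_000):
    """Monte Carlo MSE of the posterior mean E[x|y] for a two-point prior
    x in {-1, +1}, where E[x|y] = tanh(sum_m y_m / sigma^2) in closed form."""
    x = rng.choice([-1.0, 1.0], size=n)
    y = x[:, None] + sigma * rng.standard_normal((n, M))
    xhat = np.tanh(y.sum(axis=1) / sigma**2)
    return np.mean((xhat - x) ** 2)

# Congruent models: sigma / sqrt(M) equals 1/4 in both cases.
print(bayes_mse(sigma=1.0, M=16), bayes_mse(sigma=0.25, M=1))  # nearly equal
```

The two estimates agree up to Monte Carlo error, mirroring the statistically identical plug-in estimators of the 1 × 16 and 1/4 × 1 models; the point of the experiments below is that the MCMC behavior of congruent models is nevertheless very different.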
Here, MCMC breaks down shortly after the last step shown here (this chain is shown with ∆k = 30; in the remainder, ∆k = 500): Algorithm 1 is more stable here, and we arrive at the following chain (a total of 10^5 steps): Now, for the 1 × 16 model, using Algorithm 2 with parameters (δ = 1/2, γ = 1/2, u = 1), we arrive at: And using Algorithm 1 with the same parameters (and initial seed) we arrive at: (Note the stark differences in terms of mixing between Algorithms 1 and 2 for this model.) In summary, more is different: quantitative differences become qualitative ones (Anderson, 1972), and this is highlighted here for congruent σ × M models that differ in terms of the noise level. If computation is not an issue, one should always consider models with larger σ, but this remains a conjecture here in its limiting case (as σ and M grow unbounded while σ/√M remains small).

G.2 INCREASING M: TIME COMPLEXITY, SAMPLE QUALITY, AND MIXING TIME TRADE-OFFS

Here we extend the ablation study of the previous section by studying σ × M models with a fixed σ while increasing M. In the previous section we studied two models where x̂(y) has the same statistical properties as measured by the loss (and the plug-in estimator), and we validated the hypothesis that MCMC sampling has better mixing properties for the smoother M-density (with larger σ). Now the question is: what if we keep σ fixed and increase M? As we motivated in Sec. 1, by increasing M the posterior p(x|y) concentrates on the mean x̂(y), thus increasing the sample quality in WJS. However, there are trade-offs: (i) The first is the issue of time complexity, which is an open problem here. In our architecture, the M measurements are passed through a convolutional layer, after which models with different M have approximately the same time complexity. The problem is that this may not be an ideal architecture.
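The claim that the cost is roughly independent of M once the measurements are merged by the first convolution can be seen from a back-of-envelope FLOP count. The layer sizes below are hypothetical, chosen only to illustrate that the first convolution (the only part that scales with M) is a negligible fraction of the total.

```python
# Back-of-envelope FLOP count (hypothetical layer sizes, not the paper's
# architecture): only the first convolution scales with M.
def conv_flops(h, w, c_in, c_out, k):
    """Multiply-add count for a k x k convolution at resolution h x w."""
    return 2 * h * w * c_in * c_out * k * k

def model_flops(M, channels=3, width=256, depth=20, res=32):
    first = conv_flops(res, res, M * channels, width, 3)   # scales with M
    rest = depth * conv_flops(res, res, width, width, 3)   # independent of M
    return first + rest

for M in (1, 2, 4, 16):
    print(M, model_flops(M) / model_flops(1))  # ratio stays close to 1
```

Even at M = 16, the total cost grows by only a few percent in this sketch, which is why the time-complexity trade-off discussed here is about capacity rather than raw FLOPs.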
Presumably, one should increase the network capacity when increasing M (for the same σ), but by how much we do not know (this clearly depends on how complex p_X is). Here we ignore this issue and assume that our architecture has enough capacity for the largest M = 16 considered. (ii) The second trade-off is that as we increase M, the price we pay for higher sample quality is longer mixing times, both in generating the first sample and, more importantly, in the mixing between modes. This trade-off is intuitive since p(y_1, . . . , y_M) becomes more complex for larger M (for fixed σ). Below we compare WJS chains (as before, without warmup) for σ = 1 and M = 1, 2, 4, 16:

Figure 7: (mixing time vs. image quality trade-off) WJS chains for 1 × M models are presented in real time (∆k = 1) starting from noise (320 steps in total), for M = 1, 2, 4, 16 in order from the top panel to the bottom one. We used Algorithm 2 (δ = 1/2, γ = 1/2, u = 1) with the same initial seed. Note that, at the cost of sample quality, all classes are visited by 1 × 1 in the short chain presented.

G.3 MUVB VS. MDAE

Below we make a one-to-one comparison between MUVB and MDAE for the 1 × 4 setting we trained on MNIST. In both cases we ran a single chain for over 1 million steps. For sampling the M-density we used the Langevin MCMC algorithm of Sachs et al. (2017) with parameters (δ = 1, γ = 1/4, u = 1) (see Appendix D). This is an aggressive choice of parameters for Langevin MCMC, designed for fast mixing, yet the chains do not break for over 1M steps (we stopped them for computational reasons). Below we show the two chains at a discrete time resolution of 5 steps (the first sample, top left corner, is obtained after only 5 steps). This comparison visually demonstrates that MUVB has better mixing and MDAE better sample quality. All digit classes are visited by MUVB in a variety of styles in the 4000 steps shown here.
The MDAE chain is cleaner, but the classes {0, 1, 5, 6} are visited only rarely.

Figure 8: MUVB, 1 × 4 model on MNIST, ∆k = 5 steps.

Figure 9: MDAE, 1 × 4 model on MNIST, ∆k = 5 steps.

G.4 LIFELONG MARKOV CHAIN

We demonstrate the WJS chain with over 1 million steps obtained with MDAE (1 × 4) in its entirety (viewed left to right, top to bottom). We informally refer to these long chains that never break as lifelong Markov chains. The chain is reported here in its entirety to demonstrate that it does not get stuck in a mode and remains stable through the end (we stopped it for computational reasons).

Figure 10: MDAE, 1 × 4 model on MNIST, ∆k = 500 steps, over 1 million steps in total.

H CIFAR-10: SINGLE-CHAIN FID EVALUATION & SOME FAILURE MODES

The main fear in parametrizing the score function directly in MSM, of which MDAE is an instance, is that we are not guaranteed to learn a conservative score function (see Sec. 4.1). More importantly, we do not have control over how the learned model fails in that regard. These fears were realized in our CIFAR experiments on the 1 × 8 model presented below. The main results use Algorithm 2. Algorithm 1 simply fails for the Langevin MCMC parameters we experimented with below (for completeness we present a short chain in the bottom panel). In 2D toy experiments, non-conservative score functions were quantified in (Saremi et al., 2018, Fig. 1h), but this quantification is difficult in high dimensions. However, what is intriguing about the experiments here is that the non-conservative nature of the score function is manifested (this is merely a conjecture) in the cyclic mode visits apparent in the Markov chain.

Figure 11: The top panels use Algorithm 2 at two different checkpoints. The bottom panel uses Algorithm 1. MCMC parameters are (δ = 0.1, γ = 2, u = 10) for the top and bottom panels, and u = 20 for the middle panel. In all cases ∆k = 500 (≈ 2 × 10^5 steps).
The FID scores are respectively 91 & 79.

We encountered the same problem using spectral normalization (Miyato et al., 2018): the chain below was obtained using Algorithm 2 with the setting (δ = 1/2, γ = 1/2, u = 1), where we obtained an FID score of 99. We ran the chain for 1M steps; the first 600K steps are visualized below:

Figure 12: WJS chain for MDAE 1 × 8, trained with spectral normalization, ∆k = 500.

We also experimented with training MDAE using the SGD optimizer. Using Algorithm 1 with the setting (δ = 0.7, γ = 1, u = 1), the MCMC chain was stable for 1M steps, at which point we stopped it. We picked 50K images from the chain at equal intervals of ∆k = 20, resulting in an FID of 43.95:

Figure 13: The first 13K steps of a WJS chain with 1M steps for MDAE 1 × 8 on CIFAR-10, with ∆k = 10. We obtained the FID score of 43.95 by selecting 50K samples, skipping ∆k = 20 steps.

We finish this section with a discussion of numerical comparisons of sample quality between our MDAE 1 × 8 model and other recent MCMC-based methods, summarized in Table 4. The table excludes denoising diffusion models, since the MCMC chain in such models is conditional, in the sense that sampling reverses a diffusion process based on a sequence of conditional distributions learned during training (see the discussion in Sec. 7). By comparison, in all the papers in the table below there is only one energy/score function being learned. The use of the term "long chain" below is based on the current status of the field, chains of order 1,000 steps: the MCMC chains used for generation and for reporting FID scores in most prior works have been short by comparison, of order 50 to 100 steps, which we have indicated in a column of the table. The FID scores in those papers were obtained by 50K short-run parallel chains.
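The single-chain evaluation protocol used for our own score amounts to subsampling one long chain at a fixed stride (50K samples, every ∆k = 20 steps). A minimal sketch of the index selection is below; the FID computation itself, via any standard implementation, is not shown, and the `burn_in` parameter is an added knob not described in the text.

```python
def single_chain_eval_indices(chain_len, n_samples=50_000, stride=20, burn_in=0):
    """Indices of samples taken every `stride` walk-jump steps from a single
    chain, as in the 50K / stride-20 protocol described above."""
    needed = burn_in + n_samples * stride
    if chain_len < needed:
        raise ValueError(f"chain too short: need {needed} steps, have {chain_len}")
    return range(burn_in + stride - 1, burn_in + n_samples * stride, stride)

idx = single_chain_eval_indices(chain_len=1_000_000, n_samples=50_000, stride=20)
print(len(list(idx)), min(idx), max(idx))  # 50000 indices into one long chain
```

Note that this uses every step of a 1,000,000-step chain exactly once, in contrast to the 50K independent short-run chains used in prior work.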
Our work is unique in this literature in that, for the first time, we report competitive FID scores obtained from a single MCMC chain. As we emphasized in Sec. 6, the quest for generating very long (lifelong) chains with good sample quality and mixing properties, as measured by the FID score (Heusel et al., 2017), is a scientific challenge, and we believe it is an ultimate test to demonstrate that one has found a good approximation to the true energy/score function. We should point out that the only competitive paper here in obtaining FID scores with long MCMC chains is by Nijkamp et al. (2022), where the long chains have been limited to 2,000 steps and the FID obtained is much worse (78.12) than what we obtained (43.95) with our 1 × 8 model from a single MCMC chain of 1,000,000 steps (see Fig. 13).

Table 4: FID results for unconditional MCMC-based sample generation on CIFAR-10

                          FID     long chain   single chain   MCMC steps
Xie et al. (2018)         35.25   no           no             N/A
Nijkamp et al. (2019)     23.02   no           no             100
Du & Mordatch (2019)      40.58   no           no             60
Zhao et al. (2020)        16.71   no           no             60
Xie et al. (2021)         36.20   no           no             50
Nijkamp et al. (2022)     78.12   yes          no             2,000
MDAE, 1 × 8 (our work)    43.95   yes          yes            1,000,000

I MORE WJS CHAINS ON FFHQ-256

In this section, we provide several WJS chains for our MDAE 4 × 8 model on the FFHQ-256 dataset. We refer to Algorithm 1 as A1 and Algorithm 2 as A2, and MCMC parameters are listed as (δ, γ, u).

A1; (2, 1, 1); ∆k = 300
A2; (2, 1, 1); ∆k = 200
A2; (1/10, 1, 10); ∆k = 500
A2; (4, 1/2, 1/4); ∆k = 150
A1; (2, 1/2, 1); ∆k = 100
A1; (2, 1/2, 1); ∆k = 20