# Generalized Energy Based Models

Published as a conference paper at ICLR 2021

Michael Arbel, Liang Zhou & Arthur Gretton
Gatsby Computational Neuroscience Unit, University College London

We introduce the Generalized Energy Based Model (GEBM) for generative modelling. These models combine two trained components: a base distribution (generally an implicit model), which can learn the support of data with low intrinsic dimension in a high dimensional space; and an energy function, to refine the probability mass on the learned support. Both the energy function and base jointly constitute the final model, unlike GANs, which retain only the base distribution (the "generator"). GEBMs are trained by alternating between learning the energy and the base. We show that both training stages are well-defined: the energy is learned by maximising a generalized likelihood, and the resulting energy-based loss provides informative gradients for learning the base. Samples from the posterior on the latent space of the trained model can be obtained via MCMC, thus finding regions in this space that produce better quality samples. Empirically, the GEBM samples on image-generation tasks are of much better quality than those from the learned generator alone, indicating that all else being equal, the GEBM will outperform a GAN of the same complexity. When using normalizing flows as base measures, GEBMs succeed on density modelling tasks, returning comparable performance to direct maximum likelihood of the same networks.

## 1 Introduction

Energy-based models (EBMs) have a long history in physics, statistics and machine learning (LeCun et al., 2006). They belong to the class of explicit models, and can be described by a family of energies E which define probability distributions with density proportional to exp(-E). Those models are often known up to a normalizing constant Z(E), also called the partition function. The learning task consists of finding an optimal function that best describes a given system or target distribution P. This can be achieved using maximum likelihood estimation (MLE); however, the intractability of the normalizing partition function makes this learning task challenging. Thus, various methods have been proposed to address this (Hinton, 2002; Hyvärinen, 2005; Gutmann and Hyvärinen, 2012; Dai et al., 2019a;b). All these methods estimate EBMs that are supported over the whole space. In many applications, however, P is believed to be supported on an unknown lower dimensional manifold. This happens in particular when there are strong dependencies between variables in the data (Thiry et al., 2021), and suggests incorporating a low-dimensionality hypothesis in the model.

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are a particular way to enforce low dimensional structure in a model. They rely on an implicit model, the generator, to produce samples supported on a low-dimensional manifold by mapping a pre-defined latent noise to the sample space using a trained function. GANs have been very successful in generating high-quality samples on various tasks, especially for unsupervised image generation (Brock et al., 2018). The generator is trained adversarially against a discriminator network whose goal is to distinguish samples produced by the generator from the target data.
This has inspired further research to extend the training procedure to more general losses (Nowozin et al., 2016; Arjovsky et al., 2017; Li et al., 2017; Bińkowski et al., 2018; Arbel et al., 2018) and to improve its stability (Miyato et al., 2018; Gulrajani et al., 2017; Nagarajan and Kolter, 2017; Kodali et al., 2017). While the generator of a GAN has effectively a low-dimensional support, it remains challenging to refine the distribution of mass on that support using pre-defined latent noise. For instance, as shown by Cornish et al. (2020) for normalizing flows, when the latent distribution is unimodal and the target distribution possesses multiple disconnected low-dimensional components, the generator, as a continuous map, compensates for this mismatch using steeper slopes. In practice, this implies the need for more complicated generators.

(Correspondence: michael.n.arbel@gmail.com.)

In the present work, we propose a new class of models, called Generalized Energy Based Models (GEBMs), which can represent distributions supported on low-dimensional manifolds, while offering more flexibility in refining the mass on those manifolds. GEBMs combine the strength of both implicit and explicit models in two separate components: a base distribution (often chosen to be an implicit model) which learns the low-dimensional support of the data, and an energy function that can refine the probability mass on that learned support. We propose to train the GEBM by alternating between learning the energy and the base, analogous to f-GAN training (Goodfellow et al., 2014; Nowozin et al., 2016). The energy is learned by maximizing a generalized notion of likelihood which we relate to the Donsker-Varadhan lower bound (Donsker and Varadhan, 1975) and Fenchel duality, as in (Nguyen et al., 2010; Nowozin et al., 2016). Although the partition function is intractable in general, we propose a method to learn it in an amortized fashion without introducing additional surrogate models, as done in variational inference (Kingma and Welling, 2014; Rezende et al., 2014) or by Dai et al. (2019a;b). The resulting maximum likelihood estimate, the KL Approximate Lower-bound Estimate (KALE), is then used as a loss for training the base. When the class of energies is rich and smooth enough, we show that KALE leads to a meaningful criterion for measuring weak convergence of probabilities. Following recent work by Chu et al. (2020); Sanjabi et al. (2018), we show that KALE possesses well-defined gradients w.r.t. the parameters of the base, ensuring well-behaved training. We also provide convergence rates for the empirical estimator of KALE when the variational family is sufficiently well behaved, which may be of independent interest.

The main advantage of GEBMs becomes clear when sampling from these models: the posterior over the latents of the base distribution incorporates the learned energy, putting greater mass on regions in this latent space that lead to better quality samples. Sampling from the GEBM can thus be achieved by first sampling from the posterior distribution of the latents via MCMC in the low-dimensional latent space, then mapping those latents to the input space using the implicit map of the base. This is in contrast to standard GANs, where the latents of the base have a fixed distribution. We focus on a class of samplers that exploit gradient information, and show that these samplers enjoy fast convergence properties by leveraging the recent work of Eberle et al.
(2017). While there has been recent interest in using the discriminator to improve the quality of the generator during sampling (Azadi et al., 2019; Turner et al., 2019; Neklyudov et al., 2019; Grover et al., 2019; Tanaka, 2019; Wu et al., 2019b), our approach emerges naturally from the model we consider.

We begin in Section 2 by introducing the GEBM model. In Section 3, we describe the learning procedure using KALE, then derive a method for sampling from the learned model in Section 4. In Section 5 we discuss related work. Finally, experimental results are presented in Section 6, with code available at https://github.com/MichaelArbel/GeneralizedEBM.

## 2 Generalized Energy-Based Models

Figure 1: Data generating distribution supported on a line and with higher density at the extremities. Models are learned using either a GAN, GEBM, or EBM. More details are provided in Appendix G.3.

In this section, we introduce generalized energy based models (GEBM), that combine the strengths of both energy-based models and implicit generative models, and admit the first of these as a special case.

An energy-based model (EBM) is defined by a set E of real-valued functions called energies, where each E ∈ E specifies a probability density over the data space X ⊆ R^d up to a normalizing constant,

$$Q(dx) = \exp(-E(x) - A)\,dx, \qquad A = \log\left(\int \exp(-E(x))\,dx\right). \qquad (1)$$

While EBMs have been shown recently to be powerful models for representing complex high dimensional data distributions, they still unavoidably lead to a blurred model whenever data are concentrated on a lower-dimensional manifold. This is the case in Figure 1(a), where the ground truth distribution is supported on a 1-D line and embedded in a 2-D space. The EBM in Figure 1(d) learns to give higher density to a halo surrounding the data, and thus provides a blurred representation. That is a consequence of the EBM having a density defined over the whole space, and can result in blurred samples for image models.

An implicit generative model (IGM) is a family of probability distributions G_θ parametrized by a learnable generator function G : Z → X that maps latent samples z from a fixed latent distribution η to the data space X. The latent distribution η is required to have a density over the latent space Z and is often easy to sample from. Thus, sampling from G is simply achieved by first sampling z from η, then applying G:

$$x \sim G \iff x = G(z), \quad z \sim \eta. \qquad (2)$$

GANs are popular instances of these models, and are trained adversarially (Goodfellow et al., 2014). When the latent space Z has a smaller dimension than the input space X, the IGM will be supported on a lower dimensional manifold of X, and thus will not possess a Lebesgue density on X (Bottou et al., 2017). IGMs are therefore good candidates for modelling low dimensional distributions. While GANs can accurately learn the low-dimensional support of the data, they can have limited power for representing the distribution of mass on the support. This is illustrated in Figure 1(b).

A generalized energy-based model (GEBM) Q is defined by a combination of a base G and an energy E defined over a subset X of R^d. The base component can typically be chosen to be an IGM as in (2). The generalized energy component can refine the mass on the support defined by the base.
It belongs to a class E of real-valued functions defined on the input space X, and represents the negative log-density of a sample from the GEBM with respect to the base G,

$$Q(dx) = \exp(-E(x) - A_{G,E})\,G(dx), \qquad A_{G,E} = \log\left(\int \exp(-E(x))\,G(dx)\right), \qquad (3)$$

where A_{G,E} is the logarithm of the normalizing constant of the model w.r.t. G. Thus, a GEBM Q re-weights samples from the base according to the un-normalized importance weights exp(-E(x)). Using the latent structure of the base G, this importance weight can be pulled back to the latent space to define a posterior latent distribution ν,

$$\nu(z) := \eta(z)\exp(-E(G(z)) - A_{G,E}). \qquad (4)$$

Hence, the posterior latent ν can be used instead of the latent noise η for sampling from Q, as summarized by Proposition 1:

Proposition 1. Sampling from Q requires sampling a latent z from ν in (4), then applying the map G:

$$x \sim Q \iff x = G(z), \quad z \sim \nu. \qquad (5)$$

Proposition 1 does not require the generator G to be invertible in order to hold. We provide a proof in Appendix C.1 which relies on a characterization of probability distributions using generalized moments. We will see later in Section 4 how equation (5) can be used to provide practical sampling algorithms for the GEBM. Next we discuss the advantages of GEBMs.

**Advantages of Generalized Energy Based Models.** The GEBM defined by (3) can be related to exponential tilting (re-weighting) (Siegmund, 1976; Xie et al., 2016) of the base G. The important difference over classical EBMs is that the base G is allowed to change its support and shape in space. By learning the base G, GEBMs can accurately learn the low-dimensional support of data, just like IGMs do. They also benefit from the flexibility of EBMs for representing densities, using an energy E to refine the distribution of mass on the support defined by G, as seen in Figure 1(c). Compared to EBMs, which put mass on the whole space by construction (positive density), GEBMs have the additional flexibility to concentrate the probability mass on a low-dimensional support learned by the base G, provided that the dimension of the latent space Z is smaller than the dimension of the ambient space X: see Figure 1(c) vs Figure 1(d). In the particular case when the dimension of Z is equal to the ambient dimension and G is invertible, the base G becomes supported over the whole space X, and GEBMs recover the usual EBMs. The next proposition further shows that any EBM can be viewed as a particular case of GEBMs, as proved in Appendix C.1.

Proposition 2. Any EBM with energy E (as in (1)) can be expressed as a GEBM with base G given as a normalizing flow with density exp(-r(x)) and a generalized energy equal to E(x) - r(x). In this particular case, the dimension of the latent is necessarily equal to the data dimension, i.e. dim(Z) = dim(X).

Compared to IGMs, which rely on a fixed pre-determined latent noise distribution η, GEBMs offer the additional flexibility of learning a richer latent noise distribution. This is particularly useful when the data is multimodal. In IGMs, such as GANs, the latent noise η is usually unimodal, thus requiring a more sophisticated generator to distort a unimodal noise distribution into a distribution with multiple modes, as shown by Cornish et al. (2020). Instead, GEBMs allow sampling from a posterior ν over the latent noise, defined in (4). This posterior noise can be multimodal in latent space (by incorporating information from the energy), and thus can put more or less mass in specific regions of the manifold defined by the base G.
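To make (3)-(5) concrete, the following is a minimal PyTorch sketch (our illustration, not the authors' released code) of the unnormalized posterior latent log-density (4) and of the importance re-weighting of base samples implied by (3). It assumes the base generator G and the energy E are available as modules and that η is a standard Gaussian; all module names are hypothetical stand-ins.

```python
import torch

def log_posterior_unnorm(z, generator, energy, latent_prior):
    """Unnormalized log nu(z) = log eta(z) - E(G(z)); the constant -A_{G,E} of (4) is dropped."""
    return latent_prior.log_prob(z) - energy(generator(z)).squeeze(-1)

def reweight_base_samples(generator, energy, latent_prior, n=1000):
    """Draw x = G(z), z ~ eta, and form self-normalized importance weights
    proportional to exp(-E(x)), following the re-weighting interpretation of (3)."""
    z = latent_prior.sample((n,))
    with torch.no_grad():
        x = generator(z)
        w = torch.softmax(-energy(x).squeeze(-1), dim=0)
    return x, w

# Toy usage with hypothetical stand-ins for the base G and energy E:
latent_dim, data_dim = 2, 2
generator = torch.nn.Linear(latent_dim, data_dim)
energy = torch.nn.Sequential(torch.nn.Linear(data_dim, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
prior = torch.distributions.MultivariateNormal(torch.zeros(latent_dim), torch.eye(latent_dim))
x, w = reweight_base_samples(generator, energy, prior)
lp = log_posterior_unnorm(prior.sample((5,)), generator, energy, prior)
```

As discussed in Section 4, self-normalized importance weights of this kind are unreliable in high dimensions, which is why sampling from ν is instead performed with MCMC on the unnormalized density above.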
Sampling from this posterior allows GEBMs to capture multimodality in the data, provided the support of the base is broad enough to subsume the data support (Figure 1(c)). The base can be simpler, compared to GANs, as it doesn't need to distort the input noise too much to produce multimodal samples (see Figure 8 in Appendix G.4). This additional flexibility comes at no additional training cost compared to GANs. Indeed, GANs still require another model during training, the discriminator network, but do not use it for sampling. GEBMs avoid this waste, since the base and energy can be trained jointly, with no other additional model, and then both are used for sampling.

## 3 Learning GEBMs

In this section we describe a general procedure for learning GEBMs. We decompose the learning procedure into two steps: an energy learning step and a base learning step. The overall learning procedure alternates between these two steps, as done in GAN training (Goodfellow et al., 2014).

### 3.1 Energy Learning

When the base G is fixed, varying the energy E leads to a family of models that all admit a density exp(-E - A_{G,E}) w.r.t. G. When the base G admits a density exp(-r) defined over the whole space, it is possible to learn the energy E by maximizing the likelihood of the model, $-\int (E+r)\,dP - A_{G,E}$. However, in general G is supported on a lower-dimensional manifold, so that r is ill-defined and the usual notion of likelihood cannot be used. Instead, we introduce a generalized notion of likelihood which does not require a well-defined density exp(-r) for G:

Definition 1 (Generalized Likelihood). The expected G-log-likelihood under a target distribution P of a GEBM model Q with base G and energy E is defined as

$$\mathcal{L}_{P,G}(E) := -\int E(x)\,dP(x) - A_{G,E}. \qquad (6)$$

To provide intuition about the generalized likelihood in Definition 1, we start by discussing the particular case where KL(P||G) < +∞. We then present the training method in the general case where P and G might not share the same support, i.e. KL(P||G) = +∞.

**Special case of finite KL(P||G).** When the Kullback-Leibler divergence between P and G is well defined, (6) corresponds to the Donsker-Varadhan (DV) lower bound on the KL (Donsker and Varadhan, 1975), meaning that KL(P||G) ≥ $\mathcal{L}_{P,G}(E)$ for all E. Moreover, the following proposition holds:

Proposition 3. Assume that KL(P||G) < +∞ and 0 ∈ E. If, in addition, E maximizes (6), then

$$\mathrm{KL}(P\|Q) \leq \mathrm{KL}(P\|G). \qquad (7)$$

In addition, we have that KL(P||Q) = 0 when E is the negative log-density ratio of P w.r.t. G.

We refer to Appendix C.1 for a proof. According to (7), the GEBM systematically improves over the IGM defined by G, with no further improvement possible in the limit case when G = P. Hence, as long as there is an error in mass on the common support of P and G, the GEBM improves over the base G.

**Estimating the likelihood in the general setting.** Definition 1 can be used to learn a maximum likelihood energy E by maximizing $\mathcal{L}_{P,G}(E)$ w.r.t. E, even when KL(P||G) is infinite and when P and G don't necessarily share the same support. Such an optimal solution is well defined whenever the set of energies is suitably constrained. This is the case if the energies are parametrized by a compact set Ψ with ψ ↦ E_ψ continuous over Ψ. Estimating the likelihood is then achieved using i.i.d.
samples (X_n)_{1:N}, (Y_m)_{1:M} from P and G (Tsuboi et al., 2009; Sugiyama et al., 2012; Liu et al., 2017):

$$\hat{\mathcal{L}}_{P,G}(E) = -\frac{1}{N}\sum_{n=1}^{N} E(X_n) - \log\left(\frac{1}{M}\sum_{m=1}^{M}\exp(-E(Y_m))\right). \qquad (8)$$

In the context of mini-batch stochastic gradient methods, however, M typically ranges from 10 to 1000, which can lead to a poor estimate of the log-partition function $A_{G,E}$. Moreover, (8) doesn't exploit estimates of $A_{G,E}$ from previous gradient iterations. Instead, we propose an estimator which introduces a variational parameter A ∈ R meant to estimate $A_{G,E}$ in an amortized fashion. The key idea is to exploit the convexity of the exponential, which directly implies $-A_{G,E} \geq -A - \exp(-A + A_{G,E}) + 1$ for any A ∈ R, with equality only when $A = A_{G,E}$. Therefore, (6) admits a lower bound of the form

$$\mathcal{L}_{P,G}(E) \;\geq\; -\int (E+A)\,dP - \int \exp(-(E+A))\,dG + 1 \;:=\; \mathcal{F}_{P,G}(E+A),$$

where we introduced the functional $\mathcal{F}_{P,G}$ for concision. Maximizing $\mathcal{F}_{P,G}(E+A)$ over A recovers the likelihood $\mathcal{L}_{P,G}(E)$. Moreover, jointly maximizing over E and A yields the maximum likelihood energy E* and its corresponding log-partition function $A^\star = A_{G,E^\star}$. This optimization is well-suited for stochastic gradient methods using the following estimator (Kanamori et al., 2011):

$$\hat{\mathcal{F}}_{P,G}(E+A) = -\frac{1}{N}\sum_{n=1}^{N}\big(E(X_n)+A\big) - \frac{1}{M}\sum_{m=1}^{M}\exp\big(-(E(Y_m)+A)\big) + 1. \qquad (9)$$

### 3.2 Base Learning

Unlike in Section 3.1, varying the base G does not need to preserve the same support. Thus, it is generally not possible to use maximum likelihood methods for learning G. Instead, we propose to use the generalized likelihood (6) evaluated at the optimal energy E* as a meaningful loss for learning G, and refer to it as the KL Approximate Lower-bound Estimate (KALE),

$$\mathrm{KALE}(P\|G) = \sup_{(E,A)\in\mathcal{E}\times\mathbb{R}} \mathcal{F}_{P,G}(E+A). \qquad (10)$$

From Section 3.1, KALE(P||G) is always a lower bound on KL(P||G). The bound becomes tight whenever the negative log-density of P w.r.t. G is well-defined and belongs to E (Appendix A). Moreover, Proposition 4 shows that KALE is a reliable criterion for measuring convergence; it is a consequence of (Zhang et al., 2017, Theorem B.1), with a proof in Appendix C.2.1:

Proposition 4. Assume all energies in E are L-Lipschitz and that any continuous function can be well approximated by linear combinations of energies in E (Assumptions (A) and (B) of Appendix C.2). Then KALE(P||G) ≥ 0, with equality only if P = G, and KALE(P||G_n) → 0 if and only if G_n → P in distribution.

The universal approximation assumption holds in particular when E contains feedforward networks. In fact, networks with a single neuron are enough, as shown in (Zhang et al., 2017, Theorem 2.3). The Lipschitz assumption holds when additional regularization of the energy is enforced during training, by methods such as spectral normalization (Miyato et al., 2018) or an additional regularizer I(ψ) on the energy E_ψ such as the gradient penalty (Gulrajani et al., 2017), as done in Section 6.

**Estimating KALE.** According to Arora et al. (2017), accurate finite sample estimates of divergences that result from an optimization procedure (such as in (10)) depend on the richness of the class E, and richer energy classes can result in slower convergence. Unlike divergences such as Jensen-Shannon, KL and the Wasserstein distance, which result from optimizing over a non-parametric and rich class of functions, KALE is restricted to a class of parametric energies E_ψ. Thus, (Arora et al., 2017, Theorem 3.1) applies, and guarantees good finite sample estimates, provided optimization is solved accurately.
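As a concrete illustration of (9), here is a minimal PyTorch sketch (ours, with hypothetical module names, not the paper's released implementation) of the amortized objective, maximized jointly over the energy parameters ψ and the scalar A:

```python
import torch

def kale_objective(energy, A, x_data, y_gen):
    """Amortized estimator (9) of F_{P,G}(E + A); maximize over (energy parameters, A).

    x_data: samples from P; y_gen: samples from the base G (treated as fixed data here).
    A is a scalar parameter estimating the log-partition A_{G,E} in an amortized fashion.
    """
    e_data = energy(x_data).squeeze(-1)
    e_gen = energy(y_gen).squeeze(-1)
    return -(e_data + A).mean() - torch.exp(-(e_gen + A)).mean() + 1.0

# Hypothetical usage: gradient ascent on (psi, A) for a fixed base.
data_dim = 2
energy = torch.nn.Sequential(torch.nn.Linear(data_dim, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
A = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam(list(energy.parameters()) + [A], lr=1e-3)

x_data = torch.randn(128, data_dim) + 2.0   # stand-in for real data
y_gen = torch.randn(128, data_dim)          # stand-in for base samples G(z)
for _ in range(100):
    loss = -kale_objective(energy, A, x_data, y_gen)   # minimize the negative objective
    opt.zero_grad(); loss.backward(); opt.step()
```

In the paper's setup, an additional regularizer I(ψ), such as a gradient penalty, and spectral normalization of the energy network are used alongside this objective to enforce the Lipschitz assumption of Proposition 4 (Sections 3.2 and 6).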
In Appendix B, we provide an analysis for the more general case where energies are not necessarily parametric but satisfy some further smoothness properties; we emphasize that our rates do not require the strong assumption that the density ratio is bounded above and below, as in (Nguyen et al., 2010).

**Smoothness of KALE.** Learning the base is achieved by minimizing K(θ) := KALE(P||G_θ) over the set of parameters Θ of the generator G_θ using first order methods (Duchi et al., 2011; Kingma and Ba, 2014; Arbel et al., 2019). This requires K(θ) to be smooth enough so that gradient methods converge to local minima and avoid instabilities during training (Chu et al., 2020). Ensuring smoothness of losses that result from an optimization procedure, as in (10), can be challenging. Results for the regularized Wasserstein are provided by Sanjabi et al. (2018), while more general losses are considered by Chu et al. (2020), albeit under stronger conditions than for our setting. Theorem 5 shows that when E, G_θ and their gradients are all Lipschitz, then K(θ) is smooth enough. We provide a proof for Theorem 5 in Appendix C.2.1.

Theorem 5. Under Assumptions (I) to (III) of Appendix C.2, sub-gradient methods on K converge to local optima. Moreover, K is Lipschitz and differentiable for almost all θ ∈ Θ, with

$$\nabla K(\theta) = \exp(-A_{G_\theta,E^\star}) \int \nabla_x E^\star(G_\theta(z))^\top \nabla_\theta G_\theta(z)\, \exp(-E^\star(G_\theta(z)))\, \eta(z)\,dz. \qquad (11)$$

Estimating the gradient in (11) is achieved by first optimizing over E_ψ and A using (9), with additional regularization I(ψ). The resulting estimators $\hat{E}^\star$ and $\hat{A}^\star$ are plugged into (12) to estimate ∇K(θ) using samples (Z_m)_{1:M} from η. Unlike for learning the energy E*, which benefits from using the amortized estimator of the log-partition function, we found that using the empirical log-partition for learning the base was more stable. We summarize the training procedure in Algorithm 1, which alternates between learning the energy and the base in a similar fashion to adversarial training.

Algorithm 1: Training GEBM
1: Input: P, N, M, $n_b$, $n_e$
2: Output: trained generator $G_\theta$ and energy $E_\psi$
3: Initialize θ, ψ and A
4: for $k = 1,\dots,n_b$ do
5: for $j = 1,\dots,n_e$ do
6: Sample $\{X_n\}_{1:N} \sim P$ and $\{Y_m\}_{1:M} \sim G_\theta$
7: $g_\psi \leftarrow \nabla_\psi\big({-\hat{\mathcal{F}}_{P,G_\theta}(E_\psi + A)} + I(\psi)\big)$
8: $\bar{A} \leftarrow \log\big(\tfrac{1}{M}\sum_{m=1}^{M}\exp(-E_\psi(Y_m))\big)$
9: $g_A \leftarrow \exp(A - \bar{A}) - 1$
10: Update ψ and A using $g_\psi$ and $g_A$
11: end for
12: Set $\hat{E}^\star \leftarrow E_\psi$ and $\hat{A}^\star \leftarrow \bar{A}$
13: Update θ using the estimate of ∇K(θ) from (12)
14: end for

$$\widehat{\nabla K}(\theta) = \frac{\exp(-\hat{A}^\star)}{M} \sum_{m=1}^{M} \nabla_x \hat{E}^\star(G_\theta(Z_m))^\top \nabla_\theta G_\theta(Z_m)\, \exp(-\hat{E}^\star(G_\theta(Z_m))). \qquad (12)$$

## 4 Sampling from GEBMs

A simple estimate of the distribution of observations under the GEBM can be obtained via importance sampling (IS). This consists of first sampling multiple points from the base G, and then re-weighting the samples according to the energy E. Although straightforward, this approach can lead to highly unreliable estimates, a well known problem in the Sequential Monte Carlo (SMC) literature, which employs IS extensively (Doucet et al., 2001; Del Moral et al., 2006). Other methods such as rejection sampling are known to be inefficient in high dimensions (Haugh, 2017). Instead, we propose to sample from the posterior ν using MCMC. Recall from (5) that a sample x from Q is of the form x = G(z), with z sampled from the posterior latent ν of (4) instead of the prior η. While sampling from η is often straightforward (for instance if η is a Gaussian), sampling from ν is generally harder, due to the dependence of its density on the complex functions E and G. It is still possible to use MCMC methods to sample from ν, however, since we have access to its density up to a normalizing constant (4).
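MCMC on the latent space only needs the gradient of this unnormalized log-density. As a minimal sketch of the simplest such sampler, the unadjusted Langevin (ULA) update introduced below can be written as follows (our illustration, not the paper's Algorithm 2; it assumes the trained base and energy are PyTorch modules and that η is a standard Gaussian):

```python
import torch

def ula_sample(generator, energy, z0, step_size=1e-4, n_steps=1000):
    """Unadjusted Langevin sampling of the posterior latent nu (eq. 4), assuming eta = N(0, I).
    Returns approximate GEBM samples x = G(z)."""
    z = z0.detach().clone()
    for _ in range(n_steps):
        z.requires_grad_(True)
        # log nu(z) up to a constant: log eta(z) - E(G(z)), with log eta(z) = -||z||^2 / 2 + const
        log_nu = -0.5 * (z ** 2).sum(dim=-1) - energy(generator(z)).squeeze(-1)
        grad = torch.autograd.grad(log_nu.sum(), z)[0]
        with torch.no_grad():
            z = z + step_size * grad + (2 * step_size) ** 0.5 * torch.randn_like(z)
    return generator(z).detach()
```

In the paper's experiments, the energy is additionally rescaled by an inverse temperature β and the step size is decayed (Section 6); the kinetic variant (KLA), described next, adds a momentum variable.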
More precisely, we are interested in methods that exploit the gradient of ν, and consider two classes of samplers: Overdamped samplers and Kinetic samplers. Overdamped samplers are obtained as a time-discretization of the Overdamped Langevin dynamics,

$$dz_t = \big(\nabla_z \log\eta(z_t) - \nabla_z E(G(z_t))\big)\,dt + \sqrt{2}\,dw_t, \qquad (13)$$

where w_t is a standard Brownian motion. The simplest sampler arising from (13) is the Unadjusted Langevin Algorithm (ULA):

$$Z_{k+1} = Z_k + \lambda\big(\nabla_z\log\eta(Z_k) - \nabla_z E(G(Z_k))\big) + \sqrt{2\lambda}\,W_{k+1}, \qquad Z_0 \sim \eta,$$

where $(W_k)_{k\geq 0}$ are i.i.d. standard Gaussians and λ is the step-size. For large k, $Z_k$ is an approximate sample from ν (Raginsky et al., 2017, Proposition 3.3). Hence, setting $X = G(Z_k)$ for a large enough k provides an approximate sample from the GEBM Q, as summarized in Algorithm 2 of Appendix F.

Kinetic samplers arise from the Kinetic Langevin dynamics, which introduce a momentum variable:

$$dz_t = v_t\,dt, \qquad dv_t = -\gamma v_t\,dt + u\big(\nabla\log\eta(z_t) - \nabla E(G(z_t))\big)\,dt + \sqrt{2\gamma u}\,dw_t, \qquad (14)$$

with friction coefficient γ > 0, inverse mass u > 0, momentum vector $v_t$ and standard Brownian motion $w_t$. When the mass $u^{-1}$ becomes negligible compared to the friction coefficient γ, i.e. $u\gamma^{-2} \to 0$, standard results show that (14) recovers the Overdamped dynamics (13). Discretization in time of (14) leads to Kinetic samplers similar to Hamiltonian Monte Carlo (Cheng et al., 2017; Sachs et al., 2017). We consider a particular algorithm from Sachs et al. (2017), which we call the Kinetic Langevin Algorithm (KLA) (see Algorithm 3 in Appendix F). Kinetic samplers were shown to better explore the modes of the invariant distribution ν compared to Overdamped ones (see (Neal, 2010; Betancourt et al., 2017) for empirical results and (Cheng et al., 2017) for theory), as also confirmed empirically in Appendix D for image generation tasks using GEBMs. Next, we provide the following convergence result:

Proposition 6. Assume that log η(z) is strongly concave and has a Lipschitz gradient, and that E, G and their gradients are all L-Lipschitz. Set $x_t = G(z_t)$, where $z_t$ is given by (14), and call $P_t$ the probability distribution of $x_t$. Then $P_t$ converges to Q in the Wasserstein sense, $W_2(P_t, Q) \leq LCe^{-c\gamma t}$, where c and C are positive constants independent of t, with $c = O(\exp(-\dim(Z)))$.

Proposition 6 is proved in Appendix C.1 using (Eberle et al., 2017, Corollary 2.6), and implies that $(x_t)_{t\geq 0}$ converges at the same speed as $(z_t)_{t\geq 0}$. When the dimension q of Z is orders of magnitude smaller than the input space dimension d, the process $(x_t)_{t\geq 0}$ converges faster than typical sampling methods on X, for which the exponent controlling the convergence rate is of order $O(\exp(-d))$.

## 5 Related Work

**Energy based models.** Usually, energy based models are required to have a density w.r.t. the Lebesgue measure, and do not use a learnable base measure; in other words, the models are supported on the whole space. Various methods have been proposed in the literature to learn EBMs. Contrastive Divergence (Hinton, 2002) approximates the gradient of the log-likelihood by sampling from the energy model with MCMC. More recently, (Belanger and McCallum, 2016; Xie et al., 2016; 2017; 2018c; 2019; Tu and Gimpel, 2018; Du and Mordatch, 2019; Deng et al., 2020) extend the idea using more sophisticated models and MCMC sampling strategies that lead to higher quality estimators.
Score Matching (Hyvärinen, 2005) calculates an alternative objective (the score) to the log-likelihood which is independent of the partition function, and was recently used in the context of non-parametric energy functions to provide estimators of the energy that are provably consistent (Sriperumbudur et al., 2017; Sutherland et al., 2018; Arbel and Gretton, 2018; Wenliang et al., 2019). In Noise-Contrastive Estimation (Gutmann and Hyvärinen, 2012), a classifier is trained to distinguish between samples from a fixed proposal distribution and the target P. This provides an estimate for the density ratio between the optimal energy model and the proposal distribution. In a similar spirit, Cranmer et al. (2016) use a classifier to learn likelihood ratios. Conversely, Grathwohl et al. (2020) interpret the logits of a classifier as an energy model obtained after marginalization over the classes. The resulting model is then trained using Contrastive Divergence. In more recent work, Dai et al. (2019a;b) exploit a dual formulation of the logarithm of the partition function as a supremum, over the set of all probability distributions, of some functional objective. Yu et al. (2020) explore methods for using general f-divergences, such as Jensen-Shannon, to train EBMs.

**Generative Adversarial Networks.** Recent work proposes using the discriminator of a trained GAN to improve the generator quality. Rejection sampling (Azadi et al., 2019) and Metropolis-Hastings correction (Turner et al., 2019; Neklyudov et al., 2019) perform sampling directly on the high-dimensional input space without using gradient information provided by the discriminator. Moreover, the data distribution is assumed to admit a density w.r.t. the generator. Ding et al. (2019) perform sampling on the feature space of some auxiliary pre-trained network, while Lawson et al. (2019) treat the sampling procedure as a model on its own, learned by maximizing the ELBO. In our case, no auxiliary model is needed. In the present work, sampling doesn't interfere with training, in contrast to recently considered methods that optimize over the latent space during training (Wu et al., 2019b;a). In Tanaka (2019), the discriminator is viewed as an optimal transport map between the generator and the data distribution and is used to compute optimized samples from latent space. This is in contrast to the diffusion-based sampling that we consider. In (Xie et al., 2018b;a), two independent models, a full-support EBM and a generator network, are trained cooperatively using MCMC. By contrast, in the present work, the energy and base are part of the same model, and the model support is lower-dimensional than the target space X. While we do not address the mode collapse problem, Xu et al. (2018); Nguyen et al. (2017) showed that KL-based losses are resilient to it thanks to the zero-avoiding property of the KL, a good sign for KALE, which is derived from the KL by Fenchel duality.

The closest related approach appears in a study concurrent to the present work (Che et al., 2020), where the authors propose to use Langevin dynamics on the latent space of a GAN generator, but with a different discriminator to ours (derived from the Jensen-Shannon divergence or a Wasserstein-based divergence).
Our theory results showing the existence of the loss gradient (Theorem 5), establishing weak convergence of distributions under KALE (Proposition 4), and demonstrating consistency of the KALE estimator (Appendix B) should transfer to the JS and Wasserstein criteria used in that work. Subsequent to the present work, an alternative approach has been proposed, based on normalising flows, to learn both the low-dimensional support of the data and the density on this support (Brehmer and Cranmer, 2020). This approach maximises the explicit likelihood of a data projection onto a learned manifold, and may be considered complementary to our approach.

## 6 Experiments

Figure 2: Samples at different iterations of the MCMC chain of Algorithm 3 (left to right).

### 6.1 Image Generation

**Experimental setting.** We train a GEBM on unsupervised image generation tasks, and compare the quality of generated samples with other methods using the FID score (Heusel et al., 2017) computed on $5\times 10^4$ generated samples. We consider CIFAR-10 (Krizhevsky, 2009), LSUN (Yu et al., 2015), CelebA (Liu et al., 2015) and ImageNet (Russakovsky et al., 2014), all downsampled to 32x32 resolution to reduce computational cost. We consider two network architectures for each of the base and energy, a smaller one (SNGAN ConvNet) and a larger one (SNGAN ResNet), both of which are from Miyato et al. (2018). For the base, we used the SNGAN generator networks from Miyato et al. (2018) with a 100-dimensional Gaussian for the latent noise η. For the energy, we used the SNGAN discriminator networks from Miyato et al. (2018). (Details of the networks are in Appendix G.1.) We train the models for 150,000 generator iterations using Algorithm 1. After training is completed, we rescale the energy by β = 100 to get a colder version of the GEBM, and sample from it using either Algorithm 2 (ULA) or Algorithm 3 (KLA) with parameters (γ = 100, u = 1). This colder temperature leads to an improved FID score, and needs relatively few MCMC iterations, as shown in Figure 6 of Appendix D. Sampler convergence to visually plausible modes at low temperatures is demonstrated in Figure 2. We perform 1000 MCMC iterations with an initial step-size of $\lambda = 10^{-4}$, decreased by a factor of 10 every 200 iterations. As a baseline, we consider samples generated from the base of the GEBM only (without using information from the energy), and call this KALE-GAN. More details are given in Appendix G.

**Results.** Table 1 shows that GEBM outperforms both KALE-GAN and standard GANs when using the same networks for the base/generator and energy/critic. Moreover, KALE-GAN matches the performance of a standard GAN (with Jensen-Shannon critic), showing that the improvement of GEBM cannot be explained by the switch from Jensen-Shannon to a KALE-based critic. Rather, the improvement is largely due to incorporating the energy function into the model, and sampling using Algorithm 3. This finding experimentally validates our claim that incorporating the energy improves the model, and that all else being equal, a GEBM outperforms a GAN with the same generator and critic architecture. Indeed, if the critic is not zero at convergence, then by definition it contains information on the remaining mismatch between the generator (base) and data mass, which the GEBM incorporates, but the GAN does not. The GEBM also outperforms an EBM, even when the latter was trained using a larger network (ResNet) with supervision (S) on ImageNet, which is an easier task (Chen et al., 2019).
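To make the cold-sampling configuration above concrete, here is a minimal sketch of a kinetic Langevin sampler on the latent space with the energy rescaled by an inverse temperature β. It uses a simple Euler discretization of (14) and is our illustration only, not the exact scheme of Algorithm 3 (Sachs et al., 2017); module names are hypothetical.

```python
import torch

def kinetic_langevin_sample(generator, energy, z0, beta=100.0, gamma=100.0, u=1.0,
                            step_size=1e-4, n_steps=1000):
    """Euler discretization of the kinetic dynamics (14) on the latent space, targeting the
    cold posterior nu_beta(z) proportional to eta(z) exp(-beta * E(G(z))), with eta = N(0, I)."""
    z, v = z0.detach().clone(), torch.zeros_like(z0)
    for _ in range(n_steps):
        z.requires_grad_(True)
        log_nu = -0.5 * (z ** 2).sum(dim=-1) - beta * energy(generator(z)).squeeze(-1)
        grad = torch.autograd.grad(log_nu.sum(), z)[0]
        with torch.no_grad():
            v = v - step_size * gamma * v + step_size * u * grad \
                + (2 * gamma * u * step_size) ** 0.5 * torch.randn_like(v)
            z = z + step_size * v
    return generator(z).detach()
```

With a large friction (γ = 100, as used below), the kinetic chain behaves like ULA and concentrates on modes of the GEBM; a smaller γ encourages more exploration across modes.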
More comparisons on Cifar10 and ImageNet are provided in Table 4 of Appendix D.

|          | SNGAN (ConvNet): GEBM | KALE-GAN | GAN   | SNGAN (ResNet): GEBM | KALE-GAN | GAN   | EBM       |
|----------|-----------------------|----------|-------|----------------------|----------|-------|-----------|
| Cifar10  | 23.02                 | 32.03    | 29.9  | 19.31                | 20.19    | 21.7  | 38.2      |
| ImageNet | 13.94                 | 19.37    | 20.66 | 20.33                | 21.00    | 20.50 | 14.31 (S) |

Table 1: FID scores for two versions of SNGAN from (Miyato et al., 2018) on Cifar10 and ImageNet. GEBM: training using Algorithm 1 and sampling using Algorithm 3. KALE-GAN: only the base of a GEBM is retained for sampling. GAN: training as in (Miyato et al., 2018), with q = 128 for the latent dimension as it worked best. EBM: results from Du and Mordatch (2019), with supervised training on ImageNet (S).

Table 2 shows different sampling methods using the same trained networks (generator and critic), with KALE-GAN as a baseline. All energy-exploiting methods outperform the unmodified KALE-GAN with the same architecture. That said, our method (both ULA and KLA) outperforms both IHM (Turner et al., 2019) and DOT (Tanaka, 2019), which both use the energy information.

|            | Cifar10 | LSUN  | CelebA | ImageNet |
|------------|---------|-------|--------|----------|
| KALE-GAN   | 32.03   | 21.67 | 6.91   | 19.37    |
| IHM        | 30.47   | 20.63 | 6.39   | 18.15    |
| DOT        | 26.35   | 20.41 | 5.93   | 16.21    |
| GEBM (ULA) | 23.02   | 16.23 | 5.21   | 14.00    |
| GEBM (KLA) | 24.29   | 15.25 | 5.38   | 13.94    |

Table 2: FID scores for different sampling methods using the same trained SNGAN (ConvNet); KALE-GAN serves as a baseline without critic information.

In Table 2, KLA was used in the high friction regime γ = 100, and thus behaves like ULA. This yields sharper samples concentrated around the modes of the GEBM, thus improving the FID score. If, instead, the goal is to encourage more exploration of the modes of the GEBM, then KLA with a smaller γ is a better alternative than ULA, as the former can explore multiple modes/images within the same MCMC chain, unlike ULA: see Figures 3 to 5 of Appendix D. Moving from one mode to another, however, results in an increased FID score while the chain is between modes; this can be avoided by decreasing λ.

### 6.2 Density Estimation

**Motivation.** We next consider the particular setting where the likelihood of the model is well-defined and admits a closed form expression. This is intended principally as a sanity check that our proposed training method in Algorithm 1 succeeds in learning maximum likelihood solutions. Outside of this setting, closed form expressions of the normalizing constant are not available for generic GEBMs. While this is not an issue (since the proposed method doesn't require a closed form expression for the normalizing constant), in this experiment only, we want to have access to closed form expressions, as they enable a direct comparison with other density estimation methods.

**Experimental setting.** To have a closed-form likelihood, we consider the case where the dimension of the latent space is equal to the data dimension, and choose the base G of the GEBM to be a Real NVP (Dinh et al., 2016) with density exp(-r(x)) and energy E(x) = h(x) - r(x). Thus, in this particular case, the GEBM has a well-defined likelihood over the whole space, and we are precisely in the setting of Proposition 2, which shows that this GEBM is equal to an EBM with density proportional to exp(-h). We further require the EBM to be a second Real NVP so that its density has a closed form expression. We consider 5 UCI datasets (Dheeru and Taniskidou, 2017), for which we use the same pre-processing as in (Wenliang et al., 2019). For comparison, we train the EBM by direct maximum likelihood (ML) and contrastive divergence (CD).
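In this invertible-base setting, the GEBM log-density (3) can be evaluated directly from the base log-density and the energy. As a rough illustration (a sketch with a Gaussian stand-in for the Real NVP base and a hypothetical MLP energy; the log-partition is estimated here by plain Monte Carlo rather than in closed form):

```python
import math
import torch

def gebm_log_density(x, base_dist, energy, n_mc=10_000):
    """Log-density of the GEBM (3) when the base has full support and a tractable log_prob:
    log q(x) = -E(x) + log g(x) - A_{G,E}, with A_{G,E} = log E_{Y~G}[exp(-E(Y))]
    estimated here by simple Monte Carlo over base samples."""
    with torch.no_grad():
        y = base_dist.sample((n_mc,))
        log_partition = torch.logsumexp(-energy(y).squeeze(-1), dim=0) - math.log(n_mc)
        return -energy(x).squeeze(-1) + base_dist.log_prob(x) - log_partition

# Toy usage (Gaussian stand-in for the flow base, hypothetical MLP energy):
d = 2
base = torch.distributions.MultivariateNormal(torch.zeros(d), torch.eye(d))
energy = torch.nn.Sequential(torch.nn.Linear(d, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
nll = -gebm_log_density(torch.randn(5, d), base, energy).mean()
```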
To train the GEBM, we use Algorithm 1, which doesn't directly exploit the closed-form expression of the likelihood (unlike direct ML). We thus use either (8) (KALE-DV) or (9) (KALE-F) to estimate the normalizing constant. More details are given in Appendix G.2.

**Results.** Table 3 reports the Negative Log-Likelihood (NLL) evaluated on the test set and corresponding to the best performance on the validation set. Training the GEBM using Algorithm 1 leads to comparable performance to (CD) and (ML). As shown in Figure 7 of Appendix E, (KALE-DV) and (KALE-F) maintain a small error gap between the training and test NLL and, as discussed in Section 3.1 and Appendix F, (KALE-F) leads to more accurate estimates of the log-partition function, with a relative error of order 0.1% compared to 10% for (KALE-DV).

|                  | RedWine (d=11, N≈10³) | WhiteWine (d=11, N≈10³) | Parkinsons (d=15, N≈10³) | Hepmass (d=22, N≈10⁵) | Miniboone (d=43, N≈10⁴) |
|------------------|-----------------------|-------------------------|--------------------------|-----------------------|-------------------------|
| NVP w/ ML        | 11.98                 | 13.05                   | 14.5                     | 24.89                 | 42.28                   |
| NVP w/ CD        | 11.88                 | 13.01                   | 14.06                    | **22.89**             | 39.36                   |
| NVP w/ KALE (DV) | 11.6                  | 12.77                   | **13.26**                | 26.56                 | 46.48                   |
| NVP w/ KALE (F)  | **11.19**             | **12.66**               | **13.26**                | 24.66                 | **38.35**               |

Table 3: UCI datasets: Negative log-likelihood computed on the test set and corresponding to the best performance on the validation set. Best method in boldface.

## 7 Acknowledgments

We thank Mihaela Rosca for insightful discussions and Song Liu, Bo Dai and Hanjun Dai for pointing us to important related work.

## References

Arbel, M. and Gretton, A. (2018). Kernel Conditional Exponential Family. In International Conference on Artificial Intelligence and Statistics, pages 1337–1346. Arbel, M., Gretton, A., Li, W., and Montufar, G. (2019). Kernelized Wasserstein Natural Gradient. Arbel, M., Sutherland, D., Bińkowski, M., and Gretton, A. (2018). On gradient regularizers for MMD GANs. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc. Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, International Convention Centre, Sydney, Australia. PMLR. Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. (2017). Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 224–232. PMLR. Azadi, S., Olsson, C., Darrell, T., Goodfellow, I., and Odena, A. (2019). Discriminator rejection sampling. In International Conference on Learning Representations. Belanger, D. and McCallum, A. (2016). Structured prediction energy networks. In International Conference on Machine Learning, pages 983–992. Betancourt, M., Byrne, S., Livingstone, S., and Girolami, M. (2017). The geometric foundations of Hamiltonian Monte Carlo. Bernoulli, 23(4A):2257–2298. Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. (2018). Demystifying MMD GANs. In International Conference on Learning Representations. Bottou, L., Arjovsky, M., Lopez-Paz, D., and Oquab, M. (2017). Geometrical insights for implicit generative modeling. In Braverman Readings in Machine Learning. Brehmer, J. and Cranmer, K. (2020). Flows for simultaneous manifold learning and density estimation. arXiv preprint arXiv:2003.13913. Brock, A., Donahue, J., and Simonyan, K. (2018). Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
Che, T., Zhang, R., Sohl-Dickstein, J., Larochelle, H., Paull, L., Cao, Y., and Bengio, Y. (2020). Your GAN is secretly an energy-based model and you should use discriminator driven latent sampling. Chen, T., Zhai, X., Ritter, M., Lucic, M., and Houlsby, N. (2019). Self-supervised gans via auxiliary rotation loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12154 12163. Cheng, X., Chatterji, N. S., Bartlett, P. L., and Jordan, M. I. (2017). Underdamped langevin mcmc: A non-asymptotic analysis. ar Xiv preprint ar Xiv:1707.03663. Chu, C., Minami, K., and Fukumizu, K. (2020). Smoothness and stability in gans. In International Conference on Learning Representations. Cornish, R., Caterini, A. L., Deligiannidis, G., and Doucet, A. (2020). Relaxing bijectivity constraints with continuously indexed normalising flows. Cranmer, K., Pavez, J., and Louppe, G. (2016). Approximating likelihood ratios with calibrated discriminative classifiers. Dai, B., Dai, H., Gretton, A., Song, L., Schuurmans, D., and He, N. (2019a). Kernel exponential family estimation via doubly dual embedding. In Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 2321 2330. PMLR. Dai, B., Liu, Z., Dai, H., He, N., Gretton, A., Song, L., and Schuurmans, D. (2019b). Exponential Family Estimation via Adversarial Dynamics Embedding. ar Xiv:1904.12083 [cs, stat]. ar Xiv: 1904.12083. Published as a conference paper at ICLR 2021 Davis, D. and Drusvyatskiy, D. (2018). Stochastic subgradient method converges at the rate $O(k {-1/4})$ on weakly convex functions. ar Xiv:1802.02988 [cs, math]. Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential monte carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411 436. Deng, Y., Bakhtin, A., Ott, M., Szlam, A., and Ranzato, M. (2020). Residual energy-based models for text generation. ar Xiv preprint ar Xiv:2004.11714. Dheeru, D. and Taniskidou, E. K. (2017). Uci machine learning repository. Ding, X., Wang, Z. J., and Welch, W. J. (2019). Subsampling Generative Adversarial Networks: Density Ratio Estimation in Feature Space with Softplus Loss. Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2016). Density estimation using real nvp. Donahue, J. and Simonyan, K. (2019). Large Scale Adversarial Representation Learning. ar Xiv:1907.02544 [cs, stat]. ar Xiv: 1907.02544. Donsker, M. D. and Varadhan, S. R. S. (1975). Asymptotic evaluation of certain markov process expectations for large time, i. 28(1):1 47. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpa.3160280102. Doucet, A., Freitas, N. d., and Gordon, N. (2001). Sequential Monte Carlo Methods in Practice. Information Science and Statistics. Springer-Verlag, New York. Du, Y. and Mordatch, I. (2019). Implicit generation and modeling with energy based models. In Advances in Neural Information Processing Systems 32, pages 3608 3618. Curran Associates, Inc. Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12(Jul):2121 2159. Eberle, A., Guillin, A., and Zimmer, R. (2017). Couplings and quantitative contraction rates for Langevin dynamics. The Annals of Probability. Ekeland, I. and Témam, R. (1999). Convex Analysis and Variational Problems. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics. 
Feydy, J., Séjourné, T., Vialard, F.-X., Amari, S.-i., Trouvé, A., and Peyré, G. (2019). Interpolating between optimal transport and mmd using sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2681 2690. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672 2680. Curran Associates, Inc. Grathwohl, W., Wang, K.-C., Jacobsen, J.-H., Duvenaud, D., Norouzi, M., and Swersky, K. (2020). Your classifier is secretly an energy based model and you should treat it like one. Grover, A., Song, J., Kapoor, A., Tran, K., Agarwal, A., Horvitz, E. J., and Ermon, S. (2019). Bias correction of learned generative models using likelihood-free importance weighting. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. (2017). Improved training of wasserstein gans. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA. Curran Associates Inc. Gutmann, M. U. and Hyvärinen, A. (2012). Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13(null):307 361. Haugh, M. (2017). Mcmc and bayesian modeling. IEOR E4703 Monte-Carlo Simulation, Columbia University. Published as a conference paper at ICLR 2021 Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems 30, pages 6626 6637. Curran Associates, Inc. Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771 1800. Ho, J. and Ermon, S. (2016). Generative adversarial imitation learning. In Advances in neural information processing systems, pages 4565 4573. Hyvärinen, A. (2005). Estimation of Non-Normalized Statistical Models by Score Matching. The Journal of Machine Learning Research, 6:695 709. Kanamori, T., Suzuki, T., and Sugiyama, M. (2011). f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models. IEEE Transactions on Information Theory, 58(2):708 720. Kingma, D. P. and Ba, J. (2014). Adam: A Method for Stochastic Optimization. ar Xiv:1412.6980 [cs]. ar Xiv: 1412.6980. Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. ICLR. Klenke, A. (2008). Probability Theory: A Comprehensive Course. World Publishing Corporation. Kodali, N., Abernethy, J., Hays, J., and Kira, Z. (2017). On Convergence and Stability of GANs. ar Xiv:1705.07215 [cs]. ar Xiv: 1705.07215. Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto. Lawson, J., Tucker, G., Dai, B., and Ranganath, R. (2019). Energy-inspired models: Learning with sampler-induced distributions. In Advances in Neural Information Processing Systems 32, pages 8501 8513. Curran Associates, Inc. Le Cun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F.-J. (2006). Predicting Structured Data, chapter A Tutorial on Energy-Based Learning. MIT Press. Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Poczos, B. (2017). Mmd gan: Towards deeper understanding of moment matching network. 
In Advances in Neural Information Processing Systems 30, pages 2203 2213. Curran Associates, Inc. Liu, S., Bousquet, O., and Chaudhuri, K. (2017). Approximation and Convergence Properties of Generative Adversarial Learning. Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. Milgrom, P. and Segal, I. (2002). Envelope Theorems for Arbitrary Choice Sets. Econometrica, 70. Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. (2018). Spectral normalization for generative adversarial networks. In International Conference on Learning Representations. Nagarajan, V. and Kolter, J. Z. (2017). Gradient descent gan optimization is locally stable. Neal, R. M. (2010). Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo. Neklyudov, K., Egorov, E., and Vetrov, D. (2019). The implicit metropolis-hastings algorithm. Nguyen, T., Le, T., Vu, H., and Phung, D. (2017). Dual discriminator generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2670 2680. Nguyen, X., Wainwright, M. J., and Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847 5861. Published as a conference paper at ICLR 2021 Nowozin, S., Cseke, B., and Tomioka, R. (2016). f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems 29, pages 271 279. Curran Associates, Inc. Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. (2016). Pixel recurrent neural networks. ar Xiv preprint ar Xiv:1601.06759. Ostrovski, G., Dabney, W., and Munos, R. (2018). Autoregressive quantile networks for generative modeling. ar Xiv preprint ar Xiv:1806.05575. Papamakarios, G., Pavlakou, T., and Murray, I. (2017). Masked autoregressive flow for density estimation. NIPS. Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. ar Xiv preprint ar Xiv:1511.06434. Raginsky, M., Rakhlin, A., and Telgarsky, M. (2017). Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. Retherford, J. R. (1978). Review: J. diestel and j. j. uhl, jr., vector measures. Bull. Amer. Math. Soc., 84(4):681 685. Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML 15, pages 1530 1538. JMLR.org. Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278 1286. Rockafellar, R. T. (1970). Convex analysis. Princeton Mathematical Series. Princeton University Press, Princeton, N. J. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2014). Image Net Large Scale Visual Recognition Challenge. ar Xiv:1409.0575 [cs]. ar Xiv: 1409.0575. Sachs, M., Leimkuhler, B., and Danos, V. (2017). Langevin Dynamics with Variable Coefficients and Nonconservative Forces: From Stationary States to Numerical Methods. Entropy, 19. Sanjabi, M., Ba, J., Razaviyayn, M., and Lee, J. D. (2018). On the convergence and robustness of training gans with regularized optimal transport. In Advances in Neural Information Processing Systems 31, pages 7091 7101. Curran Associates, Inc. Siegmund, D. (1976). 
Importance sampling in the monte carlo study of sequential tests. The Annals of Statistics, pages 673 684. Simon-Gabriel, C.-J. and Scholkopf, B. (2018). Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. Journal of Machine Learning Research, 19(44):1 29. Simsekli, U., Zhu, L., Teh, Y. W., and Gurbuzbalaban, M. (2020). Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise. ar Xiv:2002.05685 [cs, stat]. ar Xiv: 2002.05685. Sriperumbudur, B., Fukumizu, K., Kumar, R., Gretton, A., and Hyvärinen, A. (2017). Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research. Sugiyama, M., Suzuki, T., and Kanamori, T. (2012). Density ratio estimation in machine learning. Cambridge University Press. Sutherland, D., Strathmann, H., Arbel, M., and Gretton, A. (2018). Efficient and principled score estimation with Nystrom kernel exponential families. In International Conference on Artificial Intelligence and Statistics, pages 652 660. Published as a conference paper at ICLR 2021 Tanaka, A. (2019). Discriminator optimal transport. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc. Thekumparampil, K. K., Jain, P., Netrapalli, P., and Oh, S. (2019). Efficient algorithms for smooth minimax optimization. In Advances in Neural Information Processing Systems 32, pages 12680 12691. Curran Associates, Inc. Thiry, L., Arbel, M., Belilovsky, E., and Oyallon, E. (2021). The unreasonable effectiveness of patches in deep convolutional kernels methods. In International Conference on Learning Representations. Tsuboi, Y., Kashima, H., Hido, S., Bickel, S., and Sugiyama, M. (2009). Direct density ratio estimation for large-scale covariate shift adaptation. Journal of Information Processing, 17:138 155. Tu, L. and Gimpel, K. (2018). Learning approximate inference networks for structured prediction. ar Xiv preprint ar Xiv:1803.03376. Turner, R., Hung, J., Frank, E., Saatchi, Y., and Yosinski, J. (2019). Metropolis-Hastings generative adversarial networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6345 6353, Long Beach, California, USA. PMLR. Villani, C. (2009). Optimal transport: Old and new. Technical report. Wenliang, L., Sutherland, D., Strathmann, H., and Gretton, A. (2019). Learning deep kernels for exponential family densities. In International Conference on Machine Learning, pages 6737 6746. Wu, Y., Donahue, J., Balduzzi, D., Simonyan, K., and Lillicrap, T. (2019a). LOGAN: Latent Optimisation for Generative Adversarial Networks. ar Xiv:1912.00953 [cs, stat]. ar Xiv: 1912.00953. Wu, Y., Rosca, M., and Lillicrap, T. (2019b). Deep compressed sensing. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6850 6860, Long Beach, California, USA. PMLR. Xie, J., Lu, Y., Gao, R., and Wu, Y. N. (2018a). Cooperative learning of energy-based model and latent variable model via mcmc teaching. In AAAI, volume 1, page 7. Xie, J., Lu, Y., Gao, R., Zhu, S.-C., and Wu, Y. N. (2018b). Cooperative training of descriptor and generator networks. IEEE transactions on pattern analysis and machine intelligence, 42(1):27 45. Xie, J., Lu, Y., Zhu, S.-C., and Wu, Y. (2016). A theory of generative convnet. In International Conference on Machine Learning, pages 2635 2644. 
Xie, J., Zheng, Z., Gao, R., Wang, W., Zhu, S.-C., and Nian Wu, Y. (2018c). Learning descriptor networks for 3d shape synthesis and analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8629–8638. Xie, J., Zhu, S.-C., and Nian Wu, Y. (2017). Synthesizing dynamic patterns by spatial-temporal generative convnet. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7093–7101. Xie, J., Zhu, S.-C., and Wu, Y. N. (2019). Learning energy-based spatial-temporal generative convnets for dynamic patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence. Xu, K., Du, C., Li, C., Zhu, J., and Zhang, B. (2018). Learning implicit generative models by teaching density estimators. arXiv preprint arXiv:1807.03870. Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., and Xiao, J. (2015). LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. Yu, L., Song, Y., Song, J., and Ermon, S. (2020). Training deep energy-based models with f-divergence minimization. Zenke, F., Poole, B., and Ganguli, S. (2017). Continual learning through synaptic intelligence. Proceedings of Machine Learning Research, 70:3987. Zhang, P., Liu, Q., Zhou, D., Xu, T., and He, X. (2017). On the Discrimination-Generalization Tradeoff in GANs. arXiv:1711.02771 [cs, stat].

## A KL Approximate Lower-Bound Estimate

We discuss the relation between KALE (10) and the Kullback-Leibler divergence via Fenchel duality. Recall that a distribution P is said to admit a density w.r.t. G if there exists a real-valued measurable function $r_0$ that is integrable w.r.t. G and satisfies $dP = r_0\,dG$. Such a density is also called the Radon-Nikodym derivative of P w.r.t. G. In this case, we have:

$$\mathrm{KL}(P\|G) = \int r_0 \log(r_0)\,dG. \qquad (15)$$

Nguyen et al. (2010); Nowozin et al. (2016) derived a variational formulation for the KL using Fenchel duality. By the duality theorem (Rockafellar, 1970), the convex and lower semi-continuous function $\zeta : u \mapsto u\log(u)$ that appears in (15) can be expressed as the supremum of a concave function: $\zeta(u) = \sup_v\, uv - \zeta^\star(v)$. The function $\zeta^\star$ is called the Fenchel dual and is defined as $\zeta^\star(v) = \sup_u\, uv - \zeta(u)$. By convention, the value of the objective is set to $-\infty$ whenever u is outside of the domain of definition of ζ. When $\zeta(u) = u\log(u)$, the Fenchel dual $\zeta^\star(v)$ admits a closed form expression of the form $\zeta^\star(v) = \exp(v-1)$. Using the expression of ζ in terms of its Fenchel dual $\zeta^\star$, it is possible to express KL(P||G) as the supremum of the variational objective (16) over all measurable functions h,

$$\mathcal{F}(h) := -\int h\,dP - \int \exp(-h)\,dG + 1. \qquad (16)$$

Nguyen et al. (2010) provided the variational formulation for the reverse KL using a different choice for ζ ($\zeta(u) = -\log(u)$). We refer to (Nowozin et al., 2016) for general f-divergences. Choosing a smaller set of functions H in the variational objective (16) will lead to a lower bound on the KL. This is the KL Approximate Lower-bound Estimate (KALE):

$$\mathrm{KALE}(P\|G) = \sup_{h\in\mathcal{H}} \mathcal{F}(h). \qquad (17)$$

In general, KL(P||G) ≥ KALE(P||G). The bound is tight whenever the negative log-density $h_0 = -\log r_0$ belongs to H; however, we do not require $r_0$ to be well-defined in general. Equation (17) has the advantage that it can be estimated using samples from P and G. Given i.i.d. samples $(X_1,\dots,X_N)$ and $(Y_1,\dots,Y_M)$ from P and G, we denote by $\hat{P}$ and $\hat{G}$ the corresponding empirical distributions. A simple approach to estimate KALE(P||G) is to use an M-estimator.
The M-estimator is obtained by optimizing the penalized objective

\hat{h} := \arg\max_{h \in H} \hat{F}(h) - \frac{\lambda}{2} I^2(h),   (18)

where \hat{F} is an empirical version of F and I^2(h) is a penalty term that prevents overfitting due to finite samples. The penalty I^2(h) acts as a regularizer favoring smoother solutions, while the parameter \lambda determines the strength of the smoothing and is chosen to decrease as the sample sizes N and M increase. The M-estimator of KALE(P||G) is then obtained simply by plugging \hat{h} into the empirical objective \hat{F}:

\widehat{KALE}(P||G) := \hat{F}(\hat{h}).   (19)

We defer the consistency analysis of (19) to Appendix B, where we provide convergence rates in a setting where the set of functions H is a reproducing kernel Hilbert space, under weaker assumptions than those covered by the framework of Nguyen et al. (2010).

B  CONVERGENCE RATES OF KALE

In this section, we provide a convergence rate for the estimator in (19) when H is an RKHS. The theory remains the same whether H contains constants or not. With this choice, the representer theorem allows us to reduce the potentially infinite-dimensional optimization problem in (18) to a convex finite-dimensional one. We further restrict ourselves to the well-specified case where the density r_0 of P w.r.t. G is well defined and belongs to H, so that KALE matches the KL. While Nguyen et al. (2010) (Theorem 3) provide a convergence rate of 1/\sqrt{N} for a related M-estimator, this requires the density r_0 to be bounded away from 0 as well as (generally) upper-bounded. This can be quite restrictive if, for instance, r_0 is the density ratio of two Gaussians. In Theorem 7, we provide a similar convergence rate for the estimator defined in (19) without requiring r_0 to be bounded. We briefly introduce the notation, the working assumptions and the statement of the convergence result in Appendix B.1, and provide the proofs in Appendix B.2.

B.1  STATEMENT OF THE RESULT

We recall that an RKHS H of functions defined on a domain X \subset R^d and with kernel k is a Hilbert space with inner product \langle \cdot, \cdot \rangle, such that y \mapsto k(x,y) belongs to H for any x \in X, and k(x,y) = \langle k(x,\cdot), k(y,\cdot) \rangle for all x, y \in X. Any function f in H satisfies the reproducing property f(x) = \langle f, k(x,\cdot) \rangle for any x \in X. Recall that KALE(P||G) is obtained as an optimization problem

KALE(P||G) = \sup_{h \in H} F(h),   (20)

where F is given by

F(h) := -\int h \, dP - \int \exp(-h) \, dG + 1.

Since the negative log density ratio h_0 is assumed to belong to H, this directly implies that the supremum of F is achieved at h_0 and F(h_0) = KALE(P||G). We are interested in estimating KALE(P||G) using the empirical distributions

\hat{P} := \frac{1}{N} \sum_{n=1}^{N} \delta_{X_n}, \qquad \hat{G} := \frac{1}{N} \sum_{n=1}^{N} \delta_{Y_n},

where (X_n)_{1 \le n \le N} and (Y_n)_{1 \le n \le N} are i.i.d. samples from P and G. For this purpose we introduce the empirical objective functional

\hat{F}(h) := -\int h \, d\hat{P} - \int \exp(-h) \, d\hat{G} + 1.

The proposed estimator is obtained by solving a regularized empirical problem,

\sup_{h \in H} \hat{F}(h) - \frac{\lambda}{2} \|h\|^2,   (21)

with a corresponding population version,

\sup_{h \in H} F(h) - \frac{\lambda}{2} \|h\|^2.   (22)

Finally, we introduce D(h,\delta) and \Gamma(h,\delta):

D(h,\delta) = \int \delta \exp(-h) \, dG - \int \delta \, dP, \qquad \Gamma(h,\delta) = \int \int_0^1 (1-t)\, \delta^2 \exp(-(h+t\delta)) \, dt \, dG.

The empirical versions of D(h,\delta) and \Gamma(h,\delta) are denoted \hat{D}(h,\delta) and \hat{\Gamma}(h,\delta). Later, we will show that D(h,\delta) and \hat{D}(h,\delta) are in fact the gradients of F(h) and \hat{F}(h) along the direction \delta. We state now the working assumptions:

(i) The supremum of F over H is attained at h_0.

(ii) The following quantities are finite for some positive \epsilon:

\int \sqrt{k(x,x)} \, dP(x), \qquad \int k(x,x) \exp\big( (\|h_0\| + \epsilon) \sqrt{k(x,x)} \big) \, dG(x), \qquad \int \sqrt{k(x,x)} \exp\big( (\|h_0\| + \epsilon) \sqrt{k(x,x)} \big) \, dG(x).
(iii) For any h \in H, if D(h,\delta) = 0 for all \delta \in H, then h = h_0.

Theorem 7. Fix any 1 > \eta > 0. Under Assumptions (i) to (iii), and provided that \lambda = 1/\sqrt{N}, it holds with probability at least 1 - 2\eta that

|\hat{F}(\hat{h}) - F(h_0)| \le \frac{M(\eta, h_0)}{\sqrt{N}},

for a constant M(\eta, h_0) that depends only on \eta and h_0.

The assumptions in Theorem 7 essentially state that the kernel associated with the RKHS H needs to satisfy some integrability requirements; these guarantee that the directional derivative \delta \mapsto D(h,\delta) and its empirical version are well defined and continuous. In addition, the optimality condition \nabla F(h) = 0 is assumed to characterize the global solution h_0. This will be the case if the kernel is characteristic (Simon-Gabriel and Schölkopf, 2018). The proof of Theorem 7, in Appendix B.2, takes advantage of the Hilbert structure of the set H, the concavity of the functional F and the optimality condition \nabla \hat{F}(\hat{h}) = \lambda \hat{h} of the regularized problem, all of which turn out to be sufficient for controlling the error of (19). We state now the proof of Theorem 7, followed by the lemmas and propositions it relies on.

Proof of Theorem 7. We begin with the following inequalities:

\frac{\lambda}{2} \big( \|\hat{h}\|^2 - \|h_0\|^2 \big) \le \hat{F}(\hat{h}) - \hat{F}(h_0) \le \langle \nabla \hat{F}(h_0), \hat{h} - h_0 \rangle.

The first inequality is by definition of \hat{h}, while the second is obtained by concavity of \hat{F}. For simplicity we write B = \|\hat{h} - h_0\| and C = \|\nabla \hat{F}(h_0) - \nabla F(h_0)\|. Using the Cauchy-Schwarz and triangle inequalities, and recalling that \nabla F(h_0) = 0, it is easy to see that

-\frac{\lambda}{2} \big( B^2 + 2 B \|h_0\| \big) \le \hat{F}(\hat{h}) - \hat{F}(h_0) \le C B.

Moreover, by the triangle inequality, it holds that B \le \|h_\lambda - h_0\| + \|\hat{h} - h_\lambda\|. Lemma 11 ensures that A(\lambda) = \|h_\lambda - h_0\| converges to 0 as \lambda \to 0. Furthermore, by Proposition 12, we have \|\hat{h} - h_\lambda\| \le \frac{1}{\lambda} D(\lambda), where D(\lambda) = \|\nabla \hat{F}(h_\lambda) - \nabla F(h_\lambda)\|. Now choosing \lambda = 1/\sqrt{N} and applying the Chebyshev inequality in Lemma 8, it follows that for any 1 > \eta > 0, with probability greater than 1 - 2\eta, both

D(\lambda) \le \frac{C(\|h_0\|, \eta)}{\sqrt{N}} \qquad \text{and} \qquad C \le \frac{C(\|h_0\|, \eta)}{\sqrt{N}},

where C(\|h_0\|, \eta) is defined in Lemma 8. This allows us to conclude that for any \eta > 0, it holds with probability at least 1 - 2\eta that

|\hat{F}(\hat{h}) - F(h_0)| \le \frac{M(\eta, h_0)}{\sqrt{N}},

where M(\eta, h_0) depends only on \eta and h_0.

We proceed using the following lemma, which provides an expression for D(h,\delta) and \hat{D}(h,\delta), along with a probabilistic bound.

Lemma 8. Under Assumptions (i) and (ii), for any h \in H such that \|h\| \le \|h_0\| + \epsilon, there exists D(h) in H satisfying D(h,\delta) = \langle \delta, D(h) \rangle, and for any h \in H, there exists \hat{D}(h) satisfying \hat{D}(h,\delta) = \langle \delta, \hat{D}(h) \rangle. Moreover, for any 0 < \eta < 1 and any h \in H such that \|h\| \le \|h_0\| + \epsilon =: M, it holds with probability greater than 1 - \eta that

\| \hat{D}(h) - D(h) \| \le \frac{C(M, \eta)}{\sqrt{N}},

where C(M, \eta) depends only on M and \eta.

Proof. First, we show that \delta \mapsto D(h,\delta) is a bounded linear operator. Indeed, Assumption (ii) ensures that k(x,\cdot) and k(x,\cdot)\exp(-h(x)) are Bochner integrable w.r.t. P and G (Retherford, 1978); hence D(h,\delta) can be written as

D(h,\delta) = \langle \delta, \mu_{\exp(-h)G} - \mu_P \rangle,

where \mu_{\exp(-h)G} = \int k(x,\cdot) \exp(-h(x)) \, dG(x) and \mu_P = \int k(x,\cdot) \, dP(x). Defining D(h) to be \mu_{\exp(-h)G} - \mu_P leads to the desired result. \hat{D}(h) is simply obtained by taking the empirical version of D(h). Finally, the probabilistic inequality is a simple consequence of Chebyshev's inequality.

The next lemma states that F(h) and \hat{F}(h) are Fréchet differentiable.

Lemma 9. Under Assumptions (i) and (ii), h \mapsto F(h) is Fréchet differentiable on the open ball of radius \|h_0\| + \epsilon, while h \mapsto \hat{F}(h) is Fréchet differentiable on H. Their gradients are given by D(h) and \hat{D}(h) as defined in Lemma 8:

\nabla F(h) = D(h), \qquad \nabla \hat{F}(h) = \hat{D}(h).

Proof. The empirical functional \hat{F}(h) is differentiable since it is a finite sum of differentiable functions, and its gradient is simply given by \hat{D}(h).
For the population functional, we use a second-order Taylor expansion of the exponential with integral remainder, which gives

F(h + \delta) = F(h) + D(h,\delta) - \Gamma(h,\delta).

By Assumption (ii), we know that \Gamma(h,\delta)/\|\delta\| converges to 0 as \|\delta\| \to 0. This allows us to conclude directly that F is Fréchet differentiable, with differential given by \delta \mapsto D(h,\delta). By Lemma 8, we conclude the existence of a gradient \nabla F(h), which is in fact given by \nabla F(h) = D(h).

From now on, we will only use the notation \nabla F(h) and \nabla \hat{F}(h) to refer to the gradients of F(h) and \hat{F}(h). The following lemma states that (21) and (22) have unique global optima, and gives a first-order optimality condition.

Lemma 10. The problems (21) and (22) admit unique global solutions \hat{h} and h_\lambda in H. Moreover, the following first-order optimality conditions hold:

\lambda \hat{h} = \nabla \hat{F}(\hat{h}), \qquad \lambda h_\lambda = \nabla F(h_\lambda).

Proof. For (21), existence and uniqueness of a maximizer \hat{h} is a simple consequence of the continuity and strong concavity of the regularized objective. We now show the existence result for (22). Let us introduce G_\lambda(h) = -F(h) + \frac{\lambda}{2}\|h\|^2 for simplicity, so that (22) amounts to minimizing G_\lambda. Uniqueness is a consequence of the strong convexity of G_\lambda. For existence, consider a sequence of elements f_k \in H such that G_\lambda(f_k) \to \inf_{h \in H} G_\lambda(h). If h_0 is not the global solution, then it must hold for k large enough that G_\lambda(f_k) \le G_\lambda(h_0). We also know that F(f_k) \le F(h_0); hence, it is easy to see that \|f_k\| \le \|h_0\| for k large enough. This implies that f_k is a bounded sequence, and therefore it admits a weakly convergent subsequence by weak compactness. Without loss of generality, we assume that f_k converges weakly to some element h_\lambda \in H and that \|f_k\| \le \|h_0\|. Hence, \|h_\lambda\| \le \liminf_k \|f_k\| \le \|h_0\|. Recall now that, by definition of weak convergence in an RKHS, f_k(x) \to h_\lambda(x) for all x \in X. By Assumption (ii), we can apply the dominated convergence theorem to ensure that F(f_k) \to F(h_\lambda). Taking the limit of G_\lambda(f_k), the following inequality holds:

G_\lambda(h_\lambda) \le \liminf_k G_\lambda(f_k) = \inf_{h \in H} G_\lambda(h),

so that h_\lambda is indeed a global solution of (22). Finally, by Lemma 9 we know that F is Fréchet differentiable; hence we can use Ekeland and Témam (1999) (Proposition 2.1) to conclude that \nabla F(h_\lambda) = \lambda h_\lambda. We use exactly the same arguments for (21).

Next, we show that h_\lambda converges towards h_0 in H.

Lemma 11. Under Assumptions (i) to (iii), it holds that A(\lambda) := \|h_\lambda - h_0\| \to 0 as \lambda \to 0.

Proof. We will first prove that h_\lambda converges weakly towards h_0, and then conclude that it must also converge strongly. We start with the following inequalities:

0 \le F(h_0) - F(h_\lambda) \le \frac{\lambda}{2} \big( \|h_0\|^2 - \|h_\lambda\|^2 \big).

These are simple consequences of the definitions of h_0 and h_\lambda as optimal solutions to (20) and (22). This implies that \|h_\lambda\| is always bounded by \|h_0\|. Consider now an arbitrary sequence (\lambda_m)_{m \ge 0} converging to 0. Since \|h_{\lambda_m}\| is bounded by \|h_0\|, it follows by weak compactness of balls in H that h_{\lambda_m} admits a weakly convergent subsequence. Without loss of generality, we can assume that h_{\lambda_m} itself converges weakly towards an element h^\star. We will now show that h^\star must be equal to h_0. Indeed, by optimality of h_{\lambda_m}, it must hold that \lambda_m h_{\lambda_m} = \nabla F(h_{\lambda_m}). This implies that \nabla F(h_{\lambda_m}) converges to 0. On the other hand, by Assumption (ii), we can conclude that \nabla F(h_{\lambda_m}) must also converge weakly towards \nabla F(h^\star); hence \nabla F(h^\star) = 0. Finally, by Assumption (iii) we know that h_0 is the unique solution to the equation \nabla F(h) = 0; hence h^\star = h_0. We have shown so far that any subsequence of h_{\lambda_m} that converges weakly must converge weakly towards h_0. This allows us to conclude that h_{\lambda_m} actually converges weakly towards h_0. Moreover, by lower semi-continuity of the norm under weak convergence, we also have

\|h_0\| \le \liminf_m \|h_{\lambda_m}\|.
Recalling now that \|h_{\lambda_m}\| \le \|h_0\|, it follows that \|h_{\lambda_m}\| converges towards \|h_0\|. Hence, we have the following two properties: h_{\lambda_m} converges weakly towards h_0, and \|h_{\lambda_m}\| converges towards \|h_0\|. This allows us to conclude directly that \|h_{\lambda_m} - h_0\| converges to 0.

Proposition 12. We have that

\lambda \, \|\hat{h} - h_\lambda\| \le \| \nabla \hat{F}(h_\lambda) - \nabla F(h_\lambda) \|.

Proof. By definition of \hat{h} and h_\lambda, the following optimality conditions hold:

\lambda \hat{h} = \nabla \hat{F}(\hat{h}), \qquad \lambda h_\lambda = \nabla F(h_\lambda).

We can then simply write

\lambda (\hat{h} - h_\lambda) - \big( \nabla \hat{F}(\hat{h}) - \nabla \hat{F}(h_\lambda) \big) = \nabla \hat{F}(h_\lambda) - \nabla F(h_\lambda).

Now introducing \delta := \hat{h} - h_\lambda and E := \nabla \hat{F}(\hat{h}) - \nabla \hat{F}(h_\lambda) for simplicity, and taking the squared norm of the above equation, it follows that

\lambda^2 \|\delta\|^2 + \|E\|^2 - 2\lambda \langle \delta, E \rangle = \| \nabla \hat{F}(h_\lambda) - \nabla F(h_\lambda) \|^2.

By concavity of \hat{F} on H, we know that \langle \hat{h} - h_\lambda, E \rangle \le 0. Therefore:

\lambda^2 \|\hat{h} - h_\lambda\|^2 \le \| \nabla \hat{F}(h_\lambda) - \nabla F(h_\lambda) \|^2.

C  LATENT NOISE SAMPLING AND SMOOTHNESS OF KALE

C.1  LATENT SPACE SAMPLING

Here we prove Proposition 6, for which we make the assumptions more precise.

Assumption 1. We make the following assumptions: \log \eta is strongly concave and admits a Lipschitz gradient. There exists a non-negative constant L such that for any x, x' \in X and z, z' \in Z:

|E(x) - E(x')| \le L \|x - x'\|, \qquad \|\nabla_x E(x) - \nabla_x E(x')\| \le L \|x - x'\\|,
\|G(z) - G(z')\| \le L \|z - z'\|, \qquad \|\nabla_z G(z) - \nabla_z G(z')\| \le L \|z - z'\|.

Throughout this section, we write U(z) := -\log(\eta(z)) + E(G(z)) for simplicity.

Proof of Proposition 1. To sample from Q_{G,E}, we first need to identify the posterior latent distribution \nu_{G,E} used to produce those samples. We rely on (23), which holds by definition of Q_{G,E} for any test function h on X:

\int h(x) \, dQ(x) = \int h(G(z)) \, f(G(z)) \, \eta(z) \, dz.   (23)

Hence, the posterior latent distribution is given by \nu(z) = \eta(z) f(G(z)), and samples from the GEBM are produced by first sampling from \nu_{G,E}, then applying the implicit map G:

X \sim Q \iff X = G(Z), \quad Z \sim \nu.

Proof of Proposition 2. If the base distribution G admits a density on the whole space, denoted by \exp(-r(x)), and the energy E is of the form E(x) = \bar{E}(x) - r(x) for some parametric function \bar{E}, it is easy to see that Q has a density proportional to \exp(-\bar{E}) and is therefore equivalent to a standard EBM with energy \bar{E}. The converse holds as well, meaning that for any EBM with energy \bar{E}, it is possible to construct a GEBM using an importance-weighting strategy. This is achieved by first choosing a base G, which is required to have an explicit density \exp(-r) up to a normalizing constant, then defining the energy of the GEBM to be E(x) = \bar{E}(x) - r(x), so that

dQ(x) \propto \exp(-E(x)) \, dG_\theta(x) \propto \exp(-\bar{E}(x)) \, dx.   (24)

Equation (24) effectively depends only on \bar{E}(x) and not on G, since the factor \exp(r) exactly compensates for the density of G. The requirement that the base also admits a tractable implicit map G can be met by choosing G to be a normalizing flow (Rezende and Mohamed, 2015), and does not restrict the class of possible EBMs that can be expressed as GEBMs.
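To make this sampling procedure concrete, the following is a minimal sketch of an Euler discretization of the kinetic Langevin dynamics (14) targeting the posterior \nu(z) \propto \eta(z)\exp(-E(G(z))). It assumes a standard Gaussian latent prior \eta, and the step size, friction gamma, coefficient u and number of steps are illustrative choices, not the exact sampler used in the experiments.

```python
import math
import torch

def sample_gebm_latents(energy, generator, z0, n_steps=1000,
                        step=1e-2, gamma=1.0, u=1.0):
    """Euler discretization of the kinetic Langevin dynamics (14),
    targeting nu(z) ∝ eta(z) exp(-E(G(z))) over latents.
    Assumes eta = N(0, I); `energy` and `generator` stand for E and G."""
    z = z0.clone()              # e.g. z0 = torch.randn(64, latent_dim)
    v = torch.zeros_like(z)     # velocities
    for _ in range(n_steps):
        z_ = z.detach().requires_grad_(True)
        # U(z) = -log eta(z) + E(G(z)); for eta = N(0, I) the prior term
        # contributes 0.5 * ||z||^2 up to an additive constant.
        potential = 0.5 * (z_ ** 2).sum() + energy(generator(z_)).sum()
        grad_U = torch.autograd.grad(potential, z_)[0]
        noise = torch.randn_like(v)
        v = v - step * (gamma * v + u * grad_U) \
              + math.sqrt(2.0 * gamma * u * step) * noise
        z = z + step * v
    return generator(z).detach()   # GEBM samples: X = G(Z), Z ~ nu
```

Proposition 6, proved next, quantifies the convergence of the continuous-time dynamics underlying this scheme to the posterior \nu.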
Proof of Proposition 6. Let \pi_t be the probability distribution of (z_t, v_t) at time t of the diffusion in (14), which we recall:

dz_t = v_t \, dt, \qquad dv_t = -\big( \gamma v_t + u \nabla U(z_t) \big) \, dt + \sqrt{2 \gamma u} \, dB_t.

We call \pi_\infty its corresponding invariant distribution, given by

\pi_\infty(z, v) \propto \exp\Big( -\Big( U(z) + \frac{1}{2u} \|v\|^2 \Big) \Big).

By Lemma 13 we know that U is dissipative, bounded from below, and has a Lipschitz gradient. This allows us to directly apply (Eberle et al., 2017, Corollary 2.6), which implies that

W_2(\pi_t, \pi_\infty) \le C \exp(-tc),

where c is a positive constant and C depends only on \pi_\infty and the initial distribution \pi_0. Moreover, the constant c is given explicitly in (Eberle et al., 2017, Theorem 2.3) and is of order O(e^{-q}), where q is the dimension of the latent space Z. We now consider an optimal coupling \Pi_t between \pi_t and \pi_\infty. Given joint samples ((z_t, v_t), (z, v)) from \Pi_t, we consider the following samples in input space: (x_t, x) := (G(z_t), G(z)). Since z_t and z have marginals \pi_t and \pi_\infty, it is easy to see that x_t \sim P_t and x \sim Q. Therefore, by definition of the W_2 distance, we have the following bound:

W_2^2(P_t, Q) \le E\big[ \|x_t - x\|^2 \big]
 = \int \|G(z_t) - G(z)\|^2 \, d\Pi_t(z_t, z)
 \le L^2 \int \|z_t - z\|^2 \, d\Pi_t(z_t, z)
 \le L^2 W_2^2(\pi_t, \pi_\infty)
 \le C^2 L^2 \exp(-2tc).

The second line uses the definition of (x_t, x) as joint samples obtained by mapping (z_t, z); the third line uses the assumption that the base G is L-Lipschitz; the fourth uses the fact that \Pi_t is an optimal coupling between \pi_t and \pi_\infty; and the last line uses the convergence bound above.

Lemma 13. Under Assumption 1, there exist A > 0 and \lambda \in (0, \tfrac{1}{4}] such that

\frac{1}{2} z^\top \nabla U(z) \ge \lambda \Big( U(z) + \frac{\gamma^2}{4u} \|z\|^2 \Big) - A, \qquad \forall z \in Z,   (25)

where \gamma and u are the coefficients appearing in (14). Moreover, U is bounded below and has a Lipschitz gradient.

Proof. For simplicity, let us write u(z) = -\log \eta(z) and w(z) = E(G(z)) (not to be confused with the scalar coefficient u appearing in (14)), and denote by M an upper bound on the Lipschitz constants of w and \nabla w, which is guaranteed to be finite by assumption. Hence U(z) = u(z) + w(z). Equation (25) is equivalent to having

z^\top \nabla u(z) - 2\lambda u(z) - \frac{\lambda \gamma^2}{2u} \|z\|^2 \ge 2\lambda w(z) - z^\top \nabla w(z) - 2A.   (26)

Using that w is Lipschitz, we have w(z) \le w(0) + M\|z\| and z^\top \nabla w(z) \ge -M\|z\|. Hence,

2\lambda w(z) - z^\top \nabla w(z) - 2A \le 2\lambda w(0) + (2\lambda + 1) M \|z\| - 2A.

Therefore, a sufficient condition for (26) to hold is

z^\top \nabla u(z) - 2\lambda u(z) - \frac{\lambda \gamma^2}{2u} \|z\|^2 \ge (2\lambda + 1) M \|z\| - 2A + 2\lambda w(0).   (27)

We now rely on the strong convexity of u, which holds by assumption and implies the existence of a positive constant m > 0 such that

u(z) \le u(0) + z^\top \nabla u(z) - \frac{m}{2} \|z\|^2, \qquad z^\top \nabla u(z) \ge z^\top \nabla u(0) + m \|z\|^2.

This allows us to write the following inequality:

z^\top \nabla u(z) - 2\lambda u(z) - \frac{\lambda \gamma^2}{2u} \|z\|^2
 \ge (1 - 2\lambda)\, z^\top \nabla u(z) + \lambda \Big( m - \frac{\gamma^2}{2u} \Big) \|z\|^2 - 2\lambda u(0)
 \ge \Big( m - \lambda \Big( m + \frac{\gamma^2}{2u} \Big) \Big) \|z\|^2 - (1 - 2\lambda) \|\nabla u(0)\| \, \|z\| - 2\lambda u(0).

Combining the previous inequality with (27) and denoting M' = \|\nabla u(0)\|, it is sufficient to find A and \lambda satisfying

\Big( m - \lambda \Big( m + \frac{\gamma^2}{2u} \Big) \Big) \|z\|^2 - \big( M' + M + 2\lambda (M - M') \big) \|z\| - 2\lambda \big( u(0) + w(0) \big) + 2A \ge 0.

The l.h.s. of the above inequality is a quadratic function of \|z\| which admits a global minimum whenever \lambda \le m \big( m + \frac{\gamma^2}{2u} \big)^{-1}, and the global minimum is non-negative provided that A is large enough. To see that U is bounded below, it suffices to note, by Lipschitzness of w, that w(z) \ge w(0) - M\|z\|, and by strong convexity of u that u(z) \ge u(0) - M'\|z\| + \frac{m}{2}\|z\|^2. Hence, U is lower-bounded by a quadratic function of \|z\| with positive leading coefficient \frac{m}{2}, and must therefore be lower-bounded by a constant. Finally, by assumption, u and w have Lipschitz gradients, which directly implies that U has a Lipschitz gradient.

Proof of Proposition 3. By assumption, KL(P||G) < +\infty; this implies that P admits a density w.r.t. G, which we call r(x). As a result, P also admits a density w.r.t. Q, given by Z \exp(E^\star(x)) r(x). We can then compute KL(P||Q) explicitly:

KL(P||Q) = E_P[E^\star] + \log(Z) + E_P[\log(r)] = -L_{P,G}(E^\star) + KL(P||G).

Since 0 belongs to E and by optimality of E^\star, we know that L_{P,G}(E^\star) \ge L_{P,G}(0) = 0. The result then follows directly.

C.2  TOPOLOGICAL AND SMOOTHNESS PROPERTIES OF KALE

Topological properties of KALE. Denseness and smoothness of the energy class E are key to guaranteeing that KALE is a reliable criterion for measuring convergence. We thus make the following assumptions on E:

(A) For all E \in E, -E \in E, and there is C_E > 0 such that cE \in E for 0 \le c \le C_E. For any continuous function g, any compact set K in X and any precision \epsilon > 0, there exists a finite linear combination of energies G = \sum_{i=1}^r a_i E_i such that \sup_{x \in K} |g(x) - G(x)| \le \epsilon.

(B) All energies E in E are Lipschitz in their input, with the same Lipschitz constant L > 0.
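Assumption (B) corresponds to the kind of uniform Lipschitz control that is enforced in practice by regularizing the energy during training (spectral normalization or a gradient penalty, as discussed next). As an illustration only, a gradient-penalty term in the spirit of Gulrajani et al. (2017) could be added to the energy loss as sketched below; the interpolation scheme and penalty weight are illustrative assumptions rather than the exact choices of Section 6.

```python
import torch

def gradient_penalty(energy, x_data, x_gen, weight=10.0):
    """One common way to (approximately) enforce a Lipschitz constraint
    on the energy, as in Assumption (B): penalize deviations of the
    input-gradient norm from 1 at points interpolated between data and
    generated samples."""
    eps = torch.rand(x_data.size(0), *([1] * (x_data.dim() - 1)),
                     device=x_data.device)
    x_hat = (eps * x_data + (1.0 - eps) * x_gen).requires_grad_(True)
    e = energy(x_hat).sum()
    grad = torch.autograd.grad(e, x_hat, create_graph=True)[0]
    grad_norm = grad.flatten(start_dim=1).norm(dim=1)
    return weight * ((grad_norm - 1.0) ** 2).mean()

# During the energy update, this term is simply added to the negative
# generalized-likelihood loss before back-propagation.
```

Spectral normalization (Miyato et al., 2018) achieves a similar effect architecturally, by constraining the spectral norm of each layer of the energy network.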
Assumption (A) holds in particular when E contains feedforward networks with a given number of parameters; in fact, networks with a single neuron are enough, as shown in (Zhang et al., 2017, Theorem 2.3). Assumption (B) holds when additional regularization of the energy is enforced during training, by methods such as spectral normalization (Miyato et al., 2018) or a gradient penalty (Gulrajani et al., 2017), as done in Section 6. Proposition 4 states the topological properties of KALE ensuring that it can be used as a criterion for weak convergence. A proof is given in Appendix C.2.1 and is a consequence of (Zhang et al., 2017, Theorem B.1).

Proposition 14. Under Assumptions (A) and (B), it holds that:
1. KALE(P||G) \ge 0, with KALE(P||G) = 0 if and only if P = G.
2. KALE(P||G_n) \to 0 if and only if G_n \to P under the weak topology.

C.2.1  TOPOLOGICAL PROPERTIES OF KALE

In this section we prove Proposition 4. We first recall the required assumptions and make them more precise:

Assumption 2. Assume the following holds: The set X is compact. For all E \in E, -E \in E, and there is C_E > 0 such that cE \in E for 0 \le c \le C_E. For any continuous function g, any compact set K in X and any precision \epsilon > 0, there exists a finite linear combination of energies G = \sum_{i=1}^r a_i E_i such that |g(x) - G(x)| \le \epsilon on K. All energies E in E are Lipschitz in their input, with the same Lipschitz constant L > 0.

For simplicity we consider the set H = E + R, i.e., H is the set of functions h of the form h = E + c where E \in E and c \in R. In all that follows, P_1 is the set of probability distributions with finite first-order moment. We consider the notion of weak convergence on P_1 as defined in (Villani, 2009, Definition 6.8), which is equivalent to convergence in the Wasserstein-1 distance W_1.

Proof of Proposition 4. We proceed by proving the separation property (first statement), then the metrization of the weak topology (second statement).

Separation. We have by Assumption 2 that 0 \in E; hence, by definition, KALE(P||G) \ge F_{P,G}(0) = 0. On the other hand, whenever P = G, it holds that

F_{P,G}(h) = -\int \big( \exp(-h) + h - 1 \big) \, dP, \qquad \forall h \in H.

Moreover, by convexity of the exponential, we know that \exp(-x) + x - 1 \ge 0 for all x \in R. Hence, F_{P,G}(h) \le F_{P,G}(0) = 0 for all h \in H. This directly implies that KALE(P||G) = 0. For the converse, we use the same argument as in the proof of (Zhang et al., 2017, Theorem B.1). Assume that KALE(P||G) = 0 and let h be in H. By Assumption 2, there exists C_h > 0 such that ch \in H for 0 \le c \le C_h, and we have F(ch) \le KALE(P||G) = 0. Now dividing by c and taking the limit c \to 0, it is easy to see that

-\int h \, dP + \int h \, dG \le 0.

Again, by Assumption 2, we also know that -h \in H; hence

\int h \, dP - \int h \, dG \le 0.

This necessarily implies that \int h \, dP - \int h \, dG = 0 for all h \in H. By the density of H in the set of continuous functions on compact sets, we conclude that the equality holds for any continuous and bounded function, which in turn implies that P = G.

Metrization of the weak topology. We first show that for any P and G with finite first moment, it holds that KALE(P||G) \le L W_1(P, G), where W_1(P, G) is the Wasserstein-1 distance between P and G. For any h \in H, the following holds:

F(h) = -\int h \, dP - \int \exp(-h) \, dG + 1
 = \int h(x) \, dG(x) - \int h(x') \, dP(x') - \underbrace{\int \big( \exp(-h) + h - 1 \big) \, dG}_{\ge 0}
 \le \int h(x) \, dG(x) - \int h(x') \, dP(x')
 \le L W_1(P, G).

The first inequality results from the convexity of the exponential, while the last one is a consequence of h being L-Lipschitz. This allows us to conclude that KALE(P||G) \le L W_1(P, G) after taking the supremum over all h \in H.
Moreover, since W_1 metrizes weak convergence on P_1 (Villani, 2009, Theorem 6.9), whenever a sequence G_n converges weakly towards P in P_1 we also have W_1(P, G_n) \to 0, and thus KALE(P||G_n) \to 0. The converse is a direct consequence of (Liu et al., 2017, Theorem 10), since by assumption X is compact.

Well-defined learning. Assume that for any \epsilon > 0 and any h and h' in E there exists f in 2E such that \|h + h' - f\|_\infty \le \epsilon. Then there exists a constant C such that

KALE(P||Q) \le C \, KALE(P||G).

This means that the proposed learning procedure, which first finds the optimal energy E^\star given a base G by maximum likelihood and then minimizes KALE(P||G), ends up minimizing the distance between the data and the generalized energy-based model Q. Indeed,

KALE(P||Q) = \sup_{h \in E} L_{P, Q_{G, E^\star}}(h) = -KALE(P||G) + \sup_{h \in E} L_{P,G}(h + E^\star).

Let us choose \epsilon = \tfrac{1}{2} KALE(P||G), and for a given h \in E let f \in 2E be such that \|h + E^\star - f\|_\infty \le \epsilon. Since \tfrac{1}{2} f = \tfrac{1}{2}(h + E^\star) + \tfrac{1}{2}\big( f - (h + E^\star) \big), concavity of L_{P,G} gives

L_{P,G}(h + E^\star) \le 2 L_{P,G}\big( \tfrac{1}{2} f \big) - L_{P,G}\big( f - (h + E^\star) \big).

By assumption, \|h + E^\star - f\|_\infty \le \epsilon; thus |L_{P,G}(f - (h + E^\star))| \le 2\epsilon. Moreover, L_{P,G}(\tfrac{1}{2} f) \le KALE(P||G) since \tfrac{1}{2} f \in E. This ensures that

L_{P,G}(h + E^\star) \le 3 \, KALE(P||G).

Finally, we have shown that

KALE(P||Q) \le 2 \, KALE(P||G).

Hence, minimizing KALE(P||G) directly minimizes KALE(P||Q).

C.2.2  SMOOTHNESS PROPERTIES OF KALE

We will now prove Theorem 5. We begin by stating the assumptions used in this section:

(I) E is parametrized by a compact set of parameters \Psi.

(II) Functions in E are jointly continuous w.r.t. (\psi, x), and are L_e-Lipschitz and L_e-smooth w.r.t. the input x:

|E_\psi(x) - E_\psi(x')| \le L_e \|x - x'\|, \qquad \|\nabla_x E_\psi(x) - \nabla_x E_\psi(x')\| \le L_e \|x - x'\|.

(III) (\theta, z) \mapsto G_\theta(z) is jointly continuous in \theta and z, with z \mapsto G_\theta(z) uniformly Lipschitz w.r.t. z:

\|G_\theta(z) - G_\theta(z')\| \le L_b \|z - z'\|, \qquad \forall z, z' \in Z, \ \theta \in \Theta.

There exist non-negative functions a and b defined from Z to R such that \theta \mapsto G_\theta(z) is a-Lipschitz and b-smooth in the following sense:

\|G_\theta(z) - G_{\theta'}(z)\| \le a(z) \|\theta - \theta'\|, \qquad \|\nabla_\theta G_\theta(z) - \nabla_\theta G_{\theta'}(z)\| \le b(z) \|\theta - \theta'\|.

Moreover, a and b are integrable in the following sense:

\int a(z)^2 \exp(2 L_e L_b \|z\|) \, d\eta(z) < \infty, \qquad \int \exp(L_e L_b \|z\|) \, d\eta(z) < \infty, \qquad \int b(z) \exp(L_e L_b \|z\|) \, d\eta(z) < \infty.

To simplify notation, we denote by L_\theta(E) the generalized likelihood of the energy E under P with base G_\theta; in other words,

L_\theta(E) := L_{P, G_\theta}(E) = -\int E \, dP - \log \int \exp(-E) \, dG_\theta.

We also denote by p_{E,\theta} the density of the model w.r.t. G_\theta,

p_{E,\theta} = \frac{\exp(-E)}{Z_{G_\theta, E}}, \qquad Z_{G_\theta, E} = \int \exp(-E) \, dG_\theta.

We write K(\theta) := KALE(P||G_\theta) to emphasize the dependence on \theta.

Proof of Theorem 5. To show that sub-gradient methods converge to local optima, we only need to show that K is Lipschitz continuous and weakly convex; this directly implies convergence to local optima for sub-gradient methods, according to Davis and Drusvyatskiy (2018); Thekumparampil et al. (2019). Lipschitz continuity ensures that K is differentiable for almost all \theta \in \Theta, and weak convexity simply means that there exists some positive constant C \ge 0 such that \theta \mapsto K(\theta) + C\|\theta\|^2 is convex. We now proceed to show these two properties.

We first prove that \theta \mapsto K(\theta) is weakly convex. By Lemma 15, we know that for any E \in E, the function \theta \mapsto L_\theta(E) is M-smooth for the same positive constant M. This directly implies that it is also weakly convex, and the following inequality holds for \theta_t := t\theta + (1-t)\theta':

L_{\theta_t}(E) \le t L_\theta(E) + (1-t) L_{\theta'}(E) + \frac{M}{2} t (1-t) \|\theta - \theta'\|^2.

Taking the supremum w.r.t. E, it follows that

K(\theta_t) \le t K(\theta) + (1-t) K(\theta') + \frac{M}{2} t (1-t) \|\theta - \theta'\|^2.

This means precisely that K is weakly convex in \theta. To prove that K is Lipschitz, we will also use Lemma 15, which states that L_\theta(E) is Lipschitz in \theta uniformly in E.
Hence, the following holds:

L_\theta(E) \le L_{\theta'}(E) + L_e C \|\theta - \theta'\|.

Again, taking the supremum over E, it follows directly that K(\theta) \le K(\theta') + L_e C \|\theta - \theta'\|. We conclude that K is Lipschitz by exchanging the roles of \theta and \theta' to obtain the other side of the inequality. Hence, by Rademacher's theorem, K is differentiable for almost all \theta.

We now provide an expression for the gradient of K. By Lemma 16 we know that \psi \mapsto L_\theta(E_\psi) is continuous, and by Assumption (I), \Psi is compact. Therefore, the supremum \sup_{E \in E} L_\theta(E) is achieved for some function E^\star_\theta. Moreover, we know by Lemma 15 that L_\theta(E) is smooth uniformly in E; therefore the family (\theta \mapsto L_\theta(E))_{E \in E} is equi-differentiable. We are thus in a position to apply Milgrom and Segal (2002) (Theorem 3), which ensures that K(\theta) admits left and right directional derivatives given by

\partial_e^+ K(\theta) = \lim_{t \to 0,\, t > 0} \nabla_\theta L_\theta(E^\star_{\theta + te})^\top e, \qquad \partial_e^- K(\theta) = \lim_{t \to 0,\, t < 0} \nabla_\theta L_\theta(E^\star_{\theta + te})^\top e,   (28)

where e is a given direction in R^r. Moreover, the theorem also states that K(\theta) is differentiable iff t \mapsto E^\star_{\theta + te} is continuous at t = 0. Now, recalling that K(\theta) is differentiable for almost all \theta, it must hold that E^\star_{\theta + te} \to E^\star_\theta as t \to 0 and \partial_e^+ K(\theta) = \partial_e^- K(\theta) for almost all \theta. This implies that the two limits in (28) are in fact equal to \nabla_\theta L_\theta(E^\star_\theta)^\top e. The gradient of K, whenever defined, is therefore given by

\nabla_\theta K(\theta) = Z_{G_\theta, E^\star_\theta}^{-1} \int \nabla_x E^\star_\theta(G_\theta(z))^\top \nabla_\theta G_\theta(z) \exp\big( -E^\star_\theta(G_\theta(z)) \big) \, \eta(z) \, dz.

Lemma 15. Under Assumptions (I) to (III), the functional L_\theta(E) is Lipschitz and smooth in \theta, uniformly in E:

|L_\theta(E) - L_{\theta'}(E)| \le L_e C \|\theta - \theta'\|, \qquad \|\nabla_\theta L_\theta(E) - \nabla_\theta L_{\theta'}(E)\| \le 2 C L_e (1 + L_e) \|\theta - \theta'\|.

Proof. By Lemma 16, we have that L_\theta(E) is differentiable, with

\nabla_\theta L_\theta(E) = \int (\nabla_x E \circ G_\theta)^\top \nabla_\theta G_\theta \, (p_{E,\theta} \circ G_\theta) \, d\eta.

Lemma 16 also ensures that \|\nabla_\theta L_\theta(E)\| is bounded by L_e C for a positive constant C independent of E and \theta. This implies in particular that L_\theta(E) is Lipschitz in \theta with constant L_e C. We now show that it is also smooth. For this, we need to control the difference D := \|\nabla_\theta L_\theta(E) - \nabla_\theta L_{\theta'}(E)\|. By the triangle inequality,

D \le \underbrace{ \int \| \nabla_x E \circ G_\theta - \nabla_x E \circ G_{\theta'} \| \, \| \nabla_\theta G_\theta \| \, (p_{E,\theta} \circ G_\theta) \, d\eta }_{I}
 + \underbrace{ \int \| \nabla_x E \circ G_{\theta'} \| \, \| \nabla_\theta G_\theta - \nabla_\theta G_{\theta'} \| \, (p_{E,\theta} \circ G_\theta) \, d\eta }_{II}
 + \underbrace{ \int \| \nabla_x E \circ G_{\theta'} \| \, \| \nabla_\theta G_{\theta'} \| \, | p_{E,\theta} \circ G_\theta - p_{E,\theta'} \circ G_{\theta'} | \, d\eta }_{III}.

The first term can be upper-bounded using the L_e-smoothness of E and the fact that G_\theta is a-Lipschitz in \theta:

I \le L_e \|\theta - \theta'\| \int a^2 \, (p_{E,\theta} \circ G_\theta) \, d\eta \le L_e C \|\theta - \theta'\|,

where the last inequality is obtained by Lemma 17. Similarly, using that \nabla_\theta G_\theta is b-Lipschitz in \theta, it follows by Lemma 17 that

II \le L_e \|\theta - \theta'\| \int b \, (p_{E,\theta} \circ G_\theta) \, d\eta \le L_e C \|\theta - \theta'\|.

Finally, for the last term III, we first consider the path \theta_t = t\theta + (1-t)\theta' for t \in [0,1], and introduce the function s(t) := p_{E,\theta_t} \circ G_{\theta_t}. We will now control the difference p_{E,\theta} \circ G_\theta - p_{E,\theta'} \circ G_{\theta'}, which is equal to s(1) - s(0). Using the fact that t \mapsto s(t) is absolutely continuous, we have s(1) - s(0) = \int_0^1 s'(t) \, dt. The derivative s'(t) is given by

s'(t) = (\theta - \theta')^\top \big( \bar{M}_t - M_t \big) s(t), \qquad M_t = (\nabla_x E \circ G_{\theta_t})^\top \nabla_\theta G_{\theta_t}, \qquad \bar{M}_t = \int M_t \, (p_{E,\theta_t} \circ G_{\theta_t}) \, d\eta.

Hence,

s(1) - s(0) = (\theta - \theta')^\top \int_0^1 \big( \bar{M}_t - M_t \big) s(t) \, dt.

We also know that \|M_t\| is upper-bounded by L_e a(z), which implies

III \le L_e^2 \|\theta - \theta'\| \int_0^1 \Big( \int a(z)^2 s(t)(z) \, d\eta(z) + \Big( \int a(z) s(t)(z) \, d\eta(z) \Big)^2 \Big) dt \le L_e^2 (C + C^2) \|\theta - \theta'\|,

where the last inequality is obtained using Lemma 17. This allows us to conclude that L_\theta(E) is smooth, uniformly over E \in E and \theta \in \Theta.

Lemma 16. Under Assumptions (II) and (III), it holds that \psi \mapsto L_\theta(E_\psi) is continuous, and that \theta \mapsto L_\theta(E_\psi) is differentiable in \theta with gradient given by

\nabla_\theta L_\theta(E) = \int (\nabla_x E \circ G_\theta)^\top \nabla_\theta G_\theta \, (p_{E,\theta} \circ G_\theta) \, d\eta.

Moreover, the gradient is bounded uniformly in \theta and E:

\|\nabla_\theta L_\theta(E)\| \le L_e \Big( \int \exp(-L_e L_b \|z\|) \, d\eta(z) \Big)^{-1} \int a(z) \exp(L_e L_b \|z\|) \, d\eta(z).

Proof. To show that \psi \mapsto L_\theta(E_\psi) is continuous, we use the dominated convergence theorem.
We fix \psi_0 in the interior of \Psi and consider a compact neighborhood W of \psi_0. By assumption, (\psi, x) \mapsto E_\psi(x) and (\psi, z) \mapsto E_\psi(G_\theta(z)) are jointly continuous. Hence, |E_\psi(0)| and |E_\psi(G_\theta(0))| are bounded on W by some constant C. Moreover, by Lipschitz continuity of x \mapsto E_\psi, we have

|E_\psi(x)| \le |E_\psi(0)| + L_e \|x\| \le C + L_e \|x\|,
\exp(-E_\psi(G_\theta(z))) \le \exp(-E_\psi(G_\theta(0))) \exp(L_e L_b \|z\|) \le \exp(C) \exp(L_e L_b \|z\|).

Recalling that P admits a first-order moment and that, by Assumption (III), \exp(L_e L_b \|z\|) is integrable w.r.t. \eta, it follows by the dominated convergence theorem and by composition of continuous functions that \psi \mapsto L_\theta(E_\psi) is continuous at \psi_0.

To show that \theta \mapsto L_\theta(E_\psi) is differentiable in \theta, we use the differentiation lemma in (Klenke, 2008, Theorem 6.28). We first fix \theta_0 in the interior of \Theta and consider a compact neighborhood V of \theta_0. Since \theta \mapsto |E(G_\theta(0))| is continuous on the compact neighborhood V, it admits a maximum value C; hence, using Assumptions (II) and (III),

\exp(-E(G_\theta(z))) \le \exp(-E(G_\theta(0))) \exp(L_e L_b \|z\|) \le \exp(C) \exp(L_e L_b \|z\|).

Along with the integrability condition in Assumption (III), this ensures that z \mapsto \exp(-E(G_\theta(z))) is integrable w.r.t. \eta for all \theta in V. We also have that \exp(-E(G_\theta(z))) is differentiable, with gradient given by

\nabla_\theta \exp(-E(G_\theta(z))) = -\nabla_x E(G_\theta(z))^\top \nabla_\theta G_\theta(z) \exp(-E(G_\theta(z))).

Using that E is Lipschitz in its inputs and that G_\theta(z) is Lipschitz in \theta, and combining with the previous inequality, it follows that

\|\nabla_\theta \exp(-E(G_\theta(z)))\| \le \exp(C) \, L_e \, a(z) \exp(L_e L_b \|z\|),

where a(z) is the location-dependent Lipschitz constant introduced in Assumption (III). The r.h.s. of the above inequality is integrable by Assumption (III) and is independent of \theta on the neighborhood V. Thus (Klenke, 2008, Theorem 6.28) applies, and it follows that

\nabla_\theta \int \exp(-E(G_{\theta_0}(z))) \, d\eta(z) = -\int \nabla_x E(G_{\theta_0}(z))^\top \nabla_\theta G_{\theta_0}(z) \exp(-E(G_{\theta_0}(z))) \, d\eta(z).

We can now directly compute the gradient of L_\theta(E):

\nabla_\theta L_\theta(E) = \Big( \int \exp(-E(G_{\theta_0})) \, d\eta \Big)^{-1} \int \nabla_x E(G_{\theta_0})^\top \nabla_\theta G_{\theta_0} \exp(-E(G_{\theta_0})) \, d\eta.

Since E and G_\theta are Lipschitz in x and \theta respectively, it follows that \|\nabla_x E(G_{\theta_0}(z))\| \le L_e and \|\nabla_\theta G_{\theta_0}(z)\| \le a(z). Hence, we have

\|\nabla_\theta L_\theta(E)\| \le L_e \int a(z) \, (p_{E,\theta} \circ G_\theta(z)) \, d\eta(z).

Finally, Lemma 17 allows us to conclude that \nabla_\theta L_\theta(E) is bounded by a positive constant independent of \theta and E.

Lemma 17. Under Assumptions (II) and (III), there exists a constant C independent of \theta and E such that

\int a^i(z) \, (p_{E,\theta} \circ G_\theta(z)) \, d\eta(z) \le C, \qquad i \in \{1, 2\}.