# Generalized Energy Based Models

Published as a conference paper at ICLR 2021

Michael Arbel, Liang Zhou & Arthur Gretton
Gatsby Computational Neuroscience Unit, University College London

We introduce the Generalized Energy Based Model (GEBM) for generative modelling. These models combine two trained components: a base distribution (generally an implicit model), which can learn the support of data with low intrinsic dimension in a high dimensional space; and an energy function, to refine the probability mass on the learned support. Both the energy function and base jointly constitute the final model, unlike GANs, which retain only the base distribution (the "generator"). GEBMs are trained by alternating between learning the energy and the base. We show that both training stages are well-defined: the energy is learned by maximising a generalized likelihood, and the resulting energy-based loss provides informative gradients for learning the base. Samples from the posterior on the latent space of the trained model can be obtained via MCMC, thus finding regions in this space that produce better quality samples. Empirically, the GEBM samples on image-generation tasks are of much better quality than those from the learned generator alone, indicating that all else being equal, the GEBM will outperform a GAN of the same complexity. When using normalizing flows as base measures, GEBMs succeed on density modelling tasks, returning comparable performance to direct maximum likelihood of the same networks.

## 1 Introduction

Energy-based models (EBMs) have a long history in physics, statistics and machine learning (LeCun et al., 2006). They belong to the class of explicit models, and can be described by a family of energies E which define probability distributions with density proportional to exp(-E). Those models are often known up to a normalizing constant Z(E), also called the partition function. The learning task consists of finding an optimal function that best describes a given system or target distribution P. This can be achieved using maximum likelihood estimation (MLE); however, the intractability of the normalizing partition function makes this learning task challenging. Thus, various methods have been proposed to address this (Hinton, 2002; Hyvärinen, 2005; Gutmann and Hyvärinen, 2012; Dai et al., 2019a;b). All these methods estimate EBMs that are supported over the whole space. In many applications, however, P is believed to be supported on an unknown lower dimensional manifold. This happens in particular when there are strong dependencies between variables in the data (Thiry et al., 2021), and suggests incorporating a low-dimensionality hypothesis in the model.

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are a particular way to enforce low dimensional structure in a model. They rely on an implicit model, the generator, to produce samples supported on a low-dimensional manifold by mapping a pre-defined latent noise to the sample space using a trained function. GANs have been very successful in generating high-quality samples on various tasks, especially for unsupervised image generation (Brock et al., 2018). The generator is trained adversarially against a discriminator network whose goal is to distinguish samples produced by the generator from the target data.
This has inspired further research to extend the training procedure to more general losses (Nowozin et al., 2016; Arjovsky et al., 2017; Li et al., 2017; Bińkowski et al., 2018; Arbel et al., 2018) and to improve its stability (Miyato et al., 2018; Gulrajani et al., 2017; Nagarajan and Kolter, 2017; Kodali et al., 2017). While the generator of a GAN has effectively a low-dimensional support, it remains challenging to refine the distribution of mass on that support using pre-defined latent noise. For instance, as shown by Cornish et al. (2020) for normalizing flows, when the latent distribution is unimodal and the target distribution possesses multiple disconnected low-dimensional components, the generator, as a continuous map, compensates for this mismatch using steeper slopes. In practice, this implies the need for more complicated generators.

(Correspondence: michael.n.arbel@gmail.com.)

In the present work, we propose a new class of models, called Generalized Energy Based Models (GEBMs), which can represent distributions supported on low-dimensional manifolds, while offering more flexibility in refining the mass on those manifolds. GEBMs combine the strength of both implicit and explicit models in two separate components: a base distribution (often chosen to be an implicit model) which learns the low-dimensional support of the data, and an energy function that can refine the probability mass on that learned support. We propose to train the GEBM by alternating between learning the energy and the base, analogous to f-GAN training (Goodfellow et al., 2014; Nowozin et al., 2016). The energy is learned by maximizing a generalized notion of likelihood which we relate to the Donsker-Varadhan lower bound (Donsker and Varadhan, 1975) and Fenchel duality, as in (Nguyen et al., 2010; Nowozin et al., 2016). Although the partition function is intractable in general, we propose a method to learn it in an amortized fashion without introducing additional surrogate models, as done in variational inference (Kingma and Welling, 2014; Rezende et al., 2014) or by Dai et al. (2019a;b). The resulting maximum likelihood estimate, the KL Approximate Lower-bound Estimate (KALE), is then used as a loss for training the base. When the class of energies is rich and smooth enough, we show that KALE leads to a meaningful criterion for measuring weak convergence of probabilities. Following recent work by Chu et al. (2020); Sanjabi et al. (2018), we show that KALE possesses well-defined gradients w.r.t. the parameters of the base, ensuring well-behaved training. We also provide convergence rates for the empirical estimator of KALE when the variational family is sufficiently well behaved, which may be of independent interest.

The main advantage of GEBMs becomes clear when sampling from these models: the posterior over the latents of the base distribution incorporates the learned energy, putting greater mass on regions in this latent space that lead to better quality samples. Sampling from the GEBM can thus be achieved by first sampling from the posterior distribution of the latents via MCMC in the low-dimensional latent space, then mapping those latents to the input space using the implicit map of the base. This is in contrast to standard GANs, where the latents of the base have a fixed distribution. We focus on a class of samplers that exploit gradient information, and show that these samplers enjoy fast convergence properties by leveraging the recent work of Eberle et al.
(2017). While there has been recent interest in using the discriminator to improve the quality of the generator during sampling (Azadi et al., 2019; Turner et al., 2019; Neklyudov et al., 2019; Grover et al., 2019; Tanaka, 2019; Wu et al., 2019b), our approach emerges naturally from the model we consider.

We begin in Section 2 by introducing the GEBM model. In Section 3, we describe the learning procedure using KALE, then derive a method for sampling from the learned model in Section 4. In Section 5 we discuss related work. Finally, experimental results are presented in Section 6, with code available at https://github.com/MichaelArbel/GeneralizedEBM.

## 2 Generalized Energy-Based Models

Figure 1: Data generating distribution supported on a line and with higher density at the extremities. Models are learned using either a GAN, GEBM, or EBM. More details are provided in Appendix G.3.

In this section, we introduce generalized energy based models (GEBM), that combine the strengths of both energy-based models and implicit generative models, and admit the first of these as a special case.

An energy-based model (EBM) is defined by a set E of real-valued functions called energies, where each E ∈ E specifies a probability density over the data space X ⊆ R^d up to a normalizing constant,

$$Q(dx) = \exp(-E(x) - A)\,dx, \qquad A = \log\left(\int \exp(-E(x))\,dx\right). \qquad (1)$$

While EBMs have been shown recently to be powerful models for representing complex high dimensional data distributions, they still unavoidably lead to a blurred model whenever data are concentrated on a lower-dimensional manifold. This is the case in Figure 1(a), where the ground truth distribution is supported on a 1-D line and embedded in a 2-D space. The EBM in Figure 1(d) learns to give higher density to a halo surrounding the data, and thus provides a blurred representation. That is a consequence of the EBM having a density defined over the whole space, and can result in blurred samples for image models.

An implicit generative model (IGM) is a family of probability distributions G_θ parametrized by a learnable generator function G : Z → X that maps latent samples z from a fixed latent distribution η to the data space X. The latent distribution η is required to have a density over the latent space Z and is often easy to sample from. Thus, sampling from G is simply achieved by first sampling z from η, then applying G:

$$x \sim G \iff x = G(z), \quad z \sim \eta. \qquad (2)$$

GANs are popular instances of these models, and are trained adversarially (Goodfellow et al., 2014). When the latent space Z has a smaller dimension than the input space X, the IGM will be supported on a lower dimensional manifold of X, and thus will not possess a Lebesgue density on X (Bottou et al., 2017). IGMs are therefore good candidates for modelling low dimensional distributions. While GANs can accurately learn the low-dimensional support of the data, they can have limited power for representing the distribution of mass on the support. This is illustrated in Figure 1(b).

A generalized energy-based model (GEBM) Q is defined by a combination of a base G and an energy E defined over a subset X of R^d. The base component can typically be chosen to be an IGM as in (2). The generalized energy component can refine the mass on the support defined by the base.
It belongs to a class E of real-valued functions defined on the input space X, and represents the negative log-density of a sample from the GEBM with respect to the base G,

$$Q(dx) = \exp(-E(x) - A_{G,E})\,G(dx), \qquad A_{G,E} = \log\left(\int \exp(-E(x))\,G(dx)\right), \qquad (3)$$

where A_{G,E} is the logarithm of the normalizing constant of the model w.r.t. G. Thus, a GEBM Q re-weights samples from the base according to the un-normalized importance weights exp(-E(x)). Using the latent structure of the base G, this importance weight can be pulled back to the latent space to define a posterior latent distribution ν,

$$\nu(z) := \eta(z)\exp(-E(G(z)) - A_{G,E}). \qquad (4)$$

Hence, the posterior latent ν can be used instead of the latent noise η for sampling from Q, as summarized by Proposition 1:

Proposition 1. Sampling from Q requires sampling a latent z from ν in (4), then applying the map G:

$$x \sim Q \iff x = G(z), \quad z \sim \nu. \qquad (5)$$

Proposition 1 does not require the generator G to be invertible in order to hold. We provide a proof in Appendix C.1 which relies on a characterization of probability distributions using generalized moments. We will see later in Section 4 how equation (5) can be used to provide practical sampling algorithms for the GEBM. Next we discuss the advantages of GEBMs.

**Advantages of Generalized Energy Based Models.** The GEBM defined by (3) can be related to exponential tilting (re-weighting) (Siegmund, 1976; Xie et al., 2016) of the base G. The important difference over classical EBMs is that the base G is allowed to change its support and shape in space. By learning the base G, GEBMs can accurately learn the low-dimensional support of data, just like IGMs do. They also benefit from the flexibility of EBMs for representing densities, using an energy E to refine the distribution of mass on the support defined by G, as seen in Figure 1(c). Compared to EBMs, which put mass on the whole space by construction (positive density), GEBMs have the additional flexibility to concentrate the probability mass on a low-dimensional support learned by the base G, provided that the dimension of the latent space Z is smaller than the dimension of the ambient space X: see Figure 1(c) vs Figure 1(d). In the particular case when the dimension of Z is equal to the ambient dimension and G is invertible, the base G becomes supported over the whole space X, and GEBMs recover the usual EBMs. The next proposition further shows that any EBM can be viewed as a particular case of GEBMs, as proved in Appendix C.1.

Proposition 2. Any EBM with energy E (as in (1)) can be expressed as a GEBM with base G given as a normalizing flow with density exp(-r(x)) and a generalized energy equal to E(x) - r(x). In this particular case, the dimension of the latent is necessarily equal to the data dimension, i.e. dim(Z) = dim(X).

Compared to IGMs, which rely on a fixed pre-determined latent noise distribution η, GEBMs offer the additional flexibility of learning a richer latent noise distribution. This is particularly useful when the data is multimodal. In IGMs, such as GANs, the latent noise η is usually unimodal, thus requiring a more sophisticated generator to distort a unimodal noise distribution into a distribution with multiple modes, as shown by Cornish et al. (2020). Instead, GEBMs allow sampling from a posterior ν over the latent noise, defined in (4). This posterior noise can be multimodal in latent space (by incorporating information from the energy), and thus can put more or less mass in specific regions of the manifold defined by the base G.
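To make (3)-(5) concrete, the following is a minimal PyTorch sketch (our illustration, not the authors' released code) of the unnormalized posterior latent log-density (4) and of the importance re-weighting of base samples implied by (3). It assumes the base generator G and the energy E are available as modules and that η is a standard Gaussian; all module names are hypothetical stand-ins.

```python
import torch

def log_posterior_unnorm(z, generator, energy, latent_prior):
    """Unnormalized log nu(z) = log eta(z) - E(G(z)); the constant -A_{G,E} of (4) is dropped."""
    return latent_prior.log_prob(z) - energy(generator(z)).squeeze(-1)

def reweight_base_samples(generator, energy, latent_prior, n=1000):
    """Draw x = G(z), z ~ eta, and form self-normalized importance weights
    proportional to exp(-E(x)), following the re-weighting interpretation of (3)."""
    z = latent_prior.sample((n,))
    with torch.no_grad():
        x = generator(z)
        w = torch.softmax(-energy(x).squeeze(-1), dim=0)
    return x, w

# Toy usage with hypothetical stand-ins for the base G and energy E:
latent_dim, data_dim = 2, 2
generator = torch.nn.Linear(latent_dim, data_dim)
energy = torch.nn.Sequential(torch.nn.Linear(data_dim, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
prior = torch.distributions.MultivariateNormal(torch.zeros(latent_dim), torch.eye(latent_dim))
x, w = reweight_base_samples(generator, energy, prior)
lp = log_posterior_unnorm(prior.sample((5,)), generator, energy, prior)
```

As discussed in Section 4, self-normalized importance weights of this kind are unreliable in high dimensions, which is why sampling from ν is instead performed with MCMC on the unnormalized density above.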
Sampling from this posterior allows GEBMs to capture multimodality in the data, provided the support of the base is broad enough to subsume the data support (Figure 1(c)). The base can be simpler, compared to GANs, as it doesn't need to distort the input noise too much to produce multimodal samples (see Figure 8 in Appendix G.4). This additional flexibility comes at no additional training cost compared to GANs. Indeed, GANs still require another model during training, the discriminator network, but do not use it for sampling. GEBMs avoid this waste, since the base and energy can be trained jointly, with no other additional model, and then both are used for sampling.

## 3 Learning GEBMs

In this section we describe a general procedure for learning GEBMs. We decompose the learning procedure into two steps: an energy learning step and a base learning step. The overall learning procedure alternates between these two steps, as done in GAN training (Goodfellow et al., 2014).

### 3.1 Energy Learning

When the base G is fixed, varying the energy E leads to a family of models that all admit a density exp(-E - A_{G,E}) w.r.t. G. When the base G admits a density exp(-r) defined over the whole space, it is possible to learn the energy E by maximizing the likelihood of the model, $-\int (E+r)\,dP - A_{G,E}$. However, in general G is supported on a lower-dimensional manifold, so that r is ill-defined and the usual notion of likelihood cannot be used. Instead, we introduce a generalized notion of likelihood which does not require a well-defined density exp(-r) for G:

Definition 1 (Generalized Likelihood). The expected G-log-likelihood under a target distribution P of a GEBM model Q with base G and energy E is defined as

$$\mathcal{L}_{P,G}(E) := -\int E(x)\,dP(x) - A_{G,E}. \qquad (6)$$

To provide intuition about the generalized likelihood in Definition 1, we start by discussing the particular case where KL(P||G) < +∞. We then present the training method in the general case where P and G might not share the same support, i.e. KL(P||G) = +∞.

**Special case of finite KL(P||G).** When the Kullback-Leibler divergence between P and G is well defined, (6) corresponds to the Donsker-Varadhan (DV) lower bound on the KL (Donsker and Varadhan, 1975), meaning that KL(P||G) ≥ $\mathcal{L}_{P,G}(E)$ for all E. Moreover, the following proposition holds:

Proposition 3. Assume that KL(P||G) < +∞ and 0 ∈ E. If, in addition, E maximizes (6), then

$$\mathrm{KL}(P\|Q) \leq \mathrm{KL}(P\|G). \qquad (7)$$

In addition, we have that KL(P||Q) = 0 when E is the negative log-density ratio of P w.r.t. G.

We refer to Appendix C.1 for a proof. According to (7), the GEBM systematically improves over the IGM defined by G, with no further improvement possible in the limit case when G = P. Hence, as long as there is an error in mass on the common support of P and G, the GEBM improves over the base G.

**Estimating the likelihood in the general setting.** Definition 1 can be used to learn a maximum likelihood energy E by maximizing $\mathcal{L}_{P,G}(E)$ w.r.t. E, even when KL(P||G) is infinite and when P and G don't necessarily share the same support. Such an optimal solution is well defined whenever the set of energies is suitably constrained. This is the case if the energies are parametrized by a compact set Ψ with ψ ↦ E_ψ continuous over Ψ. Estimating the likelihood is then achieved using i.i.d.
samples (X_n)_{1:N}, (Y_m)_{1:M} from P and G (Tsuboi et al., 2009; Sugiyama et al., 2012; Liu et al., 2017):

$$\hat{\mathcal{L}}_{P,G}(E) = -\frac{1}{N}\sum_{n=1}^{N} E(X_n) - \log\left(\frac{1}{M}\sum_{m=1}^{M}\exp(-E(Y_m))\right). \qquad (8)$$

In the context of mini-batch stochastic gradient methods, however, M typically ranges from 10 to 1000, which can lead to a poor estimate of the log-partition function $A_{G,E}$. Moreover, (8) doesn't exploit estimates of $A_{G,E}$ from previous gradient iterations. Instead, we propose an estimator which introduces a variational parameter A ∈ R meant to estimate $A_{G,E}$ in an amortized fashion. The key idea is to exploit the convexity of the exponential, which directly implies $-A_{G,E} \geq -A - \exp(-A + A_{G,E}) + 1$ for any A ∈ R, with equality only when $A = A_{G,E}$. Therefore, (6) admits a lower bound of the form

$$\mathcal{L}_{P,G}(E) \;\geq\; -\int (E+A)\,dP - \int \exp(-(E+A))\,dG + 1 \;:=\; \mathcal{F}_{P,G}(E+A),$$

where we introduced the functional $\mathcal{F}_{P,G}$ for concision. Maximizing $\mathcal{F}_{P,G}(E+A)$ over A recovers the likelihood $\mathcal{L}_{P,G}(E)$. Moreover, jointly maximizing over E and A yields the maximum likelihood energy E* and its corresponding log-partition function $A^\star = A_{G,E^\star}$. This optimization is well-suited for stochastic gradient methods using the following estimator (Kanamori et al., 2011):

$$\hat{\mathcal{F}}_{P,G}(E+A) = -\frac{1}{N}\sum_{n=1}^{N}\big(E(X_n)+A\big) - \frac{1}{M}\sum_{m=1}^{M}\exp\big(-(E(Y_m)+A)\big) + 1. \qquad (9)$$

### 3.2 Base Learning

Unlike in Section 3.1, varying the base G does not need to preserve the same support. Thus, it is generally not possible to use maximum likelihood methods for learning G. Instead, we propose to use the generalized likelihood (6) evaluated at the optimal energy E* as a meaningful loss for learning G, and refer to it as the KL Approximate Lower-bound Estimate (KALE),

$$\mathrm{KALE}(P\|G) = \sup_{(E,A)\in\mathcal{E}\times\mathbb{R}} \mathcal{F}_{P,G}(E+A). \qquad (10)$$

From Section 3.1, KALE(P||G) is always a lower bound on KL(P||G). The bound becomes tight whenever the negative log-density of P w.r.t. G is well-defined and belongs to E (Appendix A). Moreover, Proposition 4 shows that KALE is a reliable criterion for measuring convergence; it is a consequence of (Zhang et al., 2017, Theorem B.1), with a proof in Appendix C.2.1:

Proposition 4. Assume all energies in E are L-Lipschitz and that any continuous function can be well approximated by linear combinations of energies in E (Assumptions (A) and (B) of Appendix C.2). Then KALE(P||G) ≥ 0, with equality only if P = G, and KALE(P||G_n) → 0 if and only if G_n → P in distribution.

The universal approximation assumption holds in particular when E contains feedforward networks. In fact, networks with a single neuron are enough, as shown in (Zhang et al., 2017, Theorem 2.3). The Lipschitz assumption holds when additional regularization of the energy is enforced during training, by methods such as spectral normalization (Miyato et al., 2018) or an additional regularizer I(ψ) on the energy E_ψ such as the gradient penalty (Gulrajani et al., 2017), as done in Section 6.

**Estimating KALE.** According to Arora et al. (2017), accurate finite sample estimates of divergences that result from an optimization procedure (such as in (10)) depend on the richness of the class E, and richer energy classes can result in slower convergence. Unlike divergences such as Jensen-Shannon, KL and the Wasserstein distance, which result from optimizing over a non-parametric and rich class of functions, KALE is restricted to a class of parametric energies E_ψ. Thus, (Arora et al., 2017, Theorem 3.1) applies, and guarantees good finite sample estimates, provided optimization is solved accurately.
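As a concrete illustration of (9), here is a minimal PyTorch sketch (ours, with hypothetical module names, not the paper's released implementation) of the amortized objective, maximized jointly over the energy parameters ψ and the scalar A:

```python
import torch

def kale_objective(energy, A, x_data, y_gen):
    """Amortized estimator (9) of F_{P,G}(E + A); maximize over (energy parameters, A).

    x_data: samples from P; y_gen: samples from the base G (treated as fixed data here).
    A is a scalar parameter estimating the log-partition A_{G,E} in an amortized fashion.
    """
    e_data = energy(x_data).squeeze(-1)
    e_gen = energy(y_gen).squeeze(-1)
    return -(e_data + A).mean() - torch.exp(-(e_gen + A)).mean() + 1.0

# Hypothetical usage: gradient ascent on (psi, A) for a fixed base.
data_dim = 2
energy = torch.nn.Sequential(torch.nn.Linear(data_dim, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
A = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam(list(energy.parameters()) + [A], lr=1e-3)

x_data = torch.randn(128, data_dim) + 2.0   # stand-in for real data
y_gen = torch.randn(128, data_dim)          # stand-in for base samples G(z)
for _ in range(100):
    loss = -kale_objective(energy, A, x_data, y_gen)   # minimize the negative objective
    opt.zero_grad(); loss.backward(); opt.step()
```

In the paper's setup, an additional regularizer I(ψ), such as a gradient penalty, and spectral normalization of the energy network are used alongside this objective to enforce the Lipschitz assumption of Proposition 4 (Sections 3.2 and 6).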
In Appendix B, we provide an analysis for the more general case where energies are not necessarily parametric but satisfy some further smoothness properties; we emphasize that our rates do not require the strong assumption that the density ratio is bounded above and below, as in (Nguyen et al., 2010).

**Smoothness of KALE.** Learning the base is achieved by minimizing K(θ) := KALE(P||G_θ) over the set of parameters Θ of the generator G_θ using first order methods (Duchi et al., 2011; Kingma and Ba, 2014; Arbel et al., 2019). This requires K(θ) to be smooth enough so that gradient methods converge to local minima and avoid instabilities during training (Chu et al., 2020). Ensuring smoothness of losses that result from an optimization procedure, as in (10), can be challenging. Results for the regularized Wasserstein are provided by Sanjabi et al. (2018), while more general losses are considered by Chu et al. (2020), albeit under stronger conditions than for our setting. Theorem 5 shows that when E, G_θ and their gradients are all Lipschitz, then K(θ) is smooth enough. We provide a proof for Theorem 5 in Appendix C.2.1.

Theorem 5. Under Assumptions (I) to (III) of Appendix C.2, sub-gradient methods on K converge to local optima. Moreover, K is Lipschitz and differentiable for almost all θ ∈ Θ, with

$$\nabla K(\theta) = \exp(-A_{G_\theta,E^\star}) \int \nabla_x E^\star(G_\theta(z))^\top \nabla_\theta G_\theta(z)\, \exp(-E^\star(G_\theta(z)))\, \eta(z)\,dz. \qquad (11)$$

Estimating the gradient in (11) is achieved by first optimizing over E_ψ and A using (9), with additional regularization I(ψ). The resulting estimators $\hat{E}^\star$ and $\hat{A}^\star$ are plugged into (12) to estimate ∇K(θ) using samples (Z_m)_{1:M} from η. Unlike for learning the energy E*, which benefits from using the amortized estimator of the log-partition function, we found that using the empirical log-partition for learning the base was more stable. We summarize the training procedure in Algorithm 1, which alternates between learning the energy and the base in a similar fashion to adversarial training.

Algorithm 1: Training GEBM
1: Input: P, N, M, $n_b$, $n_e$
2: Output: trained generator $G_\theta$ and energy $E_\psi$
3: Initialize θ, ψ and A
4: for $k = 1,\dots,n_b$ do
5: for $j = 1,\dots,n_e$ do
6: Sample $\{X_n\}_{1:N} \sim P$ and $\{Y_m\}_{1:M} \sim G_\theta$
7: $g_\psi \leftarrow \nabla_\psi\big({-\hat{\mathcal{F}}_{P,G_\theta}(E_\psi + A)} + I(\psi)\big)$
8: $\bar{A} \leftarrow \log\big(\tfrac{1}{M}\sum_{m=1}^{M}\exp(-E_\psi(Y_m))\big)$
9: $g_A \leftarrow \exp(A - \bar{A}) - 1$
10: Update ψ and A using $g_\psi$ and $g_A$
11: end for
12: Set $\hat{E}^\star \leftarrow E_\psi$ and $\hat{A}^\star \leftarrow \bar{A}$
13: Update θ using the estimate of ∇K(θ) from (12)
14: end for

$$\widehat{\nabla K}(\theta) = \frac{\exp(-\hat{A}^\star)}{M} \sum_{m=1}^{M} \nabla_x \hat{E}^\star(G_\theta(Z_m))^\top \nabla_\theta G_\theta(Z_m)\, \exp(-\hat{E}^\star(G_\theta(Z_m))). \qquad (12)$$

## 4 Sampling from GEBMs

A simple estimate of the distribution of observations under the GEBM can be obtained via importance sampling (IS). This consists of first sampling multiple points from the base G, and then re-weighting the samples according to the energy E. Although straightforward, this approach can lead to highly unreliable estimates, a well known problem in the Sequential Monte Carlo (SMC) literature, which employs IS extensively (Doucet et al., 2001; Del Moral et al., 2006). Other methods such as rejection sampling are known to be inefficient in high dimensions (Haugh, 2017). Instead, we propose to sample from the posterior ν using MCMC. Recall from (5) that a sample x from Q is of the form x = G(z), with z sampled from the posterior latent ν of (4) instead of the prior η. While sampling from η is often straightforward (for instance if η is a Gaussian), sampling from ν is generally harder, due to the dependence of its density on the complex functions E and G. It is still possible to use MCMC methods to sample from ν, however, since we have access to its density up to a normalizing constant (4).
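MCMC on the latent space only needs the gradient of this unnormalized log-density. As a minimal sketch of the simplest such sampler, the unadjusted Langevin (ULA) update introduced below can be written as follows (our illustration, not the paper's Algorithm 2; it assumes the trained base and energy are PyTorch modules and that η is a standard Gaussian):

```python
import torch

def ula_sample(generator, energy, z0, step_size=1e-4, n_steps=1000):
    """Unadjusted Langevin sampling of the posterior latent nu (eq. 4), assuming eta = N(0, I).
    Returns approximate GEBM samples x = G(z)."""
    z = z0.detach().clone()
    for _ in range(n_steps):
        z.requires_grad_(True)
        # log nu(z) up to a constant: log eta(z) - E(G(z)), with log eta(z) = -||z||^2 / 2 + const
        log_nu = -0.5 * (z ** 2).sum(dim=-1) - energy(generator(z)).squeeze(-1)
        grad = torch.autograd.grad(log_nu.sum(), z)[0]
        with torch.no_grad():
            z = z + step_size * grad + (2 * step_size) ** 0.5 * torch.randn_like(z)
    return generator(z).detach()
```

In the paper's experiments, the energy is additionally rescaled by an inverse temperature β and the step size is decayed (Section 6); the kinetic variant (KLA), described next, adds a momentum variable.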
More precisely, we are interested in methods that exploit the gradient of ν, and consider two classes of samplers: Overdamped samplers and Kinetic samplers. Overdamped samplers are obtained as a time-discretization of the Overdamped Langevin dynamics,

$$dz_t = \big(\nabla_z \log\eta(z_t) - \nabla_z E(G(z_t))\big)\,dt + \sqrt{2}\,dw_t, \qquad (13)$$

where w_t is a standard Brownian motion. The simplest sampler arising from (13) is the Unadjusted Langevin Algorithm (ULA):

$$Z_{k+1} = Z_k + \lambda\big(\nabla_z\log\eta(Z_k) - \nabla_z E(G(Z_k))\big) + \sqrt{2\lambda}\,W_{k+1}, \qquad Z_0 \sim \eta,$$

where $(W_k)_{k\geq 0}$ are i.i.d. standard Gaussians and λ is the step-size. For large k, $Z_k$ is an approximate sample from ν (Raginsky et al., 2017, Proposition 3.3). Hence, setting $X = G(Z_k)$ for a large enough k provides an approximate sample from the GEBM Q, as summarized in Algorithm 2 of Appendix F.

Kinetic samplers arise from the Kinetic Langevin dynamics, which introduce a momentum variable:

$$dz_t = v_t\,dt, \qquad dv_t = -\gamma v_t\,dt + u\big(\nabla\log\eta(z_t) - \nabla E(G(z_t))\big)\,dt + \sqrt{2\gamma u}\,dw_t, \qquad (14)$$

with friction coefficient γ > 0, inverse mass u > 0, momentum vector $v_t$ and standard Brownian motion $w_t$. When the mass $u^{-1}$ becomes negligible compared to the friction coefficient γ, i.e. $u\gamma^{-2} \to 0$, standard results show that (14) recovers the Overdamped dynamics (13). Discretization in time of (14) leads to Kinetic samplers similar to Hamiltonian Monte Carlo (Cheng et al., 2017; Sachs et al., 2017). We consider a particular algorithm from Sachs et al. (2017), which we call the Kinetic Langevin Algorithm (KLA) (see Algorithm 3 in Appendix F). Kinetic samplers were shown to better explore the modes of the invariant distribution ν compared to Overdamped ones (see (Neal, 2010; Betancourt et al., 2017) for empirical results and (Cheng et al., 2017) for theory), as also confirmed empirically in Appendix D for image generation tasks using GEBMs. Next, we provide the following convergence result:

Proposition 6. Assume that log η(z) is strongly concave and has a Lipschitz gradient, and that E, G and their gradients are all L-Lipschitz. Set $x_t = G(z_t)$, where $z_t$ is given by (14), and call $P_t$ the probability distribution of $x_t$. Then $P_t$ converges to Q in the Wasserstein sense, $W_2(P_t, Q) \leq LCe^{-c\gamma t}$, where c and C are positive constants independent of t, with $c = O(\exp(-\dim(Z)))$.

Proposition 6 is proved in Appendix C.1 using (Eberle et al., 2017, Corollary 2.6), and implies that $(x_t)_{t\geq 0}$ converges at the same speed as $(z_t)_{t\geq 0}$. When the dimension q of Z is orders of magnitude smaller than the input space dimension d, the process $(x_t)_{t\geq 0}$ converges faster than typical sampling methods on X, for which the exponent controlling the convergence rate is of order $O(\exp(-d))$.

## 5 Related Work

**Energy based models.** Usually, energy based models are required to have a density w.r.t. the Lebesgue measure, and do not use a learnable base measure; in other words, the models are supported on the whole space. Various methods have been proposed in the literature to learn EBMs. Contrastive Divergence (Hinton, 2002) approximates the gradient of the log-likelihood by sampling from the energy model with MCMC. More recently, (Belanger and McCallum, 2016; Xie et al., 2016; 2017; 2018c; 2019; Tu and Gimpel, 2018; Du and Mordatch, 2019; Deng et al., 2020) extend the idea using more sophisticated models and MCMC sampling strategies that lead to higher quality estimators.
Score Matching (Hyvärinen, 2005) calculates an alternative objective (the score) to the log-likelihood which is independent of the partition function, and was recently used in the context of non-parametric energy functions to provide estimators of the energy that are provably consistent (Sriperumbudur et al., 2017; Sutherland et al., 2018; Arbel and Gretton, 2018; Wenliang et al., 2019). In Noise-Contrastive Estimation (Gutmann and Hyvärinen, 2012), a classifier is trained to distinguish between samples from a fixed proposal distribution and the target P. This provides an estimate for the density ratio between the optimal energy model and the proposal distribution. In a similar spirit, Cranmer et al. (2016) use a classifier to learn likelihood ratios. Conversely, Grathwohl et al. (2020) interpret the logits of a classifier as an energy model obtained after marginalization over the classes. The resulting model is then trained using Contrastive Divergence. In more recent work, Dai et al. (2019a;b) exploit a dual formulation of the logarithm of the partition function as a supremum, over the set of all probability distributions, of some functional objective. Yu et al. (2020) explore methods for using general f-divergences, such as Jensen-Shannon, to train EBMs.

**Generative Adversarial Networks.** Recent work proposes using the discriminator of a trained GAN to improve the generator quality. Rejection sampling (Azadi et al., 2019) and Metropolis-Hastings correction (Turner et al., 2019; Neklyudov et al., 2019) perform sampling directly on the high-dimensional input space without using gradient information provided by the discriminator. Moreover, the data distribution is assumed to admit a density w.r.t. the generator. Ding et al. (2019) perform sampling on the feature space of some auxiliary pre-trained network, while Lawson et al. (2019) treat the sampling procedure as a model on its own, learned by maximizing the ELBO. In our case, no auxiliary model is needed. In the present work, sampling doesn't interfere with training, in contrast to recently considered methods that optimize over the latent space during training (Wu et al., 2019b;a). In Tanaka (2019), the discriminator is viewed as an optimal transport map between the generator and the data distribution and is used to compute optimized samples from latent space. This is in contrast to the diffusion-based sampling that we consider. In (Xie et al., 2018b;a), two independent models, a full-support EBM and a generator network, are trained cooperatively using MCMC. By contrast, in the present work, the energy and base are part of the same model, and the model support is lower-dimensional than the target space X. While we do not address the mode collapse problem, Xu et al. (2018); Nguyen et al. (2017) showed that KL-based losses are resilient to it thanks to the zero-avoiding property of the KL, a good sign for KALE, which is derived from the KL by Fenchel duality.

The closest related approach appears in a study concurrent to the present work (Che et al., 2020), where the authors propose to use Langevin dynamics on the latent space of a GAN generator, but with a different discriminator to ours (derived from the Jensen-Shannon divergence or a Wasserstein-based divergence).
Our theory results showing the existence of the loss gradient (Theorem 5), establishing weak convergence of distributions under KALE (Proposition 4), and demonstrating consistency of the KALE estimator (Appendix B) should transfer to the JS and Wasserstein criteria used in that work. Subsequent to the present work, an alternative approach has been proposed, based on normalising flows, to learn both the low-dimensional support of the data and the density on this support (Brehmer and Cranmer, 2020). This approach maximises the explicit likelihood of a data projection onto a learned manifold, and may be considered complementary to our approach.

## 6 Experiments

Figure 2: Samples at different iterations of the MCMC chain of Algorithm 3 (left to right).

### 6.1 Image Generation

**Experimental setting.** We train a GEBM on unsupervised image generation tasks, and compare the quality of generated samples with other methods using the FID score (Heusel et al., 2017) computed on $5\times 10^4$ generated samples. We consider CIFAR-10 (Krizhevsky, 2009), LSUN (Yu et al., 2015), CelebA (Liu et al., 2015) and ImageNet (Russakovsky et al., 2014), all downsampled to 32x32 resolution to reduce computational cost. We consider two network architectures for each of the base and energy, a smaller one (SNGAN ConvNet) and a larger one (SNGAN ResNet), both of which are from Miyato et al. (2018). For the base, we used the SNGAN generator networks from Miyato et al. (2018) with a 100-dimensional Gaussian for the latent noise η. For the energy, we used the SNGAN discriminator networks from Miyato et al. (2018). (Details of the networks are in Appendix G.1.) We train the models for 150,000 generator iterations using Algorithm 1. After training is completed, we rescale the energy by β = 100 to get a colder version of the GEBM, and sample from it using either Algorithm 2 (ULA) or Algorithm 3 (KLA) with parameters (γ = 100, u = 1). This colder temperature leads to an improved FID score, and needs relatively few MCMC iterations, as shown in Figure 6 of Appendix D. Sampler convergence to visually plausible modes at low temperatures is demonstrated in Figure 2. We perform 1000 MCMC iterations with an initial step-size of $\lambda = 10^{-4}$, decreased by a factor of 10 every 200 iterations. As a baseline, we consider samples generated from the base of the GEBM only (without using information from the energy), and call this KALE-GAN. More details are given in Appendix G.

**Results.** Table 1 shows that GEBM outperforms both KALE-GAN and standard GANs when using the same networks for the base/generator and energy/critic. Moreover, KALE-GAN matches the performance of a standard GAN (with Jensen-Shannon critic), showing that the improvement of GEBM cannot be explained by the switch from Jensen-Shannon to a KALE-based critic. Rather, the improvement is largely due to incorporating the energy function into the model, and sampling using Algorithm 3. This finding experimentally validates our claim that incorporating the energy improves the model, and that all else being equal, a GEBM outperforms a GAN with the same generator and critic architecture. Indeed, if the critic is not zero at convergence, then by definition it contains information on the remaining mismatch between the generator (base) and data mass, which the GEBM incorporates, but the GAN does not. The GEBM also outperforms an EBM, even when the latter was trained using a larger network (ResNet) with supervision (S) on ImageNet, which is an easier task (Chen et al., 2019).
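To make the cold-sampling configuration above concrete, here is a minimal sketch of a kinetic Langevin sampler on the latent space with the energy rescaled by an inverse temperature β. It uses a simple Euler discretization of (14) and is our illustration only, not the exact scheme of Algorithm 3 (Sachs et al., 2017); module names are hypothetical.

```python
import torch

def kinetic_langevin_sample(generator, energy, z0, beta=100.0, gamma=100.0, u=1.0,
                            step_size=1e-4, n_steps=1000):
    """Euler discretization of the kinetic dynamics (14) on the latent space, targeting the
    cold posterior nu_beta(z) proportional to eta(z) exp(-beta * E(G(z))), with eta = N(0, I)."""
    z, v = z0.detach().clone(), torch.zeros_like(z0)
    for _ in range(n_steps):
        z.requires_grad_(True)
        log_nu = -0.5 * (z ** 2).sum(dim=-1) - beta * energy(generator(z)).squeeze(-1)
        grad = torch.autograd.grad(log_nu.sum(), z)[0]
        with torch.no_grad():
            v = v - step_size * gamma * v + step_size * u * grad \
                + (2 * gamma * u * step_size) ** 0.5 * torch.randn_like(v)
            z = z + step_size * v
    return generator(z).detach()
```

With a large friction (γ = 100, as used below), the kinetic chain behaves like ULA and concentrates on modes of the GEBM; a smaller γ encourages more exploration across modes.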
More comparisons on Cifar10 and ImageNet are provided in Table 4 of Appendix D.

|          | SNGAN (ConvNet): GEBM | KALE-GAN | GAN   | SNGAN (ResNet): GEBM | KALE-GAN | GAN   | EBM       |
|----------|-----------------------|----------|-------|----------------------|----------|-------|-----------|
| Cifar10  | 23.02                 | 32.03    | 29.9  | 19.31                | 20.19    | 21.7  | 38.2      |
| ImageNet | 13.94                 | 19.37    | 20.66 | 20.33                | 21.00    | 20.50 | 14.31 (S) |

Table 1: FID scores for two versions of SNGAN from (Miyato et al., 2018) on Cifar10 and ImageNet. GEBM: training using Algorithm 1 and sampling using Algorithm 3. KALE-GAN: only the base of a GEBM is retained for sampling. GAN: training as in (Miyato et al., 2018), with q = 128 for the latent dimension as it worked best. EBM: results from Du and Mordatch (2019), with supervised training on ImageNet (S).

Table 2 shows different sampling methods using the same trained networks (generator and critic), with KALE-GAN as a baseline. All energy-exploiting methods outperform the unmodified KALE-GAN with the same architecture. That said, our method (both ULA and KLA) outperforms both IHM (Turner et al., 2019) and DOT (Tanaka, 2019), which both use the energy information.

|            | Cifar10 | LSUN  | CelebA | ImageNet |
|------------|---------|-------|--------|----------|
| KALE-GAN   | 32.03   | 21.67 | 6.91   | 19.37    |
| IHM        | 30.47   | 20.63 | 6.39   | 18.15    |
| DOT        | 26.35   | 20.41 | 5.93   | 16.21    |
| GEBM (ULA) | 23.02   | 16.23 | 5.21   | 14.00    |
| GEBM (KLA) | 24.29   | 15.25 | 5.38   | 13.94    |

Table 2: FID scores for different sampling methods using the same trained SNGAN (ConvNet); KALE-GAN serves as a baseline without critic information.

In Table 2, KLA was used in the high friction regime γ = 100, and thus behaves like ULA. This yields sharper samples concentrated around the modes of the GEBM, thus improving the FID score. If, instead, the goal is to encourage more exploration of the modes of the GEBM, then KLA with a smaller γ is a better alternative than ULA, as the former can explore multiple modes/images within the same MCMC chain, unlike ULA: see Figures 3 to 5 of Appendix D. Moving from one mode to another, however, results in an increased FID score while the chain is between modes; this can be avoided by decreasing λ.

### 6.2 Density Estimation

**Motivation.** We next consider the particular setting where the likelihood of the model is well-defined and admits a closed form expression. This is intended principally as a sanity check that our proposed training method in Algorithm 1 succeeds in learning maximum likelihood solutions. Outside of this setting, closed form expressions of the normalizing constant are not available for generic GEBMs. While this is not an issue (since the proposed method doesn't require a closed form expression for the normalizing constant), in this experiment only, we want to have access to closed form expressions, as they enable a direct comparison with other density estimation methods.

**Experimental setting.** To have a closed-form likelihood, we consider the case where the dimension of the latent space is equal to the data dimension, and choose the base G of the GEBM to be a Real NVP (Dinh et al., 2016) with density exp(-r(x)) and energy E(x) = h(x) - r(x). Thus, in this particular case, the GEBM has a well-defined likelihood over the whole space, and we are precisely in the setting of Proposition 2, which shows that this GEBM is equal to an EBM with density proportional to exp(-h). We further require the EBM to be a second Real NVP so that its density has a closed form expression. We consider 5 UCI datasets (Dheeru and Taniskidou, 2017), for which we use the same pre-processing as in (Wenliang et al., 2019). For comparison, we train the EBM by direct maximum likelihood (ML) and contrastive divergence (CD).
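In this invertible-base setting, the GEBM log-density (3) can be evaluated directly from the base log-density and the energy. As a rough illustration (a sketch with a Gaussian stand-in for the Real NVP base and a hypothetical MLP energy; the log-partition is estimated here by plain Monte Carlo rather than in closed form):

```python
import math
import torch

def gebm_log_density(x, base_dist, energy, n_mc=10_000):
    """Log-density of the GEBM (3) when the base has full support and a tractable log_prob:
    log q(x) = -E(x) + log g(x) - A_{G,E}, with A_{G,E} = log E_{Y~G}[exp(-E(Y))]
    estimated here by simple Monte Carlo over base samples."""
    with torch.no_grad():
        y = base_dist.sample((n_mc,))
        log_partition = torch.logsumexp(-energy(y).squeeze(-1), dim=0) - math.log(n_mc)
        return -energy(x).squeeze(-1) + base_dist.log_prob(x) - log_partition

# Toy usage (Gaussian stand-in for the flow base, hypothetical MLP energy):
d = 2
base = torch.distributions.MultivariateNormal(torch.zeros(d), torch.eye(d))
energy = torch.nn.Sequential(torch.nn.Linear(d, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
nll = -gebm_log_density(torch.randn(5, d), base, energy).mean()
```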
To train the GEBM, we use Algorithm 1, which doesn't directly exploit the closed-form expression of the likelihood (unlike direct ML). We thus use either (8) (KALE-DV) or (9) (KALE-F) to estimate the normalizing constant. More details are given in Appendix G.2.

**Results.** Table 3 reports the Negative Log-Likelihood (NLL) evaluated on the test set and corresponding to the best performance on the validation set. Training the GEBM using Algorithm 1 leads to comparable performance to (CD) and (ML). As shown in Figure 7 of Appendix E, (KALE-DV) and (KALE-F) maintain a small error gap between the training and test NLL and, as discussed in Section 3.1 and Appendix F, (KALE-F) leads to more accurate estimates of the log-partition function, with a relative error of order 0.1% compared to 10% for (KALE-DV).

|                  | RedWine (d=11, N≈10³) | WhiteWine (d=11, N≈10³) | Parkinsons (d=15, N≈10³) | Hepmass (d=22, N≈10⁵) | Miniboone (d=43, N≈10⁴) |
|------------------|-----------------------|-------------------------|--------------------------|-----------------------|-------------------------|
| NVP w/ ML        | 11.98                 | 13.05                   | 14.5                     | 24.89                 | 42.28                   |
| NVP w/ CD        | 11.88                 | 13.01                   | 14.06                    | **22.89**             | 39.36                   |
| NVP w/ KALE (DV) | 11.6                  | 12.77                   | **13.26**                | 26.56                 | 46.48                   |
| NVP w/ KALE (F)  | **11.19**             | **12.66**               | **13.26**                | 24.66                 | **38.35**               |

Table 3: UCI datasets: Negative log-likelihood computed on the test set and corresponding to the best performance on the validation set. Best method in boldface.

## 7 Acknowledgments

We thank Mihaela Rosca for insightful discussions and Song Liu, Bo Dai and Hanjun Dai for pointing us to important related work.

## References

Arbel, M. and Gretton, A. (2018). Kernel Conditional Exponential Family. In International Conference on Artificial Intelligence and Statistics, pages 1337–1346. Arbel, M., Gretton, A., Li, W., and Montufar, G. (2019). Kernelized Wasserstein Natural Gradient. Arbel, M., Sutherland, D., Bińkowski, M., and Gretton, A. (2018). On gradient regularizers for MMD GANs. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc. Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, International Convention Centre, Sydney, Australia. PMLR. Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. (2017). Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 224–232. PMLR. Azadi, S., Olsson, C., Darrell, T., Goodfellow, I., and Odena, A. (2019). Discriminator rejection sampling. In International Conference on Learning Representations. Belanger, D. and McCallum, A. (2016). Structured prediction energy networks. In International Conference on Machine Learning, pages 983–992. Betancourt, M., Byrne, S., Livingstone, S., and Girolami, M. (2017). The geometric foundations of Hamiltonian Monte Carlo. Bernoulli, 23(4A):2257–2298. Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. (2018). Demystifying MMD GANs. In International Conference on Learning Representations. Bottou, L., Arjovsky, M., Lopez-Paz, D., and Oquab, M. (2017). Geometrical insights for implicit generative modeling. In Braverman Readings in Machine Learning. Brehmer, J. and Cranmer, K. (2020). Flows for simultaneous manifold learning and density estimation. arXiv preprint arXiv:2003.13913. Brock, A., Donahue, J., and Simonyan, K. (2018). Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
Che, T., Zhang, R., Sohl-Dickstein, J., Larochelle, H., Paull, L., Cao, Y., and Bengio, Y. (2020). Your GAN is secretly an energy-based model and you should use discriminator driven latent sampling. Chen, T., Zhai, X., Ritter, M., Lucic, M., and Houlsby, N. (2019). Self-supervised gans via auxiliary rotation loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12154 12163. Cheng, X., Chatterji, N. S., Bartlett, P. L., and Jordan, M. I. (2017). Underdamped langevin mcmc: A non-asymptotic analysis. ar Xiv preprint ar Xiv:1707.03663. Chu, C., Minami, K., and Fukumizu, K. (2020). Smoothness and stability in gans. In International Conference on Learning Representations. Cornish, R., Caterini, A. L., Deligiannidis, G., and Doucet, A. (2020). Relaxing bijectivity constraints with continuously indexed normalising flows. Cranmer, K., Pavez, J., and Louppe, G. (2016). Approximating likelihood ratios with calibrated discriminative classifiers. Dai, B., Dai, H., Gretton, A., Song, L., Schuurmans, D., and He, N. (2019a). Kernel exponential family estimation via doubly dual embedding. In Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 2321 2330. PMLR. Dai, B., Liu, Z., Dai, H., He, N., Gretton, A., Song, L., and Schuurmans, D. (2019b). Exponential Family Estimation via Adversarial Dynamics Embedding. ar Xiv:1904.12083 [cs, stat]. ar Xiv: 1904.12083. Published as a conference paper at ICLR 2021 Davis, D. and Drusvyatskiy, D. (2018). Stochastic subgradient method converges at the rate $O(k {-1/4})$ on weakly convex functions. ar Xiv:1802.02988 [cs, math]. Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential monte carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411 436. Deng, Y., Bakhtin, A., Ott, M., Szlam, A., and Ranzato, M. (2020). Residual energy-based models for text generation. ar Xiv preprint ar Xiv:2004.11714. Dheeru, D. and Taniskidou, E. K. (2017). Uci machine learning repository. Ding, X., Wang, Z. J., and Welch, W. J. (2019). Subsampling Generative Adversarial Networks: Density Ratio Estimation in Feature Space with Softplus Loss. Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2016). Density estimation using real nvp. Donahue, J. and Simonyan, K. (2019). Large Scale Adversarial Representation Learning. ar Xiv:1907.02544 [cs, stat]. ar Xiv: 1907.02544. Donsker, M. D. and Varadhan, S. R. S. (1975). Asymptotic evaluation of certain markov process expectations for large time, i. 28(1):1 47. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpa.3160280102. Doucet, A., Freitas, N. d., and Gordon, N. (2001). Sequential Monte Carlo Methods in Practice. Information Science and Statistics. Springer-Verlag, New York. Du, Y. and Mordatch, I. (2019). Implicit generation and modeling with energy based models. In Advances in Neural Information Processing Systems 32, pages 3608 3618. Curran Associates, Inc. Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12(Jul):2121 2159. Eberle, A., Guillin, A., and Zimmer, R. (2017). Couplings and quantitative contraction rates for Langevin dynamics. The Annals of Probability. Ekeland, I. and Témam, R. (1999). Convex Analysis and Variational Problems. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics. 
Feydy, J., Séjourné, T., Vialard, F.-X., Amari, S.-i., Trouvé, A., and Peyré, G. (2019). Interpolating between optimal transport and mmd using sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2681 2690. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672 2680. Curran Associates, Inc. Grathwohl, W., Wang, K.-C., Jacobsen, J.-H., Duvenaud, D., Norouzi, M., and Swersky, K. (2020). Your classifier is secretly an energy based model and you should treat it like one. Grover, A., Song, J., Kapoor, A., Tran, K., Agarwal, A., Horvitz, E. J., and Ermon, S. (2019). Bias correction of learned generative models using likelihood-free importance weighting. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. (2017). Improved training of wasserstein gans. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA. Curran Associates Inc. Gutmann, M. U. and Hyvärinen, A. (2012). Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13(null):307 361. Haugh, M. (2017). Mcmc and bayesian modeling. IEOR E4703 Monte-Carlo Simulation, Columbia University. Published as a conference paper at ICLR 2021 Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems 30, pages 6626 6637. Curran Associates, Inc. Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771 1800. Ho, J. and Ermon, S. (2016). Generative adversarial imitation learning. In Advances in neural information processing systems, pages 4565 4573. Hyvärinen, A. (2005). Estimation of Non-Normalized Statistical Models by Score Matching. The Journal of Machine Learning Research, 6:695 709. Kanamori, T., Suzuki, T., and Sugiyama, M. (2011). f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models. IEEE Transactions on Information Theory, 58(2):708 720. Kingma, D. P. and Ba, J. (2014). Adam: A Method for Stochastic Optimization. ar Xiv:1412.6980 [cs]. ar Xiv: 1412.6980. Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. ICLR. Klenke, A. (2008). Probability Theory: A Comprehensive Course. World Publishing Corporation. Kodali, N., Abernethy, J., Hays, J., and Kira, Z. (2017). On Convergence and Stability of GANs. ar Xiv:1705.07215 [cs]. ar Xiv: 1705.07215. Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto. Lawson, J., Tucker, G., Dai, B., and Ranganath, R. (2019). Energy-inspired models: Learning with sampler-induced distributions. In Advances in Neural Information Processing Systems 32, pages 8501 8513. Curran Associates, Inc. Le Cun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F.-J. (2006). Predicting Structured Data, chapter A Tutorial on Energy-Based Learning. MIT Press. Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Poczos, B. (2017). Mmd gan: Towards deeper understanding of moment matching network. 
In Advances in Neural Information Processing Systems 30, pages 2203 2213. Curran Associates, Inc. Liu, S., Bousquet, O., and Chaudhuri, K. (2017). Approximation and Convergence Properties of Generative Adversarial Learning. Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. Milgrom, P. and Segal, I. (2002). Envelope Theorems for Arbitrary Choice Sets. Econometrica, 70. Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. (2018). Spectral normalization for generative adversarial networks. In International Conference on Learning Representations. Nagarajan, V. and Kolter, J. Z. (2017). Gradient descent gan optimization is locally stable. Neal, R. M. (2010). Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo. Neklyudov, K., Egorov, E., and Vetrov, D. (2019). The implicit metropolis-hastings algorithm. Nguyen, T., Le, T., Vu, H., and Phung, D. (2017). Dual discriminator generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2670 2680. Nguyen, X., Wainwright, M. J., and Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847 5861. Published as a conference paper at ICLR 2021 Nowozin, S., Cseke, B., and Tomioka, R. (2016). f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems 29, pages 271 279. Curran Associates, Inc. Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. (2016). Pixel recurrent neural networks. ar Xiv preprint ar Xiv:1601.06759. Ostrovski, G., Dabney, W., and Munos, R. (2018). Autoregressive quantile networks for generative modeling. ar Xiv preprint ar Xiv:1806.05575. Papamakarios, G., Pavlakou, T., and Murray, I. (2017). Masked autoregressive flow for density estimation. NIPS. Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. ar Xiv preprint ar Xiv:1511.06434. Raginsky, M., Rakhlin, A., and Telgarsky, M. (2017). Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. Retherford, J. R. (1978). Review: J. diestel and j. j. uhl, jr., vector measures. Bull. Amer. Math. Soc., 84(4):681 685. Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML 15, pages 1530 1538. JMLR.org. Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278 1286. Rockafellar, R. T. (1970). Convex analysis. Princeton Mathematical Series. Princeton University Press, Princeton, N. J. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2014). Image Net Large Scale Visual Recognition Challenge. ar Xiv:1409.0575 [cs]. ar Xiv: 1409.0575. Sachs, M., Leimkuhler, B., and Danos, V. (2017). Langevin Dynamics with Variable Coefficients and Nonconservative Forces: From Stationary States to Numerical Methods. Entropy, 19. Sanjabi, M., Ba, J., Razaviyayn, M., and Lee, J. D. (2018). On the convergence and robustness of training gans with regularized optimal transport. In Advances in Neural Information Processing Systems 31, pages 7091 7101. Curran Associates, Inc. Siegmund, D. (1976). 
Importance sampling in the monte carlo study of sequential tests. The Annals of Statistics, pages 673 684. Simon-Gabriel, C.-J. and Scholkopf, B. (2018). Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. Journal of Machine Learning Research, 19(44):1 29. Simsekli, U., Zhu, L., Teh, Y. W., and Gurbuzbalaban, M. (2020). Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise. ar Xiv:2002.05685 [cs, stat]. ar Xiv: 2002.05685. Sriperumbudur, B., Fukumizu, K., Kumar, R., Gretton, A., and Hyvärinen, A. (2017). Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research. Sugiyama, M., Suzuki, T., and Kanamori, T. (2012). Density ratio estimation in machine learning. Cambridge University Press. Sutherland, D., Strathmann, H., Arbel, M., and Gretton, A. (2018). Efficient and principled score estimation with Nystrom kernel exponential families. In International Conference on Artificial Intelligence and Statistics, pages 652 660. Published as a conference paper at ICLR 2021 Tanaka, A. (2019). Discriminator optimal transport. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc. Thekumparampil, K. K., Jain, P., Netrapalli, P., and Oh, S. (2019). Efficient algorithms for smooth minimax optimization. In Advances in Neural Information Processing Systems 32, pages 12680 12691. Curran Associates, Inc. Thiry, L., Arbel, M., Belilovsky, E., and Oyallon, E. (2021). The unreasonable effectiveness of patches in deep convolutional kernels methods. In International Conference on Learning Representations. Tsuboi, Y., Kashima, H., Hido, S., Bickel, S., and Sugiyama, M. (2009). Direct density ratio estimation for large-scale covariate shift adaptation. Journal of Information Processing, 17:138 155. Tu, L. and Gimpel, K. (2018). Learning approximate inference networks for structured prediction. ar Xiv preprint ar Xiv:1803.03376. Turner, R., Hung, J., Frank, E., Saatchi, Y., and Yosinski, J. (2019). Metropolis-Hastings generative adversarial networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6345 6353, Long Beach, California, USA. PMLR. Villani, C. (2009). Optimal transport: Old and new. Technical report. Wenliang, L., Sutherland, D., Strathmann, H., and Gretton, A. (2019). Learning deep kernels for exponential family densities. In International Conference on Machine Learning, pages 6737 6746. Wu, Y., Donahue, J., Balduzzi, D., Simonyan, K., and Lillicrap, T. (2019a). LOGAN: Latent Optimisation for Generative Adversarial Networks. ar Xiv:1912.00953 [cs, stat]. ar Xiv: 1912.00953. Wu, Y., Rosca, M., and Lillicrap, T. (2019b). Deep compressed sensing. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6850 6860, Long Beach, California, USA. PMLR. Xie, J., Lu, Y., Gao, R., and Wu, Y. N. (2018a). Cooperative learning of energy-based model and latent variable model via mcmc teaching. In AAAI, volume 1, page 7. Xie, J., Lu, Y., Gao, R., Zhu, S.-C., and Wu, Y. N. (2018b). Cooperative training of descriptor and generator networks. IEEE transactions on pattern analysis and machine intelligence, 42(1):27 45. Xie, J., Lu, Y., Zhu, S.-C., and Wu, Y. (2016). A theory of generative convnet. In International Conference on Machine Learning, pages 2635 2644. 
Xie, J., Zheng, Z., Gao, R., Wang, W., Zhu, S.-C., and Nian Wu, Y. (2018c). Learning descriptor networks for 3d shape synthesis and analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8629–8638. Xie, J., Zhu, S.-C., and Nian Wu, Y. (2017). Synthesizing dynamic patterns by spatial-temporal generative convnet. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7093–7101. Xie, J., Zhu, S.-C., and Wu, Y. N. (2019). Learning energy-based spatial-temporal generative convnets for dynamic patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence. Xu, K., Du, C., Li, C., Zhu, J., and Zhang, B. (2018). Learning implicit generative models by teaching density estimators. arXiv preprint arXiv:1807.03870. Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., and Xiao, J. (2015). LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. Yu, L., Song, Y., Song, J., and Ermon, S. (2020). Training deep energy-based models with f-divergence minimization. Zenke, F., Poole, B., and Ganguli, S. (2017). Continual learning through synaptic intelligence. Proceedings of Machine Learning Research, 70:3987. Zhang, P., Liu, Q., Zhou, D., Xu, T., and He, X. (2017). On the Discrimination-Generalization Tradeoff in GANs. arXiv:1711.02771 [cs, stat].

## A KL Approximate Lower-Bound Estimate

We discuss the relation between KALE (10) and the Kullback-Leibler divergence via Fenchel duality. Recall that a distribution P is said to admit a density w.r.t. G if there exists a real-valued measurable function $r_0$ that is integrable w.r.t. G and satisfies $dP = r_0\,dG$. Such a density is also called the Radon-Nikodym derivative of P w.r.t. G. In this case, we have:

$$\mathrm{KL}(P\|G) = \int r_0 \log(r_0)\,dG. \qquad (15)$$

Nguyen et al. (2010); Nowozin et al. (2016) derived a variational formulation for the KL using Fenchel duality. By the duality theorem (Rockafellar, 1970), the convex and lower semi-continuous function $\zeta : u \mapsto u\log(u)$ that appears in (15) can be expressed as the supremum of a concave function: $\zeta(u) = \sup_v\, uv - \zeta^\star(v)$. The function $\zeta^\star$ is called the Fenchel dual and is defined as $\zeta^\star(v) = \sup_u\, uv - \zeta(u)$. By convention, the value of the objective is set to $-\infty$ whenever u is outside of the domain of definition of ζ. When $\zeta(u) = u\log(u)$, the Fenchel dual $\zeta^\star(v)$ admits a closed form expression of the form $\zeta^\star(v) = \exp(v-1)$. Using the expression of ζ in terms of its Fenchel dual $\zeta^\star$, it is possible to express KL(P||G) as the supremum of the variational objective (16) over all measurable functions h,

$$\mathcal{F}(h) := -\int h\,dP - \int \exp(-h)\,dG + 1. \qquad (16)$$

Nguyen et al. (2010) provided the variational formulation for the reverse KL using a different choice for ζ ($\zeta(u) = -\log(u)$). We refer to (Nowozin et al., 2016) for general f-divergences. Choosing a smaller set of functions H in the variational objective (16) will lead to a lower bound on the KL. This is the KL Approximate Lower-bound Estimate (KALE):

$$\mathrm{KALE}(P\|G) = \sup_{h\in\mathcal{H}} \mathcal{F}(h). \qquad (17)$$

In general, KL(P||G) ≥ KALE(P||G). The bound is tight whenever the negative log-density $h_0 = -\log r_0$ belongs to H; however, we do not require $r_0$ to be well-defined in general. Equation (17) has the advantage that it can be estimated using samples from P and G. Given i.i.d. samples $(X_1,\dots,X_N)$ and $(Y_1,\dots,Y_M)$ from P and G, we denote by $\hat{P}$ and $\hat{G}$ the corresponding empirical distributions. A simple approach to estimate KALE(P||G) is to use an M-estimator.
The M-estimator is obtained by optimizing the penalized objective

\hat{h} := \arg\max_{h \in H} \hat{F}(h) - \frac{\lambda}{2} I^2(h),   (18)

where \hat{F} is an empirical version of F and I^2(h) is a penalty term that prevents overfitting due to finite samples. The penalty I^2(h) acts as a regularizer favoring smoother solutions, while the parameter \lambda determines the strength of the smoothing and is chosen to decrease as the sample sizes N and M increase. The M-estimator of KALE(P||G) is then obtained simply by plugging \hat{h} into the empirical objective \hat{F}:

\widehat{KALE}(P||G) := \hat{F}(\hat{h}).   (19)

We defer the consistency analysis of (19) to Appendix B, where we provide convergence rates in a setting where the set of functions H is a reproducing kernel Hilbert space, under weaker assumptions than those covered by the framework of Nguyen et al. (2010).

B  CONVERGENCE RATES OF KALE

In this section, we provide a convergence rate for the estimator in (19) when H is an RKHS. The theory remains the same whether H contains constants or not. With this choice, the representer theorem allows us to reduce the potentially infinite-dimensional optimization problem in (18) to a convex finite-dimensional one. We further restrict ourselves to the well-specified case where the density r_0 of P w.r.t. G is well defined and belongs to H, so that KALE matches the KL. While Nguyen et al. (2010) (Theorem 3) provide a convergence rate of 1/\sqrt{N} for a related M-estimator, this requires the density r_0 to be bounded away from 0 as well as (generally) upper-bounded. This can be quite restrictive if, for instance, r_0 is the density ratio of two Gaussians. In Theorem 7, we provide a similar convergence rate for the estimator defined in (19) without requiring r_0 to be bounded. We briefly introduce the notation, the working assumptions and the statement of the convergence result in Appendix B.1, and provide the proofs in Appendix B.2.

B.1  STATEMENT OF THE RESULT

We recall that an RKHS H of functions defined on a domain X \subset R^d and with kernel k is a Hilbert space with inner product \langle \cdot, \cdot \rangle, such that y \mapsto k(x,y) belongs to H for any x \in X, and k(x,y) = \langle k(x,\cdot), k(y,\cdot) \rangle for all x, y \in X. Any function f in H satisfies the reproducing property f(x) = \langle f, k(x,\cdot) \rangle for any x \in X. Recall that KALE(P||G) is obtained as an optimization problem

KALE(P||G) = \sup_{h \in H} F(h),   (20)

where F is given by

F(h) := -\int h \, dP - \int \exp(-h) \, dG + 1.

Since the negative log density ratio h_0 is assumed to belong to H, this directly implies that the supremum of F is achieved at h_0 and F(h_0) = KALE(P||G). We are interested in estimating KALE(P||G) using the empirical distributions

\hat{P} := \frac{1}{N} \sum_{n=1}^{N} \delta_{X_n}, \qquad \hat{G} := \frac{1}{N} \sum_{n=1}^{N} \delta_{Y_n},

where (X_n)_{1 \le n \le N} and (Y_n)_{1 \le n \le N} are i.i.d. samples from P and G. For this purpose we introduce the empirical objective functional

\hat{F}(h) := -\int h \, d\hat{P} - \int \exp(-h) \, d\hat{G} + 1.

The proposed estimator is obtained by solving a regularized empirical problem,

\sup_{h \in H} \hat{F}(h) - \frac{\lambda}{2} \|h\|^2,   (21)

with a corresponding population version,

\sup_{h \in H} F(h) - \frac{\lambda}{2} \|h\|^2.   (22)

Finally, we introduce D(h,\delta) and \Gamma(h,\delta):

D(h,\delta) = \int \delta \exp(-h) \, dG - \int \delta \, dP, \qquad \Gamma(h,\delta) = \int \int_0^1 (1-t)\, \delta^2 \exp(-(h+t\delta)) \, dt \, dG.

The empirical versions of D(h,\delta) and \Gamma(h,\delta) are denoted \hat{D}(h,\delta) and \hat{\Gamma}(h,\delta). Later, we will show that D(h,\delta) and \hat{D}(h,\delta) are in fact the gradients of F(h) and \hat{F}(h) along the direction \delta. We state now the working assumptions:

(i) The supremum of F over H is attained at h_0.

(ii) The following quantities are finite for some positive \epsilon:

\int \sqrt{k(x,x)} \, dP(x), \qquad \int k(x,x) \exp\big( (\|h_0\| + \epsilon) \sqrt{k(x,x)} \big) \, dG(x), \qquad \int \sqrt{k(x,x)} \exp\big( (\|h_0\| + \epsilon) \sqrt{k(x,x)} \big) \, dG(x).
(iii) For any h \in H, if D(h,\delta) = 0 for all \delta \in H, then h = h_0.

Theorem 7. Fix any 1 > \eta > 0. Under Assumptions (i) to (iii), and provided that \lambda = 1/\sqrt{N}, it holds with probability at least 1 - 2\eta that

|\hat{F}(\hat{h}) - F(h_0)| \le \frac{M(\eta, h_0)}{\sqrt{N}},

for a constant M(\eta, h_0) that depends only on \eta and h_0.

The assumptions in Theorem 7 essentially state that the kernel associated with the RKHS H needs to satisfy some integrability requirements; these guarantee that the directional derivative \delta \mapsto D(h,\delta) and its empirical version are well defined and continuous. In addition, the optimality condition \nabla F(h) = 0 is assumed to characterize the global solution h_0. This will be the case if the kernel is characteristic (Simon-Gabriel and Schölkopf, 2018). The proof of Theorem 7, in Appendix B.2, takes advantage of the Hilbert structure of the set H, the concavity of the functional F and the optimality condition \nabla \hat{F}(\hat{h}) = \lambda \hat{h} of the regularized problem, all of which turn out to be sufficient for controlling the error of (19). We state now the proof of Theorem 7, followed by the lemmas and propositions it relies on.

Proof of Theorem 7. We begin with the following inequalities:

\frac{\lambda}{2} \big( \|\hat{h}\|^2 - \|h_0\|^2 \big) \le \hat{F}(\hat{h}) - \hat{F}(h_0) \le \langle \nabla \hat{F}(h_0), \hat{h} - h_0 \rangle.

The first inequality is by definition of \hat{h}, while the second is obtained by concavity of \hat{F}. For simplicity we write B = \|\hat{h} - h_0\| and C = \|\nabla \hat{F}(h_0) - \nabla F(h_0)\|. Using the Cauchy-Schwarz and triangle inequalities, and recalling that \nabla F(h_0) = 0, it is easy to see that

-\frac{\lambda}{2} \big( B^2 + 2 B \|h_0\| \big) \le \hat{F}(\hat{h}) - \hat{F}(h_0) \le C B.

Moreover, by the triangle inequality, it holds that B \le \|h_\lambda - h_0\| + \|\hat{h} - h_\lambda\|. Lemma 11 ensures that A(\lambda) = \|h_\lambda - h_0\| converges to 0 as \lambda \to 0. Furthermore, by Proposition 12, we have \|\hat{h} - h_\lambda\| \le \frac{1}{\lambda} D(\lambda), where D(\lambda) = \|\nabla \hat{F}(h_\lambda) - \nabla F(h_\lambda)\|. Now choosing \lambda = 1/\sqrt{N} and applying the Chebyshev inequality in Lemma 8, it follows that for any 1 > \eta > 0, with probability greater than 1 - 2\eta, both

D(\lambda) \le \frac{C(\|h_0\|, \eta)}{\sqrt{N}} \qquad \text{and} \qquad C \le \frac{C(\|h_0\|, \eta)}{\sqrt{N}},

where C(\|h_0\|, \eta) is defined in Lemma 8. This allows us to conclude that for any \eta > 0, it holds with probability at least 1 - 2\eta that

|\hat{F}(\hat{h}) - F(h_0)| \le \frac{M(\eta, h_0)}{\sqrt{N}},

where M(\eta, h_0) depends only on \eta and h_0.

We proceed using the following lemma, which provides an expression for D(h,\delta) and \hat{D}(h,\delta), along with a probabilistic bound.

Lemma 8. Under Assumptions (i) and (ii), for any h \in H such that \|h\| \le \|h_0\| + \epsilon, there exists D(h) in H satisfying D(h,\delta) = \langle \delta, D(h) \rangle, and for any h \in H, there exists \hat{D}(h) satisfying \hat{D}(h,\delta) = \langle \delta, \hat{D}(h) \rangle. Moreover, for any 0 < \eta < 1 and any h \in H such that \|h\| \le \|h_0\| + \epsilon =: M, it holds with probability greater than 1 - \eta that

\| \hat{D}(h) - D(h) \| \le \frac{C(M, \eta)}{\sqrt{N}},

where C(M, \eta) depends only on M and \eta.

Proof. First, we show that \delta \mapsto D(h,\delta) is a bounded linear operator. Indeed, Assumption (ii) ensures that k(x,\cdot) and k(x,\cdot)\exp(-h(x)) are Bochner integrable w.r.t. P and G (Retherford, 1978); hence D(h,\delta) can be written as

D(h,\delta) = \langle \delta, \mu_{\exp(-h)G} - \mu_P \rangle,

where \mu_{\exp(-h)G} = \int k(x,\cdot) \exp(-h(x)) \, dG(x) and \mu_P = \int k(x,\cdot) \, dP(x). Defining D(h) to be \mu_{\exp(-h)G} - \mu_P leads to the desired result. \hat{D}(h) is simply obtained by taking the empirical version of D(h). Finally, the probabilistic inequality is a simple consequence of Chebyshev's inequality.

The next lemma states that F(h) and \hat{F}(h) are Fréchet differentiable.

Lemma 9. Under Assumptions (i) and (ii), h \mapsto F(h) is Fréchet differentiable on the open ball of radius \|h_0\| + \epsilon, while h \mapsto \hat{F}(h) is Fréchet differentiable on H. Their gradients are given by D(h) and \hat{D}(h) as defined in Lemma 8:

\nabla F(h) = D(h), \qquad \nabla \hat{F}(h) = \hat{D}(h).

Proof. The empirical functional \hat{F}(h) is differentiable since it is a finite sum of differentiable functions, and its gradient is simply given by \hat{D}(h).
For the population functional, we use a second-order Taylor expansion of the exponential with integral remainder, which gives

F(h + \delta) = F(h) + D(h,\delta) - \Gamma(h,\delta).

By Assumption (ii), we know that \Gamma(h,\delta)/\|\delta\| converges to 0 as \|\delta\| \to 0. This allows us to conclude directly that F is Fréchet differentiable, with differential given by \delta \mapsto D(h,\delta). By Lemma 8, we conclude the existence of a gradient \nabla F(h), which is in fact given by \nabla F(h) = D(h).

From now on, we will only use the notation \nabla F(h) and \nabla \hat{F}(h) to refer to the gradients of F(h) and \hat{F}(h). The following lemma states that (21) and (22) have unique global optima, and gives a first-order optimality condition.

Lemma 10. The problems (21) and (22) admit unique global solutions \hat{h} and h_\lambda in H. Moreover, the following first-order optimality conditions hold:

\lambda \hat{h} = \nabla \hat{F}(\hat{h}), \qquad \lambda h_\lambda = \nabla F(h_\lambda).

Proof. For (21), existence and uniqueness of a maximizer \hat{h} is a simple consequence of the continuity and strong concavity of the regularized objective. We now show the existence result for (22). Let us introduce G_\lambda(h) = -F(h) + \frac{\lambda}{2}\|h\|^2 for simplicity, so that (22) amounts to minimizing G_\lambda. Uniqueness is a consequence of the strong convexity of G_\lambda. For existence, consider a sequence of elements f_k \in H such that G_\lambda(f_k) \to \inf_{h \in H} G_\lambda(h). If h_0 is not the global solution, then it must hold for k large enough that G_\lambda(f_k) \le G_\lambda(h_0). We also know that F(f_k) \le F(h_0); hence, it is easy to see that \|f_k\| \le \|h_0\| for k large enough. This implies that f_k is a bounded sequence, and therefore it admits a weakly convergent subsequence by weak compactness. Without loss of generality, we assume that f_k converges weakly to some element h_\lambda \in H and that \|f_k\| \le \|h_0\|. Hence, \|h_\lambda\| \le \liminf_k \|f_k\| \le \|h_0\|. Recall now that, by definition of weak convergence in an RKHS, f_k(x) \to h_\lambda(x) for all x \in X. By Assumption (ii), we can apply the dominated convergence theorem to ensure that F(f_k) \to F(h_\lambda). Taking the limit of G_\lambda(f_k), the following inequality holds:

G_\lambda(h_\lambda) \le \liminf_k G_\lambda(f_k) = \inf_{h \in H} G_\lambda(h),

so that h_\lambda is indeed a global solution of (22). Finally, by Lemma 9 we know that F is Fréchet differentiable; hence we can use Ekeland and Témam (1999) (Proposition 2.1) to conclude that \nabla F(h_\lambda) = \lambda h_\lambda. We use exactly the same arguments for (21).

Next, we show that h_\lambda converges towards h_0 in H.

Lemma 11. Under Assumptions (i) to (iii), it holds that A(\lambda) := \|h_\lambda - h_0\| \to 0 as \lambda \to 0.

Proof. We will first prove that h_\lambda converges weakly towards h_0, and then conclude that it must also converge strongly. We start with the following inequalities:

0 \le F(h_0) - F(h_\lambda) \le \frac{\lambda}{2} \big( \|h_0\|^2 - \|h_\lambda\|^2 \big).

These are simple consequences of the definitions of h_0 and h_\lambda as optimal solutions to (20) and (22). This implies that \|h_\lambda\| is always bounded by \|h_0\|. Consider now an arbitrary sequence (\lambda_m)_{m \ge 0} converging to 0. Since \|h_{\lambda_m}\| is bounded by \|h_0\|, it follows by weak compactness of balls in H that h_{\lambda_m} admits a weakly convergent subsequence. Without loss of generality, we can assume that h_{\lambda_m} itself converges weakly towards an element h^\star. We will now show that h^\star must be equal to h_0. Indeed, by optimality of h_{\lambda_m}, it must hold that \lambda_m h_{\lambda_m} = \nabla F(h_{\lambda_m}). This implies that \nabla F(h_{\lambda_m}) converges to 0. On the other hand, by Assumption (ii), we can conclude that \nabla F(h_{\lambda_m}) must also converge weakly towards \nabla F(h^\star); hence \nabla F(h^\star) = 0. Finally, by Assumption (iii) we know that h_0 is the unique solution to the equation \nabla F(h) = 0; hence h^\star = h_0. We have shown so far that any subsequence of h_{\lambda_m} that converges weakly must converge weakly towards h_0. This allows us to conclude that h_{\lambda_m} actually converges weakly towards h_0. Moreover, by lower semi-continuity of the norm under weak convergence, we also have

\|h_0\| \le \liminf_m \|h_{\lambda_m}\|.
Recalling now that \|h_{\lambda_m}\| \le \|h_0\|, it follows that \|h_{\lambda_m}\| converges towards \|h_0\|. Hence, we have the following two properties: h_{\lambda_m} converges weakly towards h_0, and \|h_{\lambda_m}\| converges towards \|h_0\|. This allows us to conclude directly that \|h_{\lambda_m} - h_0\| converges to 0.

Proposition 12. We have that

\lambda \, \|\hat{h} - h_\lambda\| \le \| \nabla \hat{F}(h_\lambda) - \nabla F(h_\lambda) \|.

Proof. By definition of \hat{h} and h_\lambda, the following optimality conditions hold:

\lambda \hat{h} = \nabla \hat{F}(\hat{h}), \qquad \lambda h_\lambda = \nabla F(h_\lambda).

We can then simply write

\lambda (\hat{h} - h_\lambda) - \big( \nabla \hat{F}(\hat{h}) - \nabla \hat{F}(h_\lambda) \big) = \nabla \hat{F}(h_\lambda) - \nabla F(h_\lambda).

Now introducing \delta := \hat{h} - h_\lambda and E := \nabla \hat{F}(\hat{h}) - \nabla \hat{F}(h_\lambda) for simplicity, and taking the squared norm of the above equation, it follows that

\lambda^2 \|\delta\|^2 + \|E\|^2 - 2\lambda \langle \delta, E \rangle = \| \nabla \hat{F}(h_\lambda) - \nabla F(h_\lambda) \|^2.

By concavity of \hat{F} on H, we know that \langle \hat{h} - h_\lambda, E \rangle \le 0. Therefore:

\lambda^2 \|\hat{h} - h_\lambda\|^2 \le \| \nabla \hat{F}(h_\lambda) - \nabla F(h_\lambda) \|^2.

C  LATENT NOISE SAMPLING AND SMOOTHNESS OF KALE

C.1  LATENT SPACE SAMPLING

Here we prove Proposition 6, for which we make the assumptions more precise.

Assumption 1. We make the following assumptions: \log \eta is strongly concave and admits a Lipschitz gradient. There exists a non-negative constant L such that for any x, x' \in X and z, z' \in Z:

|E(x) - E(x')| \le L \|x - x'\|, \qquad \|\nabla_x E(x) - \nabla_x E(x')\| \le L \|x - x'\\|,
\|G(z) - G(z')\| \le L \|z - z'\|, \qquad \|\nabla_z G(z) - \nabla_z G(z')\| \le L \|z - z'\|.

Throughout this section, we write U(z) := -\log(\eta(z)) + E(G(z)) for simplicity.

Proof of Proposition 1. To sample from Q_{G,E}, we first need to identify the posterior latent distribution \nu_{G,E} used to produce those samples. We rely on (23), which holds by definition of Q_{G,E} for any test function h on X:

\int h(x) \, dQ(x) = \int h(G(z)) \, f(G(z)) \, \eta(z) \, dz.   (23)

Hence, the posterior latent distribution is given by \nu(z) = \eta(z) f(G(z)), and samples from the GEBM are produced by first sampling from \nu_{G,E}, then applying the implicit map G:

X \sim Q \iff X = G(Z), \quad Z \sim \nu.

Proof of Proposition 2. If the base distribution G admits a density on the whole space, denoted by \exp(-r(x)), and the energy E is of the form E(x) = \bar{E}(x) - r(x) for some parametric function \bar{E}, it is easy to see that Q has a density proportional to \exp(-\bar{E}) and is therefore equivalent to a standard EBM with energy \bar{E}. The converse holds as well, meaning that for any EBM with energy \bar{E}, it is possible to construct a GEBM using an importance-weighting strategy. This is achieved by first choosing a base G, which is required to have an explicit density \exp(-r) up to a normalizing constant, then defining the energy of the GEBM to be E(x) = \bar{E}(x) - r(x), so that

dQ(x) \propto \exp(-E(x)) \, dG_\theta(x) \propto \exp(-\bar{E}(x)) \, dx.   (24)

Equation (24) effectively depends only on \bar{E}(x) and not on G, since the factor \exp(r) exactly compensates for the density of G. The requirement that the base also admits a tractable implicit map G can be met by choosing G to be a normalizing flow (Rezende and Mohamed, 2015), and does not restrict the class of possible EBMs that can be expressed as GEBMs.
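To make this sampling procedure concrete, the following is a minimal sketch of an Euler discretization of the kinetic Langevin dynamics (14) targeting the posterior \nu(z) \propto \eta(z)\exp(-E(G(z))). It assumes a standard Gaussian latent prior \eta, and the step size, friction gamma, coefficient u and number of steps are illustrative choices, not the exact sampler used in the experiments.

```python
import math
import torch

def sample_gebm_latents(energy, generator, z0, n_steps=1000,
                        step=1e-2, gamma=1.0, u=1.0):
    """Euler discretization of the kinetic Langevin dynamics (14),
    targeting nu(z) ∝ eta(z) exp(-E(G(z))) over latents.
    Assumes eta = N(0, I); `energy` and `generator` stand for E and G."""
    z = z0.clone()              # e.g. z0 = torch.randn(64, latent_dim)
    v = torch.zeros_like(z)     # velocities
    for _ in range(n_steps):
        z_ = z.detach().requires_grad_(True)
        # U(z) = -log eta(z) + E(G(z)); for eta = N(0, I) the prior term
        # contributes 0.5 * ||z||^2 up to an additive constant.
        potential = 0.5 * (z_ ** 2).sum() + energy(generator(z_)).sum()
        grad_U = torch.autograd.grad(potential, z_)[0]
        noise = torch.randn_like(v)
        v = v - step * (gamma * v + u * grad_U) \
              + math.sqrt(2.0 * gamma * u * step) * noise
        z = z + step * v
    return generator(z).detach()   # GEBM samples: X = G(Z), Z ~ nu
```

Proposition 6, proved next, quantifies the convergence of the continuous-time dynamics underlying this scheme to the posterior \nu.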
Proof of Proposition 6. Let \pi_t be the probability distribution of (z_t, v_t) at time t of the diffusion in (14), which we recall:

dz_t = v_t \, dt, \qquad dv_t = -\big( \gamma v_t + u \nabla U(z_t) \big) \, dt + \sqrt{2 \gamma u} \, dB_t.

We call \pi_\infty its corresponding invariant distribution, given by

\pi_\infty(z, v) \propto \exp\Big( -\Big( U(z) + \frac{1}{2u} \|v\|^2 \Big) \Big).

By Lemma 13 we know that U is dissipative, bounded from below, and has a Lipschitz gradient. This allows us to directly apply (Eberle et al., 2017, Corollary 2.6), which implies that

W_2(\pi_t, \pi_\infty) \le C \exp(-tc),

where c is a positive constant and C depends only on \pi_\infty and the initial distribution \pi_0. Moreover, the constant c is given explicitly in (Eberle et al., 2017, Theorem 2.3) and is of order O(e^{-q}), where q is the dimension of the latent space Z. We now consider an optimal coupling \Pi_t between \pi_t and \pi_\infty. Given joint samples ((z_t, v_t), (z, v)) from \Pi_t, we consider the following samples in input space: (x_t, x) := (G(z_t), G(z)). Since z_t and z have marginals \pi_t and \pi_\infty, it is easy to see that x_t \sim P_t and x \sim Q. Therefore, by definition of the W_2 distance, we have the following bound:

W_2^2(P_t, Q) \le E\big[ \|x_t - x\|^2 \big]
 = \int \|G(z_t) - G(z)\|^2 \, d\Pi_t(z_t, z)
 \le L^2 \int \|z_t - z\|^2 \, d\Pi_t(z_t, z)
 \le L^2 W_2^2(\pi_t, \pi_\infty)
 \le C^2 L^2 \exp(-2tc).

The second line uses the definition of (x_t, x) as joint samples obtained by mapping (z_t, z); the third line uses the assumption that the base G is L-Lipschitz; the fourth uses the fact that \Pi_t is an optimal coupling between \pi_t and \pi_\infty; and the last line uses the convergence bound above.

Lemma 13. Under Assumption 1, there exist A > 0 and \lambda \in (0, \tfrac{1}{4}] such that

\frac{1}{2} z^\top \nabla U(z) \ge \lambda \Big( U(z) + \frac{\gamma^2}{4u} \|z\|^2 \Big) - A, \qquad \forall z \in Z,   (25)

where \gamma and u are the coefficients appearing in (14). Moreover, U is bounded below and has a Lipschitz gradient.

Proof. For simplicity, let us write u(z) = -\log \eta(z) and w(z) = E(G(z)) (not to be confused with the scalar coefficient u appearing in (14)), and denote by M an upper bound on the Lipschitz constants of w and \nabla w, which is guaranteed to be finite by assumption. Hence U(z) = u(z) + w(z). Equation (25) is equivalent to having

z^\top \nabla u(z) - 2\lambda u(z) - \frac{\lambda \gamma^2}{2u} \|z\|^2 \ge 2\lambda w(z) - z^\top \nabla w(z) - 2A.   (26)

Using that w is Lipschitz, we have w(z) \le w(0) + M\|z\| and z^\top \nabla w(z) \ge -M\|z\|. Hence,

2\lambda w(z) - z^\top \nabla w(z) - 2A \le 2\lambda w(0) + (2\lambda + 1) M \|z\| - 2A.

Therefore, a sufficient condition for (26) to hold is

z^\top \nabla u(z) - 2\lambda u(z) - \frac{\lambda \gamma^2}{2u} \|z\|^2 \ge (2\lambda + 1) M \|z\| - 2A + 2\lambda w(0).   (27)

We now rely on the strong convexity of u, which holds by assumption and implies the existence of a positive constant m > 0 such that

u(z) \le u(0) + z^\top \nabla u(z) - \frac{m}{2} \|z\|^2, \qquad z^\top \nabla u(z) \ge z^\top \nabla u(0) + m \|z\|^2.

This allows us to write the following inequality:

z^\top \nabla u(z) - 2\lambda u(z) - \frac{\lambda \gamma^2}{2u} \|z\|^2
 \ge (1 - 2\lambda)\, z^\top \nabla u(z) + \lambda \Big( m - \frac{\gamma^2}{2u} \Big) \|z\|^2 - 2\lambda u(0)
 \ge \Big( m - \lambda \Big( m + \frac{\gamma^2}{2u} \Big) \Big) \|z\|^2 - (1 - 2\lambda) \|\nabla u(0)\| \, \|z\| - 2\lambda u(0).

Combining the previous inequality with (27) and denoting M' = \|\nabla u(0)\|, it is sufficient to find A and \lambda satisfying

\Big( m - \lambda \Big( m + \frac{\gamma^2}{2u} \Big) \Big) \|z\|^2 - \big( M' + M + 2\lambda (M - M') \big) \|z\| - 2\lambda \big( u(0) + w(0) \big) + 2A \ge 0.

The l.h.s. of the above inequality is a quadratic function of \|z\| which admits a global minimum whenever \lambda \le m \big( m + \frac{\gamma^2}{2u} \big)^{-1}, and the global minimum is non-negative provided that A is large enough. To see that U is bounded below, it suffices to note, by Lipschitzness of w, that w(z) \ge w(0) - M\|z\|, and by strong convexity of u that u(z) \ge u(0) - M'\|z\| + \frac{m}{2}\|z\|^2. Hence, U is lower-bounded by a quadratic function of \|z\| with positive leading coefficient \frac{m}{2}, and must therefore be lower-bounded by a constant. Finally, by assumption, u and w have Lipschitz gradients, which directly implies that U has a Lipschitz gradient.

Proof of Proposition 3. By assumption, KL(P||G) < +\infty; this implies that P admits a density w.r.t. G, which we call r(x). As a result, P also admits a density w.r.t. Q, given by Z \exp(E^\star(x)) r(x). We can then compute KL(P||Q) explicitly:

KL(P||Q) = E_P[E^\star] + \log(Z) + E_P[\log(r)] = -L_{P,G}(E^\star) + KL(P||G).

Since 0 belongs to E and by optimality of E^\star, we know that L_{P,G}(E^\star) \ge L_{P,G}(0) = 0. The result then follows directly.

C.2  TOPOLOGICAL AND SMOOTHNESS PROPERTIES OF KALE

Topological properties of KALE. Denseness and smoothness of the energy class E are key to guaranteeing that KALE is a reliable criterion for measuring convergence. We thus make the following assumptions on E:

(A) For all E \in E, -E \in E, and there is C_E > 0 such that cE \in E for 0 \le c \le C_E. For any continuous function g, any compact set K in X and any precision \epsilon > 0, there exists a finite linear combination of energies G = \sum_{i=1}^r a_i E_i such that \sup_{x \in K} |g(x) - G(x)| \le \epsilon.

(B) All energies E in E are Lipschitz in their input, with the same Lipschitz constant L > 0.
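Assumption (B) corresponds to the kind of uniform Lipschitz control that is enforced in practice by regularizing the energy during training (spectral normalization or a gradient penalty, as discussed next). As an illustration only, a gradient-penalty term in the spirit of Gulrajani et al. (2017) could be added to the energy loss as sketched below; the interpolation scheme and penalty weight are illustrative assumptions rather than the exact choices of Section 6.

```python
import torch

def gradient_penalty(energy, x_data, x_gen, weight=10.0):
    """One common way to (approximately) enforce a Lipschitz constraint
    on the energy, as in Assumption (B): penalize deviations of the
    input-gradient norm from 1 at points interpolated between data and
    generated samples."""
    eps = torch.rand(x_data.size(0), *([1] * (x_data.dim() - 1)),
                     device=x_data.device)
    x_hat = (eps * x_data + (1.0 - eps) * x_gen).requires_grad_(True)
    e = energy(x_hat).sum()
    grad = torch.autograd.grad(e, x_hat, create_graph=True)[0]
    grad_norm = grad.flatten(start_dim=1).norm(dim=1)
    return weight * ((grad_norm - 1.0) ** 2).mean()

# During the energy update, this term is simply added to the negative
# generalized-likelihood loss before back-propagation.
```

Spectral normalization (Miyato et al., 2018) achieves a similar effect architecturally, by constraining the spectral norm of each layer of the energy network.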
Assumption (A) holds in particular when E contains feedforward networks with a given number of parameters; in fact, networks with a single neuron are enough, as shown in (Zhang et al., 2017, Theorem 2.3). Assumption (B) holds when additional regularization of the energy is enforced during training, by methods such as spectral normalization (Miyato et al., 2018) or a gradient penalty (Gulrajani et al., 2017), as done in Section 6. Proposition 4 states the topological properties of KALE ensuring that it can be used as a criterion for weak convergence. A proof is given in Appendix C.2.1 and is a consequence of (Zhang et al., 2017, Theorem B.1).

Proposition 14. Under Assumptions (A) and (B), it holds that:
1. KALE(P||G) \ge 0, with KALE(P||G) = 0 if and only if P = G.
2. KALE(P||G_n) \to 0 if and only if G_n \to P under the weak topology.

C.2.1  TOPOLOGICAL PROPERTIES OF KALE

In this section we prove Proposition 4. We first recall the required assumptions and make them more precise:

Assumption 2. Assume the following holds: The set X is compact. For all E \in E, -E \in E, and there is C_E > 0 such that cE \in E for 0 \le c \le C_E. For any continuous function g, any compact set K in X and any precision \epsilon > 0, there exists a finite linear combination of energies G = \sum_{i=1}^r a_i E_i such that |g(x) - G(x)| \le \epsilon on K. All energies E in E are Lipschitz in their input, with the same Lipschitz constant L > 0.

For simplicity we consider the set H = E + R, i.e., H is the set of functions h of the form h = E + c where E \in E and c \in R. In all that follows, P_1 is the set of probability distributions with finite first-order moment. We consider the notion of weak convergence on P_1 as defined in (Villani, 2009, Definition 6.8), which is equivalent to convergence in the Wasserstein-1 distance W_1.

Proof of Proposition 4. We proceed by proving the separation property (first statement), then the metrization of the weak topology (second statement).

Separation. We have by Assumption 2 that 0 \in E; hence, by definition, KALE(P||G) \ge F_{P,G}(0) = 0. On the other hand, whenever P = G, it holds that

F_{P,G}(h) = -\int \big( \exp(-h) + h - 1 \big) \, dP, \qquad \forall h \in H.

Moreover, by convexity of the exponential, we know that \exp(-x) + x - 1 \ge 0 for all x \in R. Hence, F_{P,G}(h) \le F_{P,G}(0) = 0 for all h \in H. This directly implies that KALE(P||G) = 0. For the converse, we use the same argument as in the proof of (Zhang et al., 2017, Theorem B.1). Assume that KALE(P||G) = 0 and let h be in H. By Assumption 2, there exists C_h > 0 such that ch \in H for 0 \le c \le C_h, and we have F(ch) \le KALE(P||G) = 0. Now dividing by c and taking the limit c \to 0, it is easy to see that

-\int h \, dP + \int h \, dG \le 0.

Again, by Assumption 2, we also know that -h \in H; hence

\int h \, dP - \int h \, dG \le 0.

This necessarily implies that \int h \, dP - \int h \, dG = 0 for all h \in H. By the density of H in the set of continuous functions on compact sets, we conclude that the equality holds for any continuous and bounded function, which in turn implies that P = G.

Metrization of the weak topology. We first show that for any P and G with finite first moment, it holds that KALE(P||G) \le L W_1(P, G), where W_1(P, G) is the Wasserstein-1 distance between P and G. For any h \in H, the following holds:

F(h) = -\int h \, dP - \int \exp(-h) \, dG + 1
 = \int h(x) \, dG(x) - \int h(x') \, dP(x') - \underbrace{\int \big( \exp(-h) + h - 1 \big) \, dG}_{\ge 0}
 \le \int h(x) \, dG(x) - \int h(x') \, dP(x')
 \le L W_1(P, G).

The first inequality results from the convexity of the exponential, while the last one is a consequence of h being L-Lipschitz. This allows us to conclude that KALE(P||G) \le L W_1(P, G) after taking the supremum over all h \in H.
Moreover, since W_1 metrizes weak convergence on P_1 (Villani, 2009, Theorem 6.9), whenever a sequence G_n converges weakly towards P in P_1 we also have W_1(P, G_n) \to 0, and thus KALE(P||G_n) \to 0. The converse is a direct consequence of (Liu et al., 2017, Theorem 10), since by assumption X is compact.

Well-defined learning. Assume that for any \epsilon > 0 and any h and h' in E there exists f in 2E such that \|h + h' - f\|_\infty \le \epsilon. Then there exists a constant C such that

KALE(P||Q) \le C \, KALE(P||G).

This means that the proposed learning procedure, which first finds the optimal energy E^\star given a base G by maximum likelihood and then minimizes KALE(P||G), ends up minimizing the distance between the data and the generalized energy-based model Q. Indeed,

KALE(P||Q) = \sup_{h \in E} L_{P, Q_{G, E^\star}}(h) = -KALE(P||G) + \sup_{h \in E} L_{P,G}(h + E^\star).

Let us choose \epsilon = \tfrac{1}{2} KALE(P||G), and for a given h \in E let f \in 2E be such that \|h + E^\star - f\|_\infty \le \epsilon. Since \tfrac{1}{2} f = \tfrac{1}{2}(h + E^\star) + \tfrac{1}{2}\big( f - (h + E^\star) \big), concavity of L_{P,G} gives

L_{P,G}(h + E^\star) \le 2 L_{P,G}\big( \tfrac{1}{2} f \big) - L_{P,G}\big( f - (h + E^\star) \big).

By assumption, \|h + E^\star - f\|_\infty \le \epsilon; thus |L_{P,G}(f - (h + E^\star))| \le 2\epsilon. Moreover, L_{P,G}(\tfrac{1}{2} f) \le KALE(P||G) since \tfrac{1}{2} f \in E. This ensures that

L_{P,G}(h + E^\star) \le 3 \, KALE(P||G).

Finally, we have shown that

KALE(P||Q) \le 2 \, KALE(P||G).

Hence, minimizing KALE(P||G) directly minimizes KALE(P||Q).

C.2.2  SMOOTHNESS PROPERTIES OF KALE

We will now prove Theorem 5. We begin by stating the assumptions used in this section:

(I) E is parametrized by a compact set of parameters \Psi.

(II) Functions in E are jointly continuous w.r.t. (\psi, x), and are L_e-Lipschitz and L_e-smooth w.r.t. the input x:

|E_\psi(x) - E_\psi(x')| \le L_e \|x - x'\|, \qquad \|\nabla_x E_\psi(x) - \nabla_x E_\psi(x')\| \le L_e \|x - x'\|.

(III) (\theta, z) \mapsto G_\theta(z) is jointly continuous in \theta and z, with z \mapsto G_\theta(z) uniformly Lipschitz w.r.t. z:

\|G_\theta(z) - G_\theta(z')\| \le L_b \|z - z'\|, \qquad \forall z, z' \in Z, \ \theta \in \Theta.

There exist non-negative functions a and b defined from Z to R such that \theta \mapsto G_\theta(z) is a-Lipschitz and b-smooth in the following sense:

\|G_\theta(z) - G_{\theta'}(z)\| \le a(z) \|\theta - \theta'\|, \qquad \|\nabla_\theta G_\theta(z) - \nabla_\theta G_{\theta'}(z)\| \le b(z) \|\theta - \theta'\|.

Moreover, a and b are integrable in the following sense:

\int a(z)^2 \exp(2 L_e L_b \|z\|) \, d\eta(z) < \infty, \qquad \int \exp(L_e L_b \|z\|) \, d\eta(z) < \infty, \qquad \int b(z) \exp(L_e L_b \|z\|) \, d\eta(z) < \infty.

To simplify notation, we denote by L_\theta(E) the generalized likelihood of the energy E under P with base G_\theta; in other words,

L_\theta(E) := L_{P, G_\theta}(E) = -\int E \, dP - \log \int \exp(-E) \, dG_\theta.

We also denote by p_{E,\theta} the density of the model w.r.t. G_\theta,

p_{E,\theta} = \frac{\exp(-E)}{Z_{G_\theta, E}}, \qquad Z_{G_\theta, E} = \int \exp(-E) \, dG_\theta.

We write K(\theta) := KALE(P||G_\theta) to emphasize the dependence on \theta.

Proof of Theorem 5. To show that sub-gradient methods converge to local optima, we only need to show that K is Lipschitz continuous and weakly convex; this directly implies convergence to local optima for sub-gradient methods, according to Davis and Drusvyatskiy (2018); Thekumparampil et al. (2019). Lipschitz continuity ensures that K is differentiable for almost all \theta \in \Theta, and weak convexity simply means that there exists some positive constant C \ge 0 such that \theta \mapsto K(\theta) + C\|\theta\|^2 is convex. We now proceed to show these two properties.

We first prove that \theta \mapsto K(\theta) is weakly convex. By Lemma 15, we know that for any E \in E, the function \theta \mapsto L_\theta(E) is M-smooth for the same positive constant M. This directly implies that it is also weakly convex, and the following inequality holds for \theta_t := t\theta + (1-t)\theta':

L_{\theta_t}(E) \le t L_\theta(E) + (1-t) L_{\theta'}(E) + \frac{M}{2} t (1-t) \|\theta - \theta'\|^2.

Taking the supremum w.r.t. E, it follows that

K(\theta_t) \le t K(\theta) + (1-t) K(\theta') + \frac{M}{2} t (1-t) \|\theta - \theta'\|^2.

This means precisely that K is weakly convex in \theta. To prove that K is Lipschitz, we will also use Lemma 15, which states that L_\theta(E) is Lipschitz in \theta uniformly in E.
Hence, the following holds:

L_\theta(E) \le L_{\theta'}(E) + L_e C \|\theta - \theta'\|.

Again, taking the supremum over E, it follows directly that K(\theta) \le K(\theta') + L_e C \|\theta - \theta'\|. We conclude that K is Lipschitz by exchanging the roles of \theta and \theta' to obtain the other side of the inequality. Hence, by Rademacher's theorem, K is differentiable for almost all \theta.

We now provide an expression for the gradient of K. By Lemma 16 we know that \psi \mapsto L_\theta(E_\psi) is continuous, and by Assumption (I), \Psi is compact. Therefore, the supremum \sup_{E \in E} L_\theta(E) is achieved for some function E^\star_\theta. Moreover, we know by Lemma 15 that L_\theta(E) is smooth uniformly in E; therefore the family (\theta \mapsto L_\theta(E))_{E \in E} is equi-differentiable. We are thus in a position to apply Milgrom and Segal (2002) (Theorem 3), which ensures that K(\theta) admits left and right directional derivatives given by

\partial_e^+ K(\theta) = \lim_{t \to 0,\, t > 0} \nabla_\theta L_\theta(E^\star_{\theta + te})^\top e, \qquad \partial_e^- K(\theta) = \lim_{t \to 0,\, t < 0} \nabla_\theta L_\theta(E^\star_{\theta + te})^\top e,   (28)

where e is a given direction in R^r. Moreover, the theorem also states that K(\theta) is differentiable iff t \mapsto E^\star_{\theta + te} is continuous at t = 0. Now, recalling that K(\theta) is differentiable for almost all \theta, it must hold that E^\star_{\theta + te} \to E^\star_\theta as t \to 0 and \partial_e^+ K(\theta) = \partial_e^- K(\theta) for almost all \theta. This implies that the two limits in (28) are in fact equal to \nabla_\theta L_\theta(E^\star_\theta)^\top e. The gradient of K, whenever defined, is therefore given by

\nabla_\theta K(\theta) = Z_{G_\theta, E^\star_\theta}^{-1} \int \nabla_x E^\star_\theta(G_\theta(z))^\top \nabla_\theta G_\theta(z) \exp\big( -E^\star_\theta(G_\theta(z)) \big) \, \eta(z) \, dz.

Lemma 15. Under Assumptions (I) to (III), the functional L_\theta(E) is Lipschitz and smooth in \theta, uniformly in E:

|L_\theta(E) - L_{\theta'}(E)| \le L_e C \|\theta - \theta'\|, \qquad \|\nabla_\theta L_\theta(E) - \nabla_\theta L_{\theta'}(E)\| \le 2 C L_e (1 + L_e) \|\theta - \theta'\|.

Proof. By Lemma 16, we have that L_\theta(E) is differentiable, with

\nabla_\theta L_\theta(E) = \int (\nabla_x E \circ G_\theta)^\top \nabla_\theta G_\theta \, (p_{E,\theta} \circ G_\theta) \, d\eta.

Lemma 16 also ensures that \|\nabla_\theta L_\theta(E)\| is bounded by L_e C for a positive constant C independent of E and \theta. This implies in particular that L_\theta(E) is Lipschitz in \theta with constant L_e C. We now show that it is also smooth. For this, we need to control the difference D := \|\nabla_\theta L_\theta(E) - \nabla_\theta L_{\theta'}(E)\|. By the triangle inequality,

D \le \underbrace{ \int \| \nabla_x E \circ G_\theta - \nabla_x E \circ G_{\theta'} \| \, \| \nabla_\theta G_\theta \| \, (p_{E,\theta} \circ G_\theta) \, d\eta }_{I}
 + \underbrace{ \int \| \nabla_x E \circ G_{\theta'} \| \, \| \nabla_\theta G_\theta - \nabla_\theta G_{\theta'} \| \, (p_{E,\theta} \circ G_\theta) \, d\eta }_{II}
 + \underbrace{ \int \| \nabla_x E \circ G_{\theta'} \| \, \| \nabla_\theta G_{\theta'} \| \, | p_{E,\theta} \circ G_\theta - p_{E,\theta'} \circ G_{\theta'} | \, d\eta }_{III}.

The first term can be upper-bounded using the L_e-smoothness of E and the fact that G_\theta is a-Lipschitz in \theta:

I \le L_e \|\theta - \theta'\| \int a^2 \, (p_{E,\theta} \circ G_\theta) \, d\eta \le L_e C \|\theta - \theta'\|,

where the last inequality is obtained by Lemma 17. Similarly, using that \nabla_\theta G_\theta is b-Lipschitz in \theta, it follows by Lemma 17 that

II \le L_e \|\theta - \theta'\| \int b \, (p_{E,\theta} \circ G_\theta) \, d\eta \le L_e C \|\theta - \theta'\|.

Finally, for the last term III, we first consider the path \theta_t = t\theta + (1-t)\theta' for t \in [0,1], and introduce the function s(t) := p_{E,\theta_t} \circ G_{\theta_t}. We will now control the difference p_{E,\theta} \circ G_\theta - p_{E,\theta'} \circ G_{\theta'}, which is equal to s(1) - s(0). Using the fact that t \mapsto s(t) is absolutely continuous, we have s(1) - s(0) = \int_0^1 s'(t) \, dt. The derivative s'(t) is given by

s'(t) = (\theta - \theta')^\top \big( \bar{M}_t - M_t \big) s(t), \qquad M_t = (\nabla_x E \circ G_{\theta_t})^\top \nabla_\theta G_{\theta_t}, \qquad \bar{M}_t = \int M_t \, (p_{E,\theta_t} \circ G_{\theta_t}) \, d\eta.

Hence,

s(1) - s(0) = (\theta - \theta')^\top \int_0^1 \big( \bar{M}_t - M_t \big) s(t) \, dt.

We also know that \|M_t\| is upper-bounded by L_e a(z), which implies

III \le L_e^2 \|\theta - \theta'\| \int_0^1 \Big( \int a(z)^2 s(t)(z) \, d\eta(z) + \Big( \int a(z) s(t)(z) \, d\eta(z) \Big)^2 \Big) dt \le L_e^2 (C + C^2) \|\theta - \theta'\|,

where the last inequality is obtained using Lemma 17. This allows us to conclude that L_\theta(E) is smooth, uniformly over E \in E and \theta \in \Theta.

Lemma 16. Under Assumptions (II) and (III), it holds that \psi \mapsto L_\theta(E_\psi) is continuous, and that \theta \mapsto L_\theta(E_\psi) is differentiable in \theta with gradient given by

\nabla_\theta L_\theta(E) = \int (\nabla_x E \circ G_\theta)^\top \nabla_\theta G_\theta \, (p_{E,\theta} \circ G_\theta) \, d\eta.

Moreover, the gradient is bounded uniformly in \theta and E:

\|\nabla_\theta L_\theta(E)\| \le L_e \Big( \int \exp(-L_e L_b \|z\|) \, d\eta(z) \Big)^{-1} \int a(z) \exp(L_e L_b \|z\|) \, d\eta(z).

Proof. To show that \psi \mapsto L_\theta(E_\psi) is continuous, we use the dominated convergence theorem.
We fix \psi_0 in the interior of \Psi and consider a compact neighborhood W of \psi_0. By assumption, (\psi, x) \mapsto E_\psi(x) and (\psi, z) \mapsto E_\psi(G_\theta(z)) are jointly continuous. Hence, |E_\psi(0)| and |E_\psi(G_\theta(0))| are bounded on W by some constant C. Moreover, by Lipschitz continuity of x \mapsto E_\psi, we have

|E_\psi(x)| \le |E_\psi(0)| + L_e \|x\| \le C + L_e \|x\|,
\exp(-E_\psi(G_\theta(z))) \le \exp(-E_\psi(G_\theta(0))) \exp(L_e L_b \|z\|) \le \exp(C) \exp(L_e L_b \|z\|).

Recalling that P admits a first-order moment and that, by Assumption (III), \exp(L_e L_b \|z\|) is integrable w.r.t. \eta, it follows by the dominated convergence theorem and by composition of continuous functions that \psi \mapsto L_\theta(E_\psi) is continuous at \psi_0.

To show that \theta \mapsto L_\theta(E_\psi) is differentiable in \theta, we use the differentiation lemma in (Klenke, 2008, Theorem 6.28). We first fix \theta_0 in the interior of \Theta and consider a compact neighborhood V of \theta_0. Since \theta \mapsto |E(G_\theta(0))| is continuous on the compact neighborhood V, it admits a maximum value C; hence, using Assumptions (II) and (III),

\exp(-E(G_\theta(z))) \le \exp(-E(G_\theta(0))) \exp(L_e L_b \|z\|) \le \exp(C) \exp(L_e L_b \|z\|).

Along with the integrability condition in Assumption (III), this ensures that z \mapsto \exp(-E(G_\theta(z))) is integrable w.r.t. \eta for all \theta in V. We also have that \exp(-E(G_\theta(z))) is differentiable, with gradient given by

\nabla_\theta \exp(-E(G_\theta(z))) = -\nabla_x E(G_\theta(z))^\top \nabla_\theta G_\theta(z) \exp(-E(G_\theta(z))).

Using that E is Lipschitz in its inputs and that G_\theta(z) is Lipschitz in \theta, and combining with the previous inequality, it follows that

\|\nabla_\theta \exp(-E(G_\theta(z)))\| \le \exp(C) \, L_e \, a(z) \exp(L_e L_b \|z\|),

where a(z) is the location-dependent Lipschitz constant introduced in Assumption (III). The r.h.s. of the above inequality is integrable by Assumption (III) and is independent of \theta on the neighborhood V. Thus (Klenke, 2008, Theorem 6.28) applies, and it follows that

\nabla_\theta \int \exp(-E(G_{\theta_0}(z))) \, d\eta(z) = -\int \nabla_x E(G_{\theta_0}(z))^\top \nabla_\theta G_{\theta_0}(z) \exp(-E(G_{\theta_0}(z))) \, d\eta(z).

We can now directly compute the gradient of L_\theta(E):

\nabla_\theta L_\theta(E) = \Big( \int \exp(-E(G_{\theta_0})) \, d\eta \Big)^{-1} \int \nabla_x E(G_{\theta_0})^\top \nabla_\theta G_{\theta_0} \exp(-E(G_{\theta_0})) \, d\eta.

Since E and G_\theta are Lipschitz in x and \theta respectively, it follows that \|\nabla_x E(G_{\theta_0}(z))\| \le L_e and \|\nabla_\theta G_{\theta_0}(z)\| \le a(z). Hence, we have

\|\nabla_\theta L_\theta(E)\| \le L_e \int a(z) \, (p_{E,\theta} \circ G_\theta(z)) \, d\eta(z).

Finally, Lemma 17 allows us to conclude that \nabla_\theta L_\theta(E) is bounded by a positive constant independent of \theta and E.

Lemma 17. Under Assumptions (II) and (III), there exists a constant C independent of \theta and E such that

\int a^i(z) \, (p_{E,\theta} \circ G_\theta(z)) \, d\eta(z) \le C, \qquad i \in \{1, 2\}.