# pseudospherical_contrastive_divergence__700433c3.pdf Pseudo-Spherical Contrastive Divergence Lantao Yu Computer Science Department Stanford University lantaoyu@cs.stanford.edu Jiaming Song Computer Science Department Stanford University tsong@cs.stanford.edu Yang Song Computer Science Department Stanford University yangsong@cs.stanford.edu Stefano Ermon Computer Science Department Stanford University ermon@cs.stanford.edu Energy-based models (EBMs) offer flexible distribution parametrization. However, due to the intractable partition function, they are typically trained via contrastive divergence for maximum likelihood estimation. In this paper, we propose pseudo-spherical contrastive divergence (PS-CD) to generalize maximum likelihood learning of EBMs. PS-CD is derived from the maximization of a family of strictly proper homogeneous scoring rules, which avoids the computation of the intractable partition function and provides a generalized family of learning objectives that include contrastive divergence as a special case. Moreover, PS-CD allows us to flexibly choose various learning objectives to train EBMs without additional computational cost or variational minimax optimization. Theoretical analysis on the proposed method and extensive experiments on both synthetic data and commonly used image datasets demonstrate the effectiveness and modeling flexibility of PS-CD, as well as its robustness to data contamination, thus showing its superiority over maximum likelihood and f-EBMs. 1 Introduction Energy-based models (EBMs) provide a unified framework for generative and discriminative learning by capturing dependencies between random variables with an energy function. Due to the absence of the normalization constraint, EBMs offer much more flexibility in distribution parametrization and architecture design compared to properly normalized probabilistic models such as autoregressive models [53, 23], flow-based models [14, 15, 46] and sum-product networks [72]. Recently, deep EBMs have achieved considerable success in realistic image generation [17, 66, 12, 30], molecular modeling [90] and model-based planning [16], thanks to modern deep neural networks [54, 49, 35] for parametrizing expressive energy functions and improved Markov Chain Monte Carlo (MCMC) techniques [62, 75, 40, 17, 66] for efficiently sampling from EBMs. Training EBMs consists of finding an energy function that assigns low energies to correct configurations of variables and high energies to incorrect ones [55], where a central concept is the loss functional that is used to measure the quality of the energy function and is minimized during training. The flexibility of EBMs does not come for free: it makes the design of loss functionals particularly challenging, as it usually involves the partition function that is generally intractable to compute. As a result, EBMs are typically trained via CD [37], which belongs to the analysis by synthesis scheme [31] and performs a sampling-based estimation of the gradient of KL between data distribution and energy-based distribution. Since different loss functionals will induce different solutions in practice (when the model is mis-specified and data is finite) and KL may not provide the right inductive 35th Conference on Neural Information Processing Systems (Neur IPS 2021). bias [25, 91], inspired by the great success of implicit generative models [28, 67, 3], [89] proposed a variational framework to train EBMs by minimizing general f-divergences [10]. Although this framework enables us to specify various modeling preferences such as diversity/quality tradeoff, they rely on learning additional components (variational functions) within a minimax framework, where the optimization is complicated by the notion of Nash equilibrium and local optimality [42] and suffers from instability and non-convergence issues [59]. Along this line, noise contrastive estimation (NCE) [34] can train EBMs with a family of loss functionals induced by different Bregman divergences. However, in practice, NCE usually relies on carefully-designed noise distribution such as context-dependent noise distribution [41] or joint learning of a flow-based noise distribution [21]. In this paper, we draw inspiration from statistical decision theory [13] and propose a novel perspective for designing loss functionals for training EBMs without involving auxiliary models or variational optimization. Specifically, to generalize maximum likelihood training of EBMs, we focus on maximizing pseudo-spherical scoring rules [76, 27], which are strictly proper such that the data distribution is the unique optimum and homogeneous such that they can be evaluated without the knowledge of the normalization constant. Under the analysis by synthesis scheme used in CD and f-EBM [89], we then derive a practical algorithm termed Pseudo-Spherical Contrastive Divergence (PS-CD). Different from f-EBM, PS-CD enables us to specify flexible modeling preferences without requiring additional computational cost or unstable minimax optimization. We provide a theoretical analysis on the sample complexity and convergence property of PS-CD, as well as its connections to maximum likelihood training. With experiments on both synthetic data and commonly used image datasets, we show that PS-CD achieves significant sample quality improvement over conventional maximum likelihood training and competitive performance to f-EBM without expensive variational optimization. Based on a set of recently proposed generative model evaluation metrics [61], we further demonstrate the various modeling tradeoffs enabled by PS-CD, justifying its modeling flexibility. Moreover, PS-CD is also much more robust than CD in face of data contamination. 2 Preliminaries 2.1 Energy-Based Distribution Representation and Sampling Given a set of i.i.d. samples {xi}N i=1 from some unknown data distribution p(x) defined over the sample space X Rm, the goal of generative modeling is to learn a θ-parametrized probability distribution qθ(x) to approximate the data distribution p(x). In the context of energy-based modeling, instead of directly parametrizing a properly normalized distribution, we first parametrize an unnormalized energy function Eθ : X R, which further defines a normalized probability density via the Boltzmann distribution: qθ(x) = qθ(x) Zθ = exp( Eθ(x)) where Zθ := R X exp( Eθ(x))dx is the partition function (normalization constant). In this paper, unless otherwise stated, we will use q to denote an unnormalized density and q to denote the corresponding normalized distribution. We also assume that the exponential of the negative energy belongs to the L1 space, E := Eθ : X R : R X exp( Eθ(x))dx < , i.e., Zθ is finite. Since energy-based models (EBMs) represent a probability distribution by assigning unnormalized scalar values (energies) to the data points, we can use any model architecture that outputs a bounded real number given an input to implement the energy function, which allows extreme flexibility in distribution parametrization. However, it is non-trivial to sample from an EBM, usually requiring MCMC [75] techniques. Specifically, in this work we consider using Langevin dynamics [62, 88], a gradient-based MCMC method that performs noisy gradient descent to traverse the energy landscape and arrive at the low-energy configurations: xt = xt 1 ϵ 2 x Eθ( xt 1) + ϵzt, (2) where zt N(0, I). The distribution of x T converges to the model distribution qθ(x) exp( Eθ(x)) when ϵ 0 and T under some regularity conditions [88]. In order to sample from an energy-based distribution efficiently, many scalable techniques have been proposed such as learning non-convergent, non-persistent, short-run MCMC [66] and using a sample replay buffer to improve mixing time and sample diversity [17]. In this work, we leverage these recent advances when we need to obtain samples from an EBM. 2.2 Maximum Likelihood Training of EBMs via Contrastive Divergence The predominant approach to training explicit density generative models is to approximately minimize the KL divergence between the (empirical) data distribution and model distribution. Minimizing KL divergence is equivalent to the following maximum likelihood estimation (MLE) objective: min θ LMLE(θ; p) = min θ Ep(x) [log qθ(x)] = min θ Ep(x)[Eθ(x)] + log Zθ. (3) Because of the intractable partition function (an integral over the sample space), we cannot directly optimize the above MLE objective. To tackle this issue, [37] proposed contrastive divergence (CD) algorithm as a convenient way to estimate the gradient of LMLE(θ; p) using samples from qθ: θLMLE(θ; p) = Ep(x)[ θEθ(x)] + θ log Zθ = Ep(x)[ θEθ(x)] Eqθ(x)[ θEθ(x)], (4) which can be interpreted as decreasing the energies of real data from p and increasing the energies of fake data generated by qθ. As discussed above, evaluating Equation (4) typically relies on MCMC methods such as the Langevin dynamics sampling procedure defined in Equation (2) to produce samples from the model distribution qθ, which induces a surrogate gradient estimation: θLCD K(θ; p) = Ep(x)[ θEθ(x)] Eq K θ (x)[ θEθ(x)], (5) where q K θ denotes the distribution after K steps of MCMC transitions from an initial distribution (typically data distribution or uniform distribution), and Equation (4) corresponds to LCD . 2.3 Strictly Proper Scoring Rules Stemming from statistical decision theory [13], scoring rules evaluate the quality of probabilistic forecasts by assigning numerical scores based on the predictive distributions and the events that materialize. Formally, consider a compact sample space X. Let M be a space of all locally 1integrable non-negative finite measures and P be a subspace consisting of all probability measures on the sample space X. A scoring rule S(x, q) specifies the utility of forecasting using a probability forecast q P for a given sample x X. With slightly abused notation, we write the expected score of S(x, q) under a reference distribution p as: S(p, q) := Ep(x)[S(x, q)]. (6) Definition 1 (Proper Scoring Rules [26]). A scoring rule S : X P R is called proper relative to P if the corresponding expected score satisfies: p, q P, S(p, q) S(p, p). (7) It is strictly proper if the equality holds if and only if q = p. In prediction and elicitation problems, strictly proper scoring rules encourage the forecaster to make honest predictions based on their true beliefs [22]. In estimation problems, where we want to approximate a distribution p with another parametric distribution qθ, strictly proper scoring rules provide attractive learning objectives: arg max qθ Pθ S(p, qθ) = arg max qθ Pθ Ep(x)[S(x, qθ)] = p (when p Pθ). (8) When a scoring rule S is strictly proper relative to P, the associated generalized entropy function and divergence function are defined as: G(p) := sup q P S(p, q) = S(p, p), D(p, q) := S(p, p) S(p, q) 0. (9) G(p) is convex and represents the maximally achievable utility, while D(p, q) is the Bregman divergence [6] associated with the convex function G and the equality holds only when p = q. Next, we introduce a specific kind of scoring rules that are particularly suitable for learning unnormalized statistical models. Definition 2 (Homogeneous Scoring Rules [69]). A scoring rule is homogeneous if it satisfies (here the domain of the score function is extended to X M): λ > 0, x X, S(x, q) = S(x, λ q). (10) Since scaling the model distribution q by a positive constant λ does not change the value of a homogeneous scoring rule, such homogeneity allows us to evaluate it without computing the intractable partition function of an energy-based distribution. Thus, strictly proper and homogeneous scoring rules are natural candidates for new training objectives of EBMs. Example 1. A notable example of scoring rules is the widely used logarithm score: S(x, q) = log q(x). The associated generalized entropy is the negative Shannon entropy: G(p) = Ep(x)[log p(x)], and the associated Bregman divergence is the KL divergence: D(p, q) = Ep(x) [log(p(x)/q(x))]. From Definitions 1 and 2, we know that the logarithm score is strictly proper but not homogeneous. Specifically, for a θ-parametrized energy-based distribution qθ = qθ/Zθ, since S(x, qθ) = S(x, Zθ qθ) = S(x, qθ) + log Zθ = S(x, qθ) and log Zθ cannot be ignored during the optimization of θ, we need to use tailored methods such as contrastive divergence [37] or doubly dual embedding [11] to tackle the intractable partition function. 3 Training EBMs by Maximizing Homogeneous Scoring Rules In this section, we derive a new principle for training EBMs from the perspective of optimizing strictly proper homogeneous scoring rules. All proofs for this section can be found in Appendix B. 3.1 Pseudo-Spherical Scoring Rule In this section, we introduce the pseudo-spherical scoring rule, which is a representative family of strictly proper homogeneous scoring rules that have great potentials for training deep energy-based models and allow flexible and convenient specification of modeling preferences, yet have not been explored before in the context of energy-based generative modeling. Definition 3 (Pseudo-Spherical Scoring Rule [76, 27]). For γ > 0, the pseudo-spherical scoring rule is defined as: S(x, q) := q(x)γ X q(y)γ+1dy) γ γ+1 = q(x)γ X q(y)γ+1dy) γ γ+1 = q(x) where q γ+1 := R X q(y)γ+1dy 1 γ+1 . The expected pseudo-spherical score under a reference distribution p is defined as: Sps(p, q) := Ep(x)[S(x, q)] = Ep(x)[q(x)γ] X q(y)γ+1dy) Example 2. The classic spherical scoring rule [19] is a special case in the pseudo-spherical family, which corresponds to γ = 1: S(x, q) = q(x) X q(y)2dy) 1 2 = q(x) The family of pseudo-spherical scoring rules is appealing because it introduces a different and principled way for assessing a probability forecast. For example, the spherical scoring rule has an interesting geometric interpretation. Suppose the sample space X contains n mutually exclusive and exhaustive outcomes. Then a probability forecast can be represented as a vector q = (q1, . . . , qn). Let vector p = (p1, . . . , pn) represent the oracle probability forecast. The expected spherical score can be written as: S(p, q) = Ep(x)[S(x, q)] = P i piqi p P i q2 i = p 2 p, q p 2 q 2 = p 2 cos( (p, q)) (14) where p, q and (p, q) denote the inner product and the angle between vectors p and q respectively. In other words, when we want to evaluate the expected spherical score of a probability forecast q under real data distribution p using samples, the angle between p and q is the sufficient statistics. Since we know that both p and q belong to the probability simplex P = {v| P x X v(x) = 1 and x X, v(x) 0.}, the expected score will be minimized if and only if the angle is zero, which implies p = q. More importantly, since all we need to do is to minimize the angle of the deviation, we are allowed to scale q by a constant. Specifically, when q is an energy-based distribution q = exp( E1) P i exp( Ei), . . . , exp( En) P i exp( Ei) , we can instead evaluate and minimize the angle between data distribution p and the unnormalized distribution q = (exp( E1), . . . , exp( En)), since (p, q) = (p, q). More generally, we have the following theorem to justify the use of pseudo-spherical scoring rules for training energy-based models: Theorem 1 ([26, 70]). Pseudo-spherical scoring rule is strictly proper and homogeneous. As the original definition of pseudo-spherical scoring rule (Equation (11)) takes the form of a fraction, for computational considerations, in this paper we instead focus on optimizing its composite scoring rule (Definition 2.1 in [44]): Definition 4 (γ-score [20]). For the expected pseudo-spherical score Sps(p, q) defined in Equation (12) with γ > 0, the expected γ-score is defined as: Sγ(p, q) := 1 γ log(Sps(p, q)) = 1 γ log Ep(x)[q(x)γ] log( q γ+1) (15) γ log(u) is strictly increasing in u, Sγ(p, q) is a strictly proper homogeneous composite score: arg max q P Sγ(p, q) = arg max q P 1 γ log(Sps(p, q)) = arg max q P Sps(p, q) = p. (16) 3.2 Pseudo-Spherical Contrastive Divergence Suppose we parametrize the energy-based model distribution as qθ qθ = exp( Eθ) and we want to minimize the negative γ-score in Equation (15): min θ Lγ(θ; p) = min θ 1 γ log Ep(x)[qθ(x)γ] + log( qθ γ+1) (17) In the following theorem, we derive the gradient of Lγ(θ; p) with respect to θ: Theorem 2. The gradient of Lγ(θ; p) with respect to θ can be written as: θLγ(θ; p) = 1 γ θ log Ep(x)[exp( γEθ(x))] Erθ(x)[ θEθ(x)] (18) where the auxiliary distribution rθ is also an energy-based distribution defined as: rθ(x) := qθ(x)γ+1 R X qθ(x)γ+1dx = exp( (γ + 1)Eθ(x)) R X exp( (γ + 1)Eθ(x))dx. In App. B.1, we provide two different ways to prove the above theorem. The first one is more straightforward and directly differentiates through the term log( qθ γ+1). The second one leverages a variational representation of log( qθ γ+1), where the optimal variational distribution happens to take an analytical form of r θ(x) qθ(x)γ+1, thus avoiding the minimax optimization in other variational frameworks such as [89, 11, 12]. The main challenge in maximizing γ-score is that it is generally intractable to exactly compute the gradient of the second term in Equation (15). During training, estimating the second term of Equation (18) requires us to obtain samples from the auxiliary distribution rθ exp( (γ + 1)Eθ), while at test time, we want to sample from the model distribution qθ exp( Eθ) that approximates the data distribution. Due to the restrict regularity conditions on the convergence of Langevin dynamics, in practice, we found it challenging to use the iterative sampling process in Equation (2) with a fixed number of transition steps and step size to produce samples from rθ and qθ simultaneously, as the temperature γ + 1 in rθ simply amounts to a linear rescaling to the energy function during training. Thus for generality, as in contrastive divergence [37, 17, 66] and f-EBM [89], we make the minimal assumption that we only have a sampling procedure to produce samples from qθ for both learning and inference procedures. In this case, we can leverage the analytical form of rθ and self-normalized importance sampling [68] (which has been used to derive gradient estimators in other contexts such as importance weighted autoencoder [7, 18]) to obtain a consistent estimation of Equation (18): Algorithm 1 Pseudo-Spherical Contrastive Divergence. 1: Input: Empirical data distribution pdata. Pseudo-spherical scoring rule hyperparameter γ. 2: Initialize energy function Eθ. 3: repeat 4: Draw a minibatch of samples {x+ 1 , . . . , x+ N} from pdata. 5: Draw a minibatch of samples {x 1 , . . . , x N} from qθ exp( Eθ) (e.g., using Langevin dynamics with a sample replay buffer). 6: Update the energy function by stochastic gradient descent: \ θLN γ (θ; p) = θ 1 i=1 exp( γEθ(x+ i )) PN i=1 exp( γEθ(x i )) θEθ(x i ) PN i=1 exp( γEθ(x i )) 7: until Convergence Theorem 3. Let x+ 1 , . . . , x+ N be i.i.d. samples from p(x) and x 1 , . . . , x N be i.i.d. samples from qθ(x) exp( Eθ(x)). Define the gradient estimator as: \ θLN γ (θ; p) := θ 1 γ log i=1 exp( γEθ(x+ i )) PN i=1 ωθ(x i ) θEθ(x i ) PN i=1 ωθ(x i ) (19) where the self-normalized importance weight ωθ(x i ) := rθ(x i )/qθ(x i ) = exp( γEθ(x i )). Then the gradient estimator converges to the true gradient (Equation (18)) in probability, i.e., ϵ > 0: lim N P \ θLN γ (θ; p) θLγ(θ; p) ϵ = 0. We summarize the pseudo-spherical contrastive divergence (PS-CD) training procedure in Algorithm 1. In Appendix A, we also provide a simple Py Torch implementation for stochastic gradient descent (SGD) with the gradient estimator in Equation (19). 3.3 Connections to Maximum Likelihood Estimation and Extension to γ < 0 From Equation (9) in Section 2.3, we know that γ-score induces the following Bregman divergence (the divergence function associated with proper composite scoring rule is analogously defined in Def. 2.1 in [44]): Dγ(p, qθ) = Sγ(p, p) Sγ(p, qθ) and maximizing γ-score is equivalent to minimizing Dγ(p, qθ). In the following lemma, we show that when γ 0, Dγ(p, qθ) will recover the KL divergence between p and qθ, and the gradient of PS-CD will recover the gradient of CD. Lemma 1. When γ 0, we have: lim γ 0 Dγ(p, qθ) = DKL(p qθ); lim γ 0 θLγ(θ; p) = θLMLE(θ; p). Inspired by [86, 56] that generalize Rényi divergence beyond its definition to negative orders, we now consider the extension of γ-score with γ < 0 (although it may not be strictly proper for these γ values). The following lemma shows that maximizing such scoring rule amounts to maximizing a lower bound of logarithm score (MLE) with an additional Rényi entropy regularization. Lemma 2. When 1 γ < 0, we have: Sγ(p, q) Ep(x)[log q(x)] + γ γ + 1Hγ+1(q) where Hγ+1(q) is the Rényi entropy of order γ + 1. 4 Theoretical Analysis In this section, to gain a deeper understanding of our PS-CD algorithm and how the proposed estimator behaves, we analyze its sample complexity and convergence property under certain conditions. All the proofs for this section can be found in Appendix C. 4.1 Sample Complexity We start with analyzing the sample complexity of the consistent gradient estimator in Equation (19), that is how fast it approaches the true gradient value. We first make the following assumption: Assumption 1. The energy function is bounded by K and the gradient is bounded by L: x X, θ Θ, |Eθ(x)| K, θEθ(x) L. This assumption is typically easy to satisfy because in practice we always use a bounded sample space (e.g. normalizing images to [0,1] or truncated Gaussian) to ensure stability. For example, in image modeling experiments, we use L2 regularization on the outputs of the energy function (hence bounded energy values), as well as normalized inputs and spectral normalization [60] for the neural network that realizes the energy function (hence Lipschitz continuous with bounded gradient). With vector Bernstein inequality [47, 32], we have the following theorem showing a sample complexity of O log(1/δ) ϵ2 such that the estimation error is less than ϵ with probability at least 1 δ: Theorem 4. For any constants ϵ > 0 and δ (0, 1), when the number of samples N satisfies: N 32L2e8γK (1 + 4 log(2/δ)) P \ θLN γ (θ; p) θLγ(θ; p) ϵ 1 δ. 4.2 Convergence Typically, convergence of SGD are analyzed for unbiased gradient estimators, while the gradient estimator in PS-CD is asymptotically consistent but biased. Building on the sample complexity bound above and prior theoretical works for analyzing SGD [8, 24], we analyze the convergence of PS-CD. For notational convenience, we use L(θ) to denote the loss function Lγ(θ; p) = Sγ(p, qθ). Besides Assumption 1, we further make the following assumption on the smoothness of L(θ): Assumption 2. The loss function L(θ) is M-smooth (with M > 0): θ1, θ2 Θ, L(θ1) L(θ2) M θ1 θ2 . This is a common assumption for analyzing first-order optimization methods, which is also used in [24, 8]. Also this is a relatively mild assumption since we do not require the loss function to be convex in θ. Since in non-convex optimization, the convergence criterion is typically measured by gradient norm, following [64, 24], we use L(θ) ξ to judge whether a solution θ is approximately a stationary point. Theorem 5. For any constants α (0, 1) and δ (0, 1), suppose that the step sizes satisfy ηt < 2(1 α)/M and the sample size Nt used for estimating bgt is sufficiently large (satisfying Equation (36)). Let L denote the minimum value of L(θ). Then with probability at least 1 δ, the output of Algorithm 2 (in Appendix C.2), bθ, satisfies (constant C := αML2e4γK): E[ L(bθ) 2] < 2(L(θ1) L ) + 12C PT t=1 η2 t PT t=1(2(1 α)ηt Mη2 t ) The above theorem implies the following corollary, which shows a typical convergence rate of O(1/ T) for non-convex optimization problems: Corollary 1. Under the conditions in Theorem 5 except that we use constant step sizes: ηt = min{(1 α)/M, 1/ T} for t = 1, . . . , T. Then with probability at least 1 δ, we have (constant C := αML2e4γK): E[ L(bθ) 2] < 2M(L(θ1) L ) (1 α)2T + 2(L(θ1) L ) + 12C In Appendix C.2, we discuss more on the strongly convex (Theorem 7) and convex cases (Theorem 8). 5 Related Work Direct KL Minimization. Under the analysis by synthesis scheme, [37] proposed Contrastive Divergence (CD), which estimates the gradient of the log-partition function (arising from KL) using samples from some MCMC procedure. To improve the mixing time of MCMC, [17] proposed to employ Persistent CD and a replay buffer to store intermediate samples from Markov chains throughout training, and [66] proposed to learn non-convergent short-run MCMC. Both approaches (long-run and short-run MCMC) work well with PS-CD in our experiments. PS-CD may also benefit from recent advances on unbiased MCMC [40, 73], which we leave as interesting future work. Fenchel Duality. By exploiting the primal-dual view of KL, recent works [11, 12, 2] proposed to cast maximum likelihood training of EBMs as minimax problems, which introduce a dual sampler and are approximately solved by alternating gradient descent ascent updates. Along this line, to allow flexible modeling preferences, [89] proposed f-EBM to enable the use of any f-divergence to train EBMs, which also relies on minimax optimization. By contrast, in this work, we leverage the analytical form of the optimal variational distribution and self-normalized importance sampling to reach a framework that requires no adversarial training and has no additional computational cost compared to CD while allowing flexible modeling preferences. Besides convenient optimization, PS-CD and f-EBM trains EBMs with two different families of divergences (hence complementary) with KL being the only shared one, since any pseudo-spherical scoring rule corresponds to a Bregman divergence (Section 2.3) and the only member in f-divergence that is also Bregman divergence is α-divergence (with KL as special case) (Theorem 4 in [1]). Homogeneous Scoring Rules. [84] proposed to learn unnormalized statistical models on discrete sample space by maximizing γ-score, which uses empirical data distribution (bp(x) = nx/n, where nx is the number of appearance of x in the dataset and n is the total number of data) as a surrogate to the real data density p and relies on a localization trick to bypass the computation of qθ γ+1. Consequently, it is only amenable to finite discrete sample space such as natural language [51], whereas PS-CD is applicable to any unnormalized probabilistic model in continuous domains. Another popular homogeneous scoring rule is the Hyvärinen score, which gives rise to the score matching objective [39] for EBM training. However, score matching and its variants [87, 82] have difficulties in low data density regions and do not perform well in practice when training EBMs on high-dimensional datasets [80]. Moreover, since the score matching objective involves the Hessian of log-density functions that is generally expensive to compute [58], methods such as approximate propagation [45], curvature propagation [58] and sliced score matching [83] are needed to approximately compute the trace of the Hessian. Noise Contrastive Estimation. Another principle for learning EBMs is Noise Contrastive Estimation (NCE) [34], where an EBM is learned by contrasting a prescribed noise distribution with tractable density against the unknown data distribution. Using various Bregman divergences, NCE can be generalized to a family of different loss functionals [33, 85]. However, finding an appropriate noise distribution for NCE is highly non-trivial. In practice, NCE typically works well in conjunction with a carefully-designed noise distribution such as context-dependent noise distribution [41] or joint learning of a flow-based noise distribution [21]. In this work, we focus on generalizing maximum likelihood by deriving novel training objectives for EBMs without involving auxiliary models (e.g., the variational function in [89], the flow-based noise distribution in [21] and the amortized sampler in [50, 12, 29]). 6 Experiments In this section, we demonstrate the effectiveness of PS-CD on several 1-D and 2-D synthetic datasets as well as commonly used image datasets. Setup. The 2-D synthetic datasets include Cosine, Swiss Roll, Moon, Mixture of Gaussian, Funnel and Rings, which cover different modalities and geometries (see Figure 2 in App. D.1 for illustration). To test the practical usefulness, we use MNIST [54], CIFAR-10 [48] and Celeb A [57] in our experiments for modeling natural images. Following [80], for Celeb A, we first center-crop the images to 140 140, then resize them to 64 64. More experimental details about the data processing, model architectures, sampling strategies and additional experimental results can be found in App. D. Real Data PS-CD (gamma=-0.5) KL (gamma=0) PS-CD (gamma=0.5) PS-CD (gamma=1.0) Figure 1: The effects of different γ values when fitting a mixture of Gaussian with a single Gaussian. Effects of Different γ Values. To illustrate the modeling flexibility brought by PS-CD and provide insights on the effects of different γ values, we first conduct a 1-D synthetic experiment similar to the one in [89]. As shown in Figure 1, when fitting a mixture of Gaussian with a single Gaussian (i.e., model mis-specification case), the family of PS-CD offers flexible tradeoffs between quality and diversity (i.e., mode collapse vs. mode coverage). Although in the well-specified case these objectives induce the same optimal solution, in this example, we can see that a larger γ leads to higher entropy. More importantly, compared to f-EBM [89] that also provides similar modeling flexibility and includes CD as a special case, PS-CD does not require expensive and unstable minimax optimization (no additional computational cost compared to CD). In App. D.2, we further visualize the objective landscapes for different γ values. As shown in Figure 3 and 4, when the model is well-specified, different objectives will induce the same optimal solution since they are strictly proper; when the model is mis-specified (corresponding to practical scenarios), different objectives will exhibit different modeling preferences. Furthermore, to better demonstrate the modeling flexibility brought by PS-CD in high-dimensional case, we conduct experiments on CIFAR-10 using a set of indicative and reliable metrics (Density, Coverage, Precision, Recall) proposed by [61] to evaluate the effects of γ from various perspectives. Please refer to App. D.3 for experimental results (Table 4) and detailed discussions. Table 1: FID scores for CIFAR-10 conditional and Celeb A unconditional image generation. We list comparisons with results reported by CD [17], Noise-Conditional Score Network (NCSN) [81] and f-EBMs [89]. γ = 1.0 corresponds to maximizing spherical scoring rule (Example 2). CIFAR-10 (32 32) Conditional Contrastive Divergence (KL) 37.90 f-EBM (KL) 37.36 f-EBM (Reverse KL) 33.25 f-EBM (Squared Hellinger) 32.19 f-EBM (Jensen Shannon) 30.86 Pseudo-Spherical CD (γ = 2.0) 33.19 Pseudo-Spherical CD (γ = 1.0) 29.78 Pseudo-Spherical CD (γ = 0.5) 35.02 Pseudo-Spherical CD (γ = 0.5) 27.95 Celeb A (64 64) Contrastive Divergence (KL) 26.10 NCSN (w/o denoising) 26.89 NCSN (w/ denoising) 25.30 NCSNv2 (w/o denoising) 28.86 NCSNv2 (w/ denoising) 10.23 Pseudo-Spherical CD (γ = 1.0) 24.76 Pseudo-Spherical CD (γ = 0.5) 20.35 2-D Synthetic Data. For quantitative evaluation of the 2-D synthetic data experiments, we follow [79] and report the maximum mean discrepancy (MMD, [5]) between the generated samples and validation samples in Table 3 in App. D.1, which demonstrates that PS-CD outperforms its CD counterpart on all but the Funnel dataset. From the histograms of samples shown in Figure 2 in App. D.1, we can also have similar observations. For example, CD fails to place high densities in the center of the right mode in Mo G, while PS-CD places the modes correctly. Image Generation. In Figure 5 in App.D.4, we show MNIST, CIFAR-10 and Celeb A samples produced by PS-CD (with γ = 1.0), which demonstrate that our approach can produce highly realistic images with simple model architectures. As suggested in [17], we use Fréchet Inception Distance (FID) [36] as the quantitative evaluation metric for CIFAR-10 and Celeb A, as Langevin dynamics converge to local minima that artificially inflate Inception Score [77]. From Table 11, we can see that various members (different γ values) in the family of PS-CD can outperform CD significantly, and more surprisingly, PS-CD also shows competitive performance to the recently proposed f-EBMs, without requiring expensive minimax optimization. While our method currently does not outperform the stateof-the-art image generation methods such as improved denoising score matching [81], which relies on carefully selected noise schedule and specially designed noise-dependent score network (modified U-Net architecture, hence not directly comparable to our results), we think that our work opens up a new research direction by bridging statistical decision theory (homogeneous proper scoring rules) and deep energy-based generative modeling. Moreover, 1For Celeb A dataset, we reproduced the short-run MCMC method [66] using our code base. Moreover, the f-EBM paper only reported results on Celeb A 32x32 and we empirically found it is not comparable to our method in Celeb A 64x64 (higher resolution), indicating better scalability of PS-CD to high-dimensional case. Table 2: Robustness to data contamination on Gaussian datasets. The data distribution is N( 1, 0.5) and the contamination distribution is N(2, 0.05). We measure the KL divergence between clean target distribution p and converged model distribution qθ, DKL(p qθ). Contamination Ratio CD PS-CD (γ = 0.5) PS-CD (γ = 1.0) PS-CD (γ = 2.0) 0.01 0.0067 1e-5 1e-7 1e-6 0.05 0.0851 0.00027 1.6e-6 0.00011 0.1 0.1979 0.00173 1.86e-6 0.00012 0.2 0.3869 0.1858 6.4e-6 0.00017 0.3 0.5438 0.5429 0.3118 0.00029 under the setting of simple model architectures and the same hyperparameter configuration (e.g., batch size, learning rate, network structure, etc.), our empirical results suggest clear superiority of PS-CD over traditional CD and recent f-EBMs. OOD Detection & Robustness to Data Contamination. We further test our methods on out-ofdistribution (OOD) detection tasks. For the conditional CIFAR-10 model, we follow the evaluation protocol in [17] and use s(x) = maxy Y E(x, y) as the score for detecting outliers. We use SVHN [65], Textures [9], Uniform/Gaussian Noise, CIFAR-10 Linear Interpolation and Celeb A as the OOD datasets. We summarize the results in Table 5 in App. D.6, from which we can see that PS-CD consistently outperforms CD and other likelihood-based models. Inspired by the OOD detection performance and previous work on robust parameter estimation under data contamination [43], we further test the robustness of CD and PS-CD on both synthetic and natural image datasets. Specifically, suppose p(x) is the underlying data distribution and there is another contamination distribution ω(x), e.g. uniform noise. In generative modeling under data contamination, our model observes i.i.d. samples from the contaminated distribution p(x) = cp(x) + (1 c)ω(x), where 1 c [0, 1/2) is the contamination ratio. A theoretical advantage of pseudo-spherical score is its robustness to data contamination: the optimal solution of minθ Sps( p, qθ) is close to that of minθ Sps(p, qθ) under some conditions (e.g. the density of ω(x) mostly lies in the region for which the target density p(x) is small) [20, 43]. From Table 2, we can see that CD suffers from data contamination severely: as the contamination ratio increases, the performance degrades drastically. By contrast, PS-CD shows good robustness against data contamination and a larger γ leads to better robustness. For example, PS-CD with γ = 1.0 can properly approximate the target distribution when the contamination ratio is 0.2, while PS-CD with γ = 2.0 can do so when the contamination ratio is up to 0.3. We conduct similar experiments on MNIST and CIFAR-10 datasets, where we use uniform noise as the contamination distribution and the contamination ratio is 0.1 (i.e. 10% images in the training set are replaced with random noise). After a warm-up pretraining2 (when the model has some OOD detection ability), we train the model with the contaminated data and measure the training progress. We observe that CD gradually generates more random noise and diverge after a few training steps, while PS-CD is very robust. As shown in Table 6 in App. D.6, for a slightly pre-trained unconditional CIFAR-10 model (a simple 5-layer CNN with FID of 68.77), we observe that the performance of CD degrades drastically in terms of FID, while PS-CD can continuously improve the model even using the contaminated data. We provide visualizations and theoretical explanations in App. D.6. 7 Conclusion From the perspective of maximizing strictly proper homogeneous scoring rules, we propose pseudospherical contrastive divergence (PS-CD) to generalize maximum likelihood estimation of energybased models. Different from prior works that involve joint training of auxiliary models [89, 21, 50, 12, 29], PS-CD allows us to specify flexible modeling preferences without additional computational cost compared to contrastive divergence. We provide a theoretical analysis on the sample complexity and convergence property of the proposed method, as well as its connection to maximum likelihood. Finally, we demonstrate the effectiveness of PS-CD with extensive experiments on both synthetic data and commonly used image datasets. 2Note that it is impossible for a randomly initialized model to be robust to data contamination since without additional inductive bias, it will simply treat the contaminated distribution p(x) as the target. Acknowledgements This research was supported by NSF(#1651565, #1522054, #1733686), ONR (N000141912145), AFOSR (FA95501910024), ARO (W911NF-21-1-0125) and Sloan Fellowship. [1] Shun-ichi Amari. Divergence function, information monotonicity and information geometry. In Workshop on information theoretic methods in science and engineering (WITMSE). Citeseer, 2009. [2] Michael Arbel, Liang Zhou, and Arthur Gretton. Kale: When energy-based learning meets adversarial training. ar Xiv preprint ar Xiv:2003.05033, 2020. [3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. ar Xiv preprint ar Xiv:1701.07875, 2017. [4] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. Mine: mutual information neural estimation. ar Xiv preprint ar Xiv:1801.04062, 2018. [5] Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, Bernhard Schölkopf, and Alex J Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49 e57, 2006. [6] Lev M Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR computational mathematics and mathematical physics, 7(3):200 217, 1967. [7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. ar Xiv preprint ar Xiv:1509.00519, 2015. [8] Jie Chen and Ronny Luss. Stochastic gradient descent with biased but consistent gradient estimators. ar Xiv preprint ar Xiv:1807.11880, 2018. [9] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606 3613, 2014. [10] Imre Csiszár. Eine informationstheoretische ungleichung und ihre anwendung auf beweis der ergodizitaet von markoffschen ketten. Magyer Tud. Akad. Mat. Kutato Int. Koezl., 8:85 108, 1964. [11] Bo Dai, Hanjun Dai, Arthur Gretton, Le Song, Dale Schuurmans, and Niao He. Kernel exponential family estimation via doubly dual embedding. ar Xiv preprint ar Xiv:1811.02228, 2018. [12] Bo Dai, Zhen Liu, Hanjun Dai, Niao He, Arthur Gretton, Le Song, and Dale Schuurmans. Exponential family estimation via adversarial dynamics embedding. ar Xiv preprint ar Xiv:1904.12083, 2019. [13] A Philip Dawid. Coherent measures of discrepancy, uncertainty and dependence, with applications to bayesian predictive experimental design. Department of Statistical Science, University College London. http://www. ucl. ac. uk/Stats/research/abs94. html, Tech. Rep, 139, 1998. [14] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. ar Xiv preprint ar Xiv:1410.8516, 2014. [15] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. ar Xiv preprint ar Xiv:1605.08803, 2016. [16] Yilun Du, Toru Lin, and Igor Mordatch. Model based planning with energy based models. ar Xiv preprint ar Xiv:1909.06878, 2019. [17] Yilun Du and Igor Mordatch. Implicit generation and modeling with energy based models. In Advances in Neural Information Processing Systems 32, pages 3603 3613, 2019. [18] Axel Finke and Alexandre H Thiery. On importance-weighted autoencoders. ar Xiv preprint ar Xiv:1907.10477, 2019. [19] Daniel Friedman. Effective scoring rules for probabilistic forecasts. Management Science, 29(4):447 454, 1983. [20] Hironori Fujisawa and Shinto Eguchi. Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99(9):2053 2081, 2008. [21] Ruiqi Gao, Erik Nijkamp, Diederik P Kingma, Zhen Xu, Andrew M Dai, and Ying Nian Wu. Flow contrastive estimation of energy-based models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7518 7528, 2020. [22] Paul H Garthwaite, Joseph B Kadane, and Anthony O Hagan. Statistical methods for eliciting probability distributions. Journal of the American Statistical Association, 100(470):680 701, 2005. [23] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881 889, 2015. [24] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341 2368, 2013. [25] Alison L Gibbs and Francis Edward Su. On choosing and bounding probability metrics. International statistical review, 70(3):419 435, 2002. [26] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359 378, 2007. [27] IJ Good. Comment on measuring information and uncertainty by robert j. buehler. Foundations of Statistical Inference, pages 337 339, 1971. [28] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672 2680, 2014. [29] Will Grathwohl, Jacob Kelly, Milad Hashemi, Mohammad Norouzi, Kevin Swersky, and David Duvenaud. No mcmc for me: Amortized sampling for fast and stable training of energy-based models. ar Xiv preprint ar Xiv:2010.04230, 2020. [30] Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. ar Xiv preprint ar Xiv:1912.03263, 2019. [31] Ulf Grenander, Michael I Miller, Michael Miller, et al. Pattern theory: from representation to inference. Oxford university press, 2007. [32] David Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548 1566, 2011. [33] Michael U Gutmann and Jun-ichiro Hirayama. Bregman divergence as general framework to estimate unnormalized statistical models. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 283 290, 2011. [34] Michael U Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The journal of machine learning research, 13(1):307 361, 2012. [35] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770 778, 2016. [36] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pages 6626 6637, 2017. [37] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771 1800, 2002. [38] Tito Homem-de Mello. On rates of convergence for stochastic optimization problems under non independent and identically distributed sampling. SIAM Journal on Optimization, 19(2):524 551, 2008. [39] Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695 709, 2005. [40] Pierre E Jacob, John O Leary, and Yves F Atchadé. Unbiased markov chain monte carlo with couplings. ar Xiv preprint ar Xiv:1708.03625, 2017. [41] Shihao Ji, SVN Vishwanathan, Nadathur Satish, Michael J Anderson, and Pradeep Dubey. Blackout: Speeding up recurrent neural network language models with very large vocabularies. ar Xiv preprint ar Xiv:1511.06909, 2015. [42] Chi Jin, Praneeth Netrapalli, and Michael I Jordan. What is local optimality in nonconvexnonconcave minimax optimization? ar Xiv preprint ar Xiv:1902.00618, 2019. [43] Takafumi Kanamori and Hironori Fujisawa. Robust estimation under heavy contamination using unnormalized models. Biometrika, 102(3):559 572, 2015. [44] Takafumi Kanamori, Hironori Fujisawa, et al. Affine invariant divergences associated with proper composite scoring rules and their applications. Bernoulli, 20(4):2278 2304, 2014. [45] Durk P Kingma and Yann L Cun. Regularized estimation of image statistics by score matching. In Advances in neural information processing systems, pages 1126 1134, 2010. [46] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215 10224, 2018. [47] Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1895 1904. JMLR. org, 2017. [48] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. [49] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097 1105, 2012. [50] Rithesh Kumar, Sherjil Ozair, Anirudh Goyal, Aaron Courville, and Yoshua Bengio. Maximum entropy generators for energy-based models. ar Xiv preprint ar Xiv:1901.08508, 2019. [51] Matthieu Labeau and Shay B Cohen. Experimenting with power divergences for language modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4095 4105, 2019. [52] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an o (1/t) convergence rate for the projected stochastic subgradient method. ar Xiv preprint ar Xiv:1212.2002, 2012. [53] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 29 37, 2011. [54] Yann Le Cun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. [55] Yann Le Cun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006. [56] Yingzhen Li and Richard E Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, pages 1073 1081, 2016. [57] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015. [58] James Martens, Ilya Sutskever, and Kevin Swersky. Estimating the hessian by back-propagating curvature. ar Xiv preprint ar Xiv:1206.6464, 2012. [59] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? ar Xiv preprint ar Xiv:1801.04406, 2018. [60] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. ar Xiv preprint ar Xiv:1802.05957, 2018. [61] Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In International Conference on Machine Learning, pages 7176 7185. PMLR, 2020. [62] Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo, 2(11):2, 2011. [63] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574 1609, 2009. [64] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013. [65] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011. [66] Erik Nijkamp, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Learning non-convergent non-persistent short-run mcmc toward energy-based model. In Advances in Neural Information Processing Systems, pages 5233 5243, 2019. [67] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pages 271 279, 2016. [68] Art B. Owen. Monte Carlo theory, methods and examples. 2013. [69] Matthew Parry, A Philip Dawid, Steffen Lauritzen, et al. Proper local scoring rules. The Annals of Statistics, 40(1):561 592, 2012. [70] Matthew Parry et al. Linear scoring rules for probabilistic binary classification. Electronic Journal of Statistics, 10(1):1596 1607, 2016. [71] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024 8035, 2019. [72] Hoifung Poon and Pedro Domingos. Sum-product networks: A new deep architecture. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 689 690. IEEE, 2011. [73] Yixuan Qiu, Lingsong Zhang, and Xiao Wang. Unbiased contrastive divergence algorithm for training energy-based latent variable models. [74] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pages 314 323, 2016. [75] Christian Robert and George Casella. Monte Carlo statistical methods. Springer Science & Business Media, 2013. [76] Thornton B Roby. Belief states and the uses of evidence. Behavioral Science, 10(3):255 270, 1965. [77] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234 2242, 2016. [78] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. [79] Jiaming Song and Stefano Ermon. Bridging the gap between f-gans and wasserstein gans. ar Xiv preprint ar Xiv:1910.09779, 2019. [80] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pages 11895 11907, 2019. [81] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. In Advances in Neural Information Processing Systems, 2020. [82] Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, page 204, 2019. [83] Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pages 574 584. PMLR, 2020. [84] Takashi Takenouchi and Takafumi Kanamori. Empirical localization of homogeneous divergences on discrete sample spaces. In Advances in Neural Information Processing Systems, pages 820 828, 2015. [85] Masatoshi Uehara, Takafumi Kanamori, Takashi Takenouchi, and Takeru Matsuda. A unified statistically efficient estimation framework for unnormalized models. In International Conference on Artificial Intelligence and Statistics, pages 809 819, 2020. [86] Tim Van Erven and Peter Harremos. Rényi divergence and kullback-leibler divergence. IEEE Transactions on Information Theory, 60(7):3797 3820, 2014. [87] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661 1674, 2011. [88] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681 688, 2011. [89] Lantao Yu, Yang Song, Jiaming Song, and Stefano Ermon. Training deep energy-based models with f-divergence minimization. In International Conference on Machine Learning, pages 10957 10967. PMLR, 2020. [90] Jun Zhang, Yaokun Lei, Yi Isaac Yang, and Yi Qin Gao. Deep learning for multi-scale molecular modeling. 2020. [91] Shengjia Zhao, Hongyu Ren, Arianna Yuan, Jiaming Song, Noah Goodman, and Stefano Ermon. Bias and generalization in deep generative models: An empirical study. In Advances in Neural Information Processing Systems, pages 10792 10801, 2018.