# Exponential Family Estimation via Adversarial Dynamics Embedding

Bo Dai¹, Zhen Liu², Hanjun Dai¹, Niao He³, Arthur Gretton⁴, Le Song⁵˒⁶, Dale Schuurmans¹˒⁷

¹Google Research, Brain Team, ²Mila, University of Montreal, ³University of Illinois at Urbana Champaign, ⁴University College London, ⁵Georgia Institute of Technology, ⁶Ant Financial, ⁷University of Alberta

We present an efficient algorithm for maximum likelihood estimation (MLE) of exponential family models, with a general parametrization of the energy function that includes neural networks. We exploit the primal-dual view of the MLE with a kinetics-augmented model to obtain an estimate associated with an adversarial dual sampler. To represent this sampler, we introduce a novel neural architecture, dynamics embedding, that generalizes Hamiltonian Monte-Carlo (HMC). The proposed approach inherits the flexibility of HMC while enabling tractable entropy estimation for the augmented model. By learning both a dual sampler and the primal model simultaneously, and sharing parameters between them, we obviate the requirement to design a separate sampling procedure once the model has been trained, leading to more effective learning. We show that many existing estimators, such as contrastive divergence, pseudo/composite-likelihood, score matching, minimum Stein discrepancy estimator, non-local contrastive objectives, noise-contrastive estimation, and minimum probability flow, are special cases of the proposed approach, each expressed by a different (fixed) dual sampler. An empirical investigation shows that adapting the sampler during MLE can significantly improve on state-of-the-art estimators.¹
∗ indicates equal contribution. Email: {bodai, hadai}@google.com, zhen.liu.2@umontreal.ca.
¹ The code repository is available at https://github.com/lzzcd001/ade-code.
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

The exponential family is one of the most important classes of distributions in statistics and machine learning, encompassing undirected graphical models (Wainwright and Jordan, 2008) and energy-based models (LeCun et al., 2006; Wu et al., 2018), which include, for example, Markov random fields (Kinderman and Snell, 1980), conditional random fields (Lafferty et al., 2001) and language models (Mnih and Teh, 2012). Despite the flexibility of this family and the many useful properties it possesses (Brown, 1986), most such distributions are intractable because the partition function does not possess an analytic form. This leads to difficulty in evaluating, sampling and learning exponential family models, hindering their application in practice. In this paper, we consider a longstanding question: Can a simple yet effective algorithm be developed for estimating general exponential family distributions?

There has been extensive prior work addressing this question. Many approaches focus on approximating maximum likelihood estimation (MLE), since it is well studied and known to possess desirable statistical properties, such as consistency, asymptotic unbiasedness, and asymptotic normality (Brown, 1986). One prominent example is contrastive divergence (CD) (Hinton, 2002) and its variants (Tieleman and Hinton, 2009; Du and Mordatch, 2019). It approximates the gradient of the log-likelihood by a stochastic estimator that uses samples generated from a few Markov chain Monte Carlo (MCMC) steps. This approach has two shortcomings: first and foremost, the stochastic gradient is biased,
which can lead to poor estimates; second, CD and its variants require careful design of the MCMC transition kernel, which can be challenging.

Given these difficulties with MLE, numerous learning criteria have been proposed to avoid the partition function. Pseudo-likelihood estimators (Besag, 1975) approximate the joint distribution by the product of conditional distributions, each of which only represents the distribution of a single random variable conditioned on the others. However, the partition function of each factor is still generally intractable. Score matching (Hyvärinen, 2005) minimizes the Fisher divergence between the empirical distribution and the model. Unfortunately, it requires third order derivatives for optimization, which becomes prohibitive for large models (Kingma and LeCun, 2010; Li et al., 2019). Noise-contrastive estimation (Gutmann and Hyvärinen, 2010) recasts the problem as ratio estimation between the target distribution and a pre-defined auxiliary distribution. However, the auxiliary distribution must cover the support of the data with an analytical expression that still allows efficient sampling; this requirement is difficult to satisfy in practice, particularly in high dimensional settings. Minimum probability flow (Sohl-Dickstein et al., 2011) exploits the observation that, ideally, the empirical distribution will be the stationary distribution of transition dynamics defined under an optimal model. The model can then be estimated by matching these two distributions. Even though this idea is inspiring, it is challenging to construct appropriate dynamics that yield efficient learning.

In this paper, we introduce a novel algorithm, Adversarial Dynamics Embedding (ADE), that directly approximates the MLE while achieving computational and statistical efficiency.
Our development starts with the primal-dual view of the MLE (Dai et al., 2019), which provides a natural objective for jointly learning both a sampler and a model, as a remedy for the expensive and biased MCMC steps in the CD algorithm. To parameterize the dual distribution, Dai et al. (2019) apply a naive transport mapping, which makes entropy estimation difficult and requires learning an extra auxiliary model, incurring additional computational and memory cost.

We overcome these shortcomings by considering a different approach, inspired by the properties of Hamiltonian Monte-Carlo (HMC) (Neal, 2011): i) HMC forms a stationary distribution with independent potential and kinetic variables; ii) HMC can approximate the exponential family arbitrarily closely. As in HMC, we consider an augmented model with latent kinetic variables in Section 3.1, and introduce a novel neural architecture in Section 3.2, called dynamics embedding, that mimics sampling and represents the dual distribution via parameters of the primal model. This approach shares with HMC the advantage of a tractable entropy function for the augmented model, while enriching the flexibility of the sampler without introducing extra parameters. In Section 3.3 we develop a max-min objective that allows the shared parameters in the primal model and dual sampler to be learned simultaneously, which improves computational and sample efficiency. We further show in Section 4 that the proposed estimator subsumes CD, pseudo-likelihood, score matching, non-local contrastive objectives, noise-contrastive estimation, and minimum probability flow as special cases with hand-designed dual samplers. Finally, in Section 5 we find that the proposed approach can outperform current state-of-the-art estimators in a series of experiments.
2 Preliminaries

Exponential family and energy-based model. The natural form of the exponential family over $\Omega \subset \mathbb{R}^d$ is defined as

$$p_{f'}(x) = \exp\left(f'(x) + \log p_0(x) - A_{p_0}(f')\right), \quad A_{p_0}(f') := \log \int_\Omega \exp\left(f'(x)\right) p_0(x)\, dx, \tag{1}$$

where $f'(x) = w^\top \phi_\varpi(x)$. The sufficient statistic $\phi_\varpi(\cdot): \Omega \to \mathbb{R}^k$ can be any general parametric model, e.g., a neural network. The $(w, \varpi)$ are the parameters to be learned from observed data. The exponential family definition (1) includes the energy-based model (LeCun et al., 2006) as a special case, by setting $f'(x) = \phi_\varpi(x)$ with $k = 1$, which has been generalized to the infinite dimensional case (Sriperumbudur et al., 2017). The $p_0(x)$ is fixed and covers the support $\Omega$, which is usually unknown in practical high-dimensional problems. Therefore, we focus on learning $f(x) = f'(x) + \log p_0(x)$ jointly with $p_0(x)$, which is more difficult: in particular, the doubly dual embedding approach (Dai et al., 2019) is no longer applicable.

Given a sample $D = [x_i]_{i=1}^N$ and denoting $f \in \mathcal{F}$ as the valid parametrization family, an exponential family model can be estimated by maximum log-likelihood, i.e.,

$$\max_{f \in \mathcal{F}} L(f) := \hat{E}_D[f(x)] - A(f), \quad A(f) = \log \int_\Omega \exp(f(x))\, dx, \tag{2}$$

with gradient $\nabla_f L(f) = \hat{E}_D[\nabla_f f(x)] - E_{p_f(x)}[\nabla_f f(x)]$. Since $A(f)$ and $E_{p_f(x)}[\nabla_f f(x)]$ are both intractable, solving the MLE for a general exponential family model is very difficult.

Dynamics-based MCMC. Dynamics-based MCMC is a general and effective tool for sampling. The idea is to represent the target distribution as the solution to a set of (stochastic) differential equations, which allows samples from the target distribution to be obtained by simulating along the dynamics defined by the differential equations. HMC (Neal, 2011) is a representative algorithm in this category, which exploits the well-known Hamiltonian dynamics. Specifically, given a target distribution $p_f(x) \propto \exp(f(x))$, the Hamiltonian is defined as $H(x, v) = -f(x) + k(v)$, where $k(v) = \frac{1}{2} v^\top v$ is the kinetic energy.
The Hamiltonian dynamics generate $(x, v)$ over time $t$ by following

$$\left[\frac{dx}{dt}, \frac{dv}{dt}\right] = \left[\partial_v H(x, v),\, -\partial_x H(x, v)\right] = \left[v,\, \nabla_x f(x)\right]. \tag{3}$$

Asymptotically as $t \to \infty$, $x$ visits the underlying space according to the target distribution. In practice, to reduce discretization error, an acceptance-rejection step is introduced. The finite-step dynamics-based MCMC sampler can be used for approximating $E_{p_f(x)}[\nabla_f f(x)]$ in $\nabla_f L(f)$, which leads to the CD algorithm (Hinton, 2002; Zhu and Mumford, 1998).

Primal-dual view of MLE. The Fenchel duality of $A(f)$ has been exploited (Rockafellar, 1970; Wainwright and Jordan, 2008; Dai et al., 2019) as another way to address the intractability of the log-partition function.

Theorem 1 (Fenchel dual of log-partition (Wainwright and Jordan, 2008)) Let $H(q) := -\int_\Omega q(x) \log q(x)\, dx$. Then:

$$A(f) = \max_{q \in \mathcal{P}} \langle q(x), f(x) \rangle + H(q), \quad p_f(x) = \operatorname{argmax}_{q \in \mathcal{P}} \langle q(x), f(x) \rangle + H(q), \tag{4}$$

where $\mathcal{P}$ denotes the space of distributions and $\langle f, g \rangle = \int_\Omega f(x) g(x)\, dx$.

Plugging the Fenchel dual of $A(f)$ into the MLE (2), we arrive at a max-min reformulation

$$\max_{f \in \mathcal{F}} \min_{q \in \mathcal{P}} \hat{E}_D[f(x)] - E_{q(x)}[f(x)] - H(q), \tag{5}$$

which bypasses the explicit computation of the partition function. Another byproduct of the primal-dual view is that the dual distribution can be used for inference; in vanilla estimators, however, this usually requires expensive sampling algorithms.

The dual sampler $q(\cdot)$ plays a vital role in the primal-dual formulation of the MLE in (5). To achieve better performance, we have several principal requirements in parameterizing the dual distribution: i) the parametrization family needs to be flexible enough to achieve small error in solving the inner minimization problem; ii) the entropy of the parametrized dual distribution should be tractable. Moreover, as shown in (4) in Theorem 1, the optimal dual sampler $q(\cdot)$ is determined by the primal potential function $f(\cdot)$. This leads to the third requirement: iii) the parametrized dual sampler should explicitly incorporate the primal model $f$.
Such a dependence can potentially reduce both the memory and learning sample complexity. A variety of techniques have been developed for distribution parameterization, such as reparametrized latent variable models (Kingma and Welling, 2014; Rezende et al., 2014), transport mapping (Goodfellow et al., 2014), and normalizing flow (Rezende and Mohamed, 2015; Dinh et al., 2017; Kingma et al., 2016). However, none of these satisfies the requirements of flexibility and a tractable density simultaneously, nor do they offer a principled way to couple the parameters of the dual sampler with the primal model.

3 Adversarial Dynamics Embedding

By augmenting the original exponential family with kinetic variables, we can parametrize the dual sampler with a dynamics embedding that satisfies all three requirements without affecting the MLE, allowing the primal potential function and dual sampler to both be trained adversarially. We start with the embedding of classical Hamiltonian dynamics (Neal, 2011; Caterini et al., 2018) for the dual sampler parametrization, as a concrete example, then discuss its generalization in latent space and the stochastic Langevin dynamics embedding. This technique is extended to other dynamics, with their own advantages, in Appendix B.

3.1 Primal-Dual View of Augmented MLE

As noted, it is difficult to find a parametrization of $q(x)$ in (5) that simultaneously satisfies all three requirements. Therefore, instead of directly tackling (5) in the original model, and inspired by HMC, we consider the augmented exponential family $p(x, v)$ with an auxiliary momentum variable, i.e.,

$$p(x, v) = \frac{\exp\left(f(x) - \frac{\lambda}{2} v^\top v\right)}{Z(f)}, \quad Z(f) = \int \exp\left(f(x) - \frac{\lambda}{2} v^\top v\right) dx\, dv. \tag{6}$$

The MLE of such a model can be formulated as

$$\max_f L(f) := \hat{E}_{x \sim D}\left[\log \int p(x, v)\, dv\right] = \hat{E}_{x \sim D} E_{p(v|x)}\left[f(x) - \frac{\lambda}{2} v^\top v - \log p(v|x)\right] - \log Z(f), \tag{7}$$

where the last equation comes from the true posterior $p(v|x) = \mathcal{N}(0, \lambda^{-1} I)$ due to the independence of $x$ and $v$. This independence also induces the equivalent MLE as proved in Appendix A.
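To make the independence argument concrete, the following short calculation sketches why integrating out the momentum leaves the likelihood of $x$ unchanged. It assumes the augmented density has the factorized form $p(x, v) \propto \exp(f(x) - \frac{\lambda}{2} v^\top v)$ with $v \in \mathbb{R}^d$ and $\lambda > 0$, which is our reading of (6). It is an illustration only, not the proof given in Appendix A.

```latex
\begin{align*}
Z(f) &= \int\!\!\int \exp\!\Big(f(x) - \tfrac{\lambda}{2} v^\top v\Big)\, dv\, dx
      = \Big(\tfrac{2\pi}{\lambda}\Big)^{d/2} \int \exp\big(f(x)\big)\, dx, \\
\int p(x, v)\, dv
     &= \frac{\exp\big(f(x)\big)\, \big(\tfrac{2\pi}{\lambda}\big)^{d/2}}{Z(f)}
      = \exp\big(f(x) - A(f)\big) = p_f(x), \\
p(v \mid x) &= \frac{p(x, v)}{\int p(x, v)\, dv}
      = \mathcal{N}\big(v;\, 0,\, \lambda^{-1} I\big).
\end{align*}
```

Because the Gaussian normalizer cancels between numerator and denominator, the marginal of the augmented model is exactly $p_f(x)$, so the average log-likelihood of the data under the augmented model coincides with $L(f)$ in (2).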
Theorem 2 (Equivalent MLE) The MLE of the augmented model is the same as the original MLE.

Applying the Fenchel dual to $\log Z(f)$ of the augmented model (6), we derive a primal-dual formulation of (7), leading to the objective

$$L(f) \propto \min_{q(x, v) \in \mathcal{P}} \hat{E}_{x \sim D}[f(x)] - E_{q(x, v)}\left[f(x) - \frac{\lambda}{2} v^\top v - \log q(x, v)\right]. \tag{8}$$

The $q(x, v)$ in (8) contains the momentum $v$ as a latent variable. One can also exploit the latent variable model $q(x) = \int q(x|v) q(v)\, dv$ in (5). However, the $H(q)$ in (5) then requires marginalization, which is intractable in general and is usually estimated through variational inference with the introduction of an extra posterior model $q(v|x)$. By considering the specifically designed augmented model instead, (8) eliminates these extra variational steps.

Similarly, one can consider the latent variable augmented model with multiple momenta, i.e.,

$$p\left(x, \{v^i\}_{i=1}^T\right) = \frac{\exp\left(f(x) - \sum_{i=1}^T \frac{\lambda}{2} \|v^i\|^2\right)}{Z(f)},$$

leading to the optimization

$$L(f) \propto \min_{q(x, \{v^i\}_{i=1}^T) \in \mathcal{P}} \hat{E}_{x \sim D}[f(x)] - E_{q(x, \{v^i\}_{i=1}^T)}\left[f(x) - \sum_{i=1}^T \frac{\lambda}{2} \|v^i\|^2 - \log q\left(x, \{v^i\}_{i=1}^T\right)\right]. \tag{9}$$

3.2 Representing Dual Sampler via Primal Model

We now introduce the Hamiltonian dynamics embedding to represent the dual sampler $q(\cdot)$, as well as its generalization and special instantiation that satisfy all three of the principal requirements.

The vanilla HMC is derived by discretizing the Hamiltonian dynamics (3) with a leapfrog integrator. Specifically, in a single time step, the sample $(x, v)$ moves towards $(x', v')$ according to

$$(x', v') = L_{f,\eta}(x, v) := \begin{cases} v^{1/2} = v + \frac{\eta}{2} \nabla_x f(x) \\ x' = x + \eta\, v^{1/2} \\ v' = v^{1/2} + \frac{\eta}{2} \nabla_x f(x') \end{cases} \tag{10}$$

where $\eta$ is defined as the leapfrog stepsize. Let us denote the one-step leapfrog as $(x', v') = L_{f,\eta}(x, v)$ and assume $(x^0, v^0) \sim q_\theta^0(x, v)$. After $T$ iterations, we obtain

$$\left(x^T, v^T\right) = L_{f,\eta} \circ L_{f,\eta} \circ \ldots \circ L_{f,\eta}\left(x^0, v^0\right). \tag{11}$$

Note that this can be viewed as a neural network with a special architecture, which we term Hamiltonian (HMC) dynamics embedding. Such a representation explicitly characterizes the dual sampler by the primal model, i.e., the potential function $f$, meeting the dependence requirement.
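The unrolled leapfrog network can be sketched in a few lines. The example below is a minimal one-dimensional illustration, not the released implementation: it assumes the potential $f(x) = -x^2/2$ (a standard Gaussian target) and $\lambda = 1$, and every function name is hypothetical. It also checks numerically two properties the later density evaluation relies on, namely that the $T$-step map has unit Jacobian determinant and is reversible.

```python
# Minimal sketch of a T-step leapfrog (HMC) embedding in one dimension.
# Assumptions (not from the paper): f(x) = -x^2 / 2, lambda = 1, and all
# function names below are hypothetical.


def grad_f(x):
    """Gradient of the illustrative potential f(x) = -x^2 / 2."""
    return -x


def hamiltonian(x, v):
    """H(x, v) = -f(x) + v^2 / 2 for the illustrative potential."""
    return 0.5 * x * x + 0.5 * v * v


def leapfrog(x, v, eta):
    """One leapfrog step with stepsize eta: half kick, drift, half kick."""
    v_half = v + 0.5 * eta * grad_f(x)
    x_new = x + eta * v_half
    v_new = v_half + 0.5 * eta * grad_f(x_new)
    return x_new, v_new


def hmc_embedding(x0, v0, eta, num_steps):
    """Compose num_steps leapfrog layers, starting from (x0, v0)."""
    x, v = x0, v0
    for _ in range(num_steps):
        x, v = leapfrog(x, v, eta)
    return x, v


def jacobian_det(x0, v0, eta, num_steps, eps=1e-5):
    """Central finite-difference determinant of d(x_T, v_T) / d(x_0, v_0)."""
    xp, vp = hmc_embedding(x0 + eps, v0, eta, num_steps)
    xm, vm = hmc_embedding(x0 - eps, v0, eta, num_steps)
    dx_dx, dv_dx = (xp - xm) / (2 * eps), (vp - vm) / (2 * eps)
    xp, vp = hmc_embedding(x0, v0 + eps, eta, num_steps)
    xm, vm = hmc_embedding(x0, v0 - eps, eta, num_steps)
    dx_dv, dv_dv = (xp - xm) / (2 * eps), (vp - vm) / (2 * eps)
    return dx_dx * dv_dv - dx_dv * dv_dx


if __name__ == "__main__":
    x0, v0, eta, T = 0.3, -0.7, 0.1, 10
    xT, vT = hmc_embedding(x0, v0, eta, T)
    print("end point:", xT, vT)
    print("Jacobian determinant:", jacobian_det(x0, v0, eta, T))
    print("energy drift:", hamiltonian(xT, vT) - hamiltonian(x0, v0))
```

Since each of the three sub-updates is a shear in $(x, v)$, the composed map preserves volume regardless of $f$; the Gaussian potential is used here only to keep the sketch short.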
The flexibility of the distributions represented by the HMC embedding is ensured by the nature of dynamics-based samplers. In the limiting case, the proposed neural network (11) reduces to a gradient flow, whose stationary distribution is exactly the model distribution:

$$p(x, v) = \operatorname{argmax}_{q(x, v) \in \mathcal{P}} E_{q(x, v)}\left[f(x) - \frac{\lambda}{2} v^\top v - \log q(x, v)\right].$$

The approximation strength of the HMC embedding is formally justified as follows:

Theorem 3 (HMC embeddings as gradient flow) In continuous time, i.e., with infinitesimal stepsize $\eta \to 0$, the density of particles $(x^t, v^t)$, denoted $q^t(x, v)$, follows the Fokker-Planck equation

$$\frac{\partial q^t(x, v)}{\partial t} = -\nabla \cdot \left(q^t(x, v)\, G \nabla H(x, v)\right), \tag{12}$$

with $G = \begin{bmatrix} 0 & I \\ -I & 0 \end{bmatrix}$, which has a stationary distribution $p(x, v) \propto \exp(-H(x, v))$ with the marginal distribution $p(x) \propto \exp(f(x))$.

Details of the proofs are given in Appendix A. Note that this stationary distribution result is an instance of the more general dynamics described in Ma et al. (2015), showing the flexibility of the induced distributions. As demonstrated in Theorem 3, the neural parametrization formed by the HMC embedding is able to approximate an exponential family distribution on continuous variables well.

Remark (Generalized HMC dynamics in latent space) The leapfrog operation in vanilla HMC works directly in the original observation space, which could be high-dimensional and noisy. We generalize the leapfrog update rule to the latent space and form a new dynamics as follows,

$$(x', v') = L_{f,\eta,S,g}(x, v) := \begin{cases} v^{1/2} = v \odot \exp\left(S_v(\nabla_x f(x), x)\right) + \frac{\eta}{2} g_v(\nabla_x f(x), x) \\ x' = x \odot \exp\left(S_x(v^{1/2})\right) + \eta\, g_x(v^{1/2}) \\ v' = v^{1/2} \odot \exp\left(S_v(\nabla_x f(x'), x')\right) + \frac{\eta}{2} g_v(\nabla_x f(x'), x') \end{cases} \tag{13}$$

where $v \in \mathbb{R}^l$ denotes the momentum evolving space and $\odot$ denotes the element-wise product. Specifically, the terms $S_v(\nabla_x f(x), x)$ and $S_x$ rescale $v$ and $x$ coordinatewise. The term $g_v(\nabla_x f(x), x) \in \mathbb{R}^l$ can be understood as projecting the gradient information to the essential latent space where the momentum is evolving. Then, for updating $x$, the latent momentum is projected back to the original space via $g_x: \mathbb{R}^l \to \Omega$.
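With these generalized leapfrog updates, the dynamical system avoids operating in the high-dimensional noisy input space, and becomes more computationally efficient. We emphasize that the proposed generalized leapfrog parametrization (13) is different from the one used in Levy et al. (2018), which is inspired by the real-NVP flow (Dinh et al., 2017).

By the generalized HMC embedding (13), we have a flexible layer $(x', v') = L_{f,\eta,S,g}(x, v)$, where $(S_v, S_x, g_v, g_x)$ will be learned in addition to the stepsize. The classic HMC layer $L_{f,\eta,M}(x, v)$ is a special case of $L_{f,\eta,S,g}(x, v)$, obtained by setting $(S_v, S_x)$ to zero and $(g_v, g_x)$ to identity functions.

Remark (Stochastic Langevin dynamics) The stochastic Langevin dynamics can also be recovered from the leapfrog step by resampling the momentum in every step. Specifically, the sample $(x, \xi)$ moves according to

$$(x', v') = L_{f,\eta}^{\xi}(x) := \begin{cases} v' = \xi + \frac{\eta}{2} \nabla_x f(x) \\ x' = x + \eta\, v' \end{cases}, \quad \text{with } \xi \sim q_\theta(\xi). \tag{14}$$

Hence, stochastic Langevin dynamics resample $\xi$ to replace the momentum in leapfrog (10), ignoring the accumulated gradients. By unfolding $T$ updates, we obtain

$$\left(x^T, \{v^i\}_{i=1}^T\right) = L_{f,\eta}^{\xi^{T-1}} \circ \ldots \circ L_{f,\eta}^{\xi^0}\left(x^0\right) \tag{15}$$

as the derived neural network. Similarly, we can also generalize the stochastic Langevin updates $L_{f,\eta}^{\xi}$ to a low-dimensional latent space by introducing $g_v(\nabla_x f(x), x)$ and $g_x(v')$ correspondingly.

One of the major advantages of the proposed distribution parametrization is that its density value is also tractable, leading to tractable entropy estimation in (8) and (9). In particular, we have the following.

Theorem 4 (Density value evaluation) If $(x^0, v^0) \sim q_\theta^0(x, v)$, then after $T$ vanilla HMC steps (10),

$$q^T\left(x^T, v^T\right) = q_\theta^0\left(x^0, v^0\right). \tag{16}$$

For $(x^T, v^T)$ from the generalized leapfrog steps (13), we have

$$q^T\left(x^T, v^T\right) = q_\theta^0\left(x^0, v^0\right) \prod_{t=1}^T \left(\Delta_x(x^t)\, \Delta_v(v^t)\right), \tag{17}$$

where $\Delta_x(x^t)$ and $\Delta_v(v^t)$ denote

$$\Delta_x(x^t) = \left\lvert \det\left(\mathrm{diag}\left(\exp\left(2 S_v(\nabla_x f(x^t), x^t)\right)\right)\right) \right\rvert, \quad \Delta_v(v^t) = \left\lvert \det\left(\mathrm{diag}\left(\exp\left(S_x(v^{1/2})\right)\right)\right) \right\rvert. \tag{18}$$

For $\left(x^T, \{v^i\}_{i=1}^T\right)$ from the Langevin dynamics (14) with $\left(x^0, \{\xi^i\}_{i=0}^{T-1}\right) \sim q_\theta^0(x, \xi) \prod_{i=1}^{T-1} q_\theta^i(\xi^i)$, we have

$$q^T\left(x^T, \{v^i\}_{i=1}^T\right) = q_\theta^0\left(x^0, \xi^0\right) \prod_{i=1}^{T-1} q_\theta^i\left(\xi^i\right). \tag{19}$$

The proof of Theorem 4 can be found in Appendix A.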
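The Langevin case of the density evaluation can be illustrated with a small sketch. The code below is a one-dimensional toy under stated assumptions, not the authors' implementation: the potential is $f(x) = -x^2/2$, the initial distribution of $x^0$ and every noise distribution are standard normal and mutually independent, and all names are hypothetical. Each update is a pair of unit-Jacobian shears on the extended space, so the joint log-density of $(x^T, \{v^i\})$ is just the log-density of the inputs $(x^0, \{\xi^i\})$; the inverse map makes this bijection explicit.

```python
# Minimal sketch of a T-step stochastic Langevin embedding in one dimension,
# with the joint log-density bookkeeping suggested by Theorem 4.
# Assumptions (not from the paper): f(x) = -x^2 / 2, a standard normal
# initial distribution for x0, standard normal noise xi, and all names below.
import math
import random


def grad_f(x):
    """Gradient of the illustrative potential f(x) = -x^2 / 2."""
    return -x


def log_std_normal(z):
    """Log-density of a one-dimensional standard normal at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def langevin_step(x, xi, eta):
    """v' = xi + (eta / 2) * grad f(x);  x' = x + eta * v'."""
    v_new = xi + 0.5 * eta * grad_f(x)
    x_new = x + eta * v_new
    return x_new, v_new


def langevin_embedding(x0, noises, eta):
    """Unroll one step per noise draw; return x_T, momenta, joint log-density."""
    x = x0
    momenta = []
    log_q = log_std_normal(x0)
    for xi in noises:
        x, v = langevin_step(x, xi, eta)
        momenta.append(v)
        # Each step is volume preserving, so only the input density enters.
        log_q += log_std_normal(xi)
    return x, momenta, log_q


def invert_embedding(x_T, momenta, eta):
    """Recover (x0, noises) from (x_T, momenta) by undoing each step."""
    x = x_T
    noises = []
    for v in reversed(momenta):
        x = x - eta * v
        noises.append(v - 0.5 * eta * grad_f(x))
    noises.reverse()
    return x, noises


if __name__ == "__main__":
    random.seed(0)
    draws = [random.gauss(0.0, 1.0) for _ in range(5)]
    x_T, vs, log_q = langevin_embedding(0.4, draws, eta=0.1)
    print("x_T:", x_T, "joint log-density:", log_q)
```

Because the log-density is available in closed form along the unrolled path, a Monte Carlo estimate of the entropy term only needs the average of `-log_q` over sampled paths, which is the tractability the dual objective relies on.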
The proposed dynamics embedding satisfies all three requirements: it defines a flexible family of distributions with computable entropy, and it couples the learning of the dual sampler with the primal model, leading to memory and sample efficient learning algorithms, as we introduce in the next section.

3.3 Coupled Model and Sampler Learning

By plugging the $T$-step Hamiltonian dynamics embedding (10) into the primal-dual MLE of the augmented model (8) and applying the density value evaluation (16), we obtain the proposed optimization, which learns the primal potential $f$ and the dual sampler adversarially,

$$\max_{f \in \mathcal{F}} \min_{\Theta} \ell(f, \Theta) := \hat{E}_D[f] - E_{(x^0, v^0) \sim q_\theta^0(x, v)}\left[f\left(x^T\right) - \frac{\lambda}{2} \left\|v^T\right\|^2\right] - H\left(q_\theta^0\right). \tag{20}$$

Here $\Theta$ denotes the learnable components in the dynamics embedding, e.g., the initialization $q_\theta^0$, the stepsize $\eta$ in the HMC/Langevin updates, and the adaptive part $(S_v, S_x, g_v, g_x)$ in the generalized HMC. The parametrization of the initial distribution is discussed in Appendix C. Compared to the optimization in GANs (Goodfellow et al., 2014; Arjovsky et al., 2017; Dai et al., 2017), besides the reversal of min-max in (20), the major difference is that our generator (the dual sampler) shares parameters with the discriminator (the primal potential function). In our formulation, the updates of the potential function automatically push the generator toward the target distribution, thus accelerating learning. Meanwhile, the tunable parameters in the dynamics embedding are learned adversarially, further promoting the efficiency of the dual sampler. These benefits will be empirically demonstrated in Section 5.

A similar optimization can be derived for the generalized HMC (13) with density (17). For the $T$-step stochastic Langevin dynamics embedding (14), we apply the density value (19) to (9), which also leads to a max-min optimization with multiple momenta.

Algorithm 1 MLE via Adversarial Dynamics Embedding (ADE)
1: Initialize $\Theta_1$ randomly, set length of steps $T$.
2: for iteration $k = 1, \ldots, K$ do
3:   Sample mini-batch $\{x_i\}_{i=1}^m$ from dataset $D$ and $\{x_i^0, v_i^0\}_{i=1}^m$ from $q_\theta^0(x, v)$.
4:   for iteration $t = 1, \ldots, T$ do
5:     Compute $(x^t, v^t) = L\left(x^{t-1}, v^{t-1}\right)$ for each pair of $\{x_i^0, v_i^0\}_{i=1}^m$.
6:   end for
7:   [Learning the sampler] $\Theta_{k+1} = \Theta_k - \gamma_k \hat{\nabla}_\Theta \ell(f_k; \Theta_k)$.
8:   [Estimating the exponential family] $f_{k+1} = f_k + \gamma_k \hat{\nabla}_f \ell(f_k; \Theta_k)$.
9: end for

We use stochastic gradient descent to estimate $f$ for the exponential families as well as the parameters $\Theta$ of the dynamics embedding adversarially. Note that since the generated sample $(x_f^T, v_f^T)$ depends on $f$, the gradient w.r.t. $f$ should also take these variables into account via back-propagation through time (BPTT), i.e.,

$$\nabla_f \ell(f; \Theta) = \hat{E}_D[\nabla_f f(x)] - E_{q^0}\left[\nabla_f f\left(x^T\right) + \nabla_x f\left(x^T\right)^\top \nabla_f x^T - \lambda \left(v^T\right)^\top \nabla_f v^T\right]. \tag{21}$$

We illustrate the MLE via HMC adversarial dynamics embedding in Algorithm 1. The same technique can be applied to dual samplers parametrized by alternative dynamics embeddings, as in Appendix B. Considering the dynamics embedding as an adaptive sampler that automatically adapts to different models and datasets, the updates for $\Theta$ can be understood as learning to sample.

4 Related Work

Table 1: (Fixed) dual samplers used in alternative estimators. We denote $p_D$ as the empirical data distribution, $x_{-i}$ as $x$ without the $i$-th coordinate, $p_n$ as the prefixed noise distribution, $T_f(x'|x)$ as the HMC/Langevin transition kernel, $T_{D,f}(x)$ as the Stein variational gradient descent, and $A(x, x')$ as the acceptance ratio.
| Estimator | Dual sampler $q(x)$ |
|---|---|
| CD | $[\ldots]\; A(x_i, x_{i-1})\, p_D(x_0)\, dx^{T-1}$ |
| SM | $\int T_f(x' \mid x)\, p_D(x)\, dx$ |
| DSKD | $x' = T_{D,f}(x)$ |
| PL | $q(x) = \frac{1}{d} \sum_{i=1}^{d} p_f(x_i \mid x_{-i})\, p_D(x_{-i})$ |
| CL | $q(x) = \frac{1}{m} \sum_{i=1}^{m} p_f(x_{A_i} \mid x_{-A_i})\, p_D(x_{-A_i})$, with $\bigcup \{A_i\}_{i=1}^m = [d]$ and $A_i \cap A_j = \emptyset$ |
| NLCO | $\sum_{i=1}^{m} \int p_{(f,i)}(x)\, p(S_i \mid x')\, p_D(x')\, dx'$, where $p_{(f,i)}(x) = \frac{\exp(f(x))}{Z_i(f)}$, $x \in S_i$ |
| MPF | $\int T_f(x' \mid x) \exp\left(\frac{1}{2}\left(f(x') - f(x)\right)\right) p_D(x)\, dx$ |
| NCE | $[\ldots]\; \frac{\exp(f(x))}{\exp(f(x)) + p_n(x)}$ |

Connections to other estimators. The primal-dual view of the MLE also allows us to establish connections between the proposed estimator, adversarial dynamics embedding (ADE), and existing approaches, including contrastive divergence (Hinton, 2002), pseudo-likelihood (PL) (Besag, 1975), conditional composite likelihood (CL) (Lindsay, 1988), score matching (SM) (Hyvärinen, 2005), minimum (diffusion) Stein kernel discrepancy estimator (DSKD) (Barp et al., 2019), non-local contrastive objectives (NLCO) (Vickrey et al., 2010), minimum probability flow (MPF) (Sohl-Dickstein et al., 2011), and noise-contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010). As summarized in Table 1, these existing estimators can be recast as special cases of ADE, by replacing the adaptive dual sampler with hand-designed samplers, which can lead to extra error and inferior solutions. Appendix D gives detailed derivations of the connections.

Exploiting deep models for energy-based model estimation has been investigated in Kim and Bengio (2016); Dai et al. (2017); Liu and Wang (2017); Dai et al. (2019). However, the parametrization of the dual sampler should be both flexible and tractable to achieve better performance, and existing work is limited in one aspect or another. Kim and Bengio (2016) parameterized the sampler via a deep directed graphical model, whose approximation ability is limited and whose entropy is intractable. Dai et al. (2017) proposed algorithms relying either on a heuristic approximation or a lower bound of the entropy, and requiring learning an extra auxiliary component besides the dual sampler. Dai et al.
(2019) applied the Fenchel dual representation twice to reformulate the entropy term, but the algorithm requires knowing a proposal distribution with the same support, which is impractical for high-dimensional data. By contrast, ADE achieves both sufficient flexibility and tractability by exploiting the augmented model and a novel parametrization within the primal-dual view.

Learning to sample. ADE also shares some similarity with meta learning for sampling (Levy et al., 2018; Feng et al., 2017; Song et al., 2017; Gong et al., 2019), where the sampler is parametrized via a neural network and learned through certain objectives. The most significant difference lies in the ultimate goal: we focus on exponential family model estimation, where the learned sampler assists with this objective. By contrast, learning-to-sample techniques target a sampler for a fixed model. This fundamentally distinguishes ADE from methods that only learn samplers. Moreover, ADE exploits an augmented model that yields tractable entropy estimation, which has not been fully investigated in previous literature.

5 Experiments

In this section, we test ADE on several synthetic datasets in Section 5.1 and real-world image datasets in Section 5.2. The details of each experiment setting can be found in Appendix F.

5.1 Synthetic experiments

We compare ADE with SM, CD, and primal-dual MLE with the normalizing planar flow (Rezende and Mohamed, 2015) sampler (NF) to investigate the claimed benefits. SM, CD and primal-dual with NF can be viewed as special cases of our method, with either a fixed sampler or a restricted parametrization of $q_\theta$. Thus, this also serves as an ablation study of ADE to verify the significance of its different subcomponents. We keep the model sizes the same in NF and ADE (10 planar layers). Then we perform 5 stochastic Langevin steps to obtain the final samples $x^T$, with standard Gaussian noise in each step and without incurring extra memory cost.
For fairness, we conduct CD with 15 steps. This setup is preferable to CD with an extra acceptance-rejection step. We emphasize that, in contrast to SM and CD, ADE learns the sampler and exploits the gradients through the sampler. In comparison to primal-dual with NF, dynamics embedding achieves more flexibility without introducing extra parameters. Complete experiment details are given in Appendix F.1.

Table 2: Comparison on synthetic data using maximum mean discrepancy (MMD $\times$ 1e-3).

| Dataset | SM | NF | CD-15 | ADE |
|---|---|---|---|---|
| 2spirals | 5.09 | 0.69 | -0.45 | -0.61 |
| Banana | 8.10 | 0.88 | -0.31 | -0.99 |
| circles | 4.90 | 0.76 | -0.83 | -1.13 |
| cos | 10.36 | 0.91 | 7.15 | -0.55 |
| Cosine | 8.34 | 2.15 | 0.78 | -1.09 |
| Funnel | 13.07 | -0.92 | -0.38 | -0.75 |
| swissroll | 19.93 | 1.97 | 0.20 | -0.36 |
| line | 10.28 | 0.39 | 10.5 | -1.30 |
| moons | 41.34 | 0.80 | 2.21 | -1.10 |
| Multiring | 2.01 | 0.30 | -0.38 | -1.02 |
| pinwheel | 18.41 | 3.01 | -1.03 | -0.95 |
| Ring | 9.22 | 161.89 | 0.12 | -0.91 |
| Spiral | 9.48 | 5.96 | -0.41 | -0.81 |
| Uniform | 5.88 | 0.00 | -1.17 | -0.94 |

In Figure 1, we visualize the learned distribution using both the learned dual sampler and the unnormalized exponential model on several synthetic datasets. Overall, the sampler almost perfectly recovers the distribution, and the learned $f$ captures the landscape of the distribution. We also plot the convergence behavior in Figure 2. We observe that the samples converge smoothly to the true data distribution. As the learned sampler depends on $f$, this figure also indirectly suggests good convergence behavior for $f$. More results for the learned models can be found in Figure 5 in Appendix G.

Figure 1: Learned samplers on different synthetic datasets (first row) and the learned potential functions $f$ (second row). Panels: (a) 2spirals, (b) Cosine, (c) moons, (d) Multiring, (e) pinwheel, (f) Spiral. Training data and ADE samples are shown with different markers.

Figure 2: Convergence behavior of the sampler on the moons, Multiring, and pinwheel synthetic datasets.

A quantitative comparison in terms of the MMD (Gretton et al., 2012) of the samplers is in Table 2. To compute the MMD for NF and ADE, we use 1,000 samples from their sampler with a Gaussian kernel. The kernel bandwidth is chosen using the median trick (Dai et al., 2016). For SM, since there is no such sampler available, we use vanilla HMC to get samples from the learned model $f$, and use them to estimate the MMD as in Dai et al. (2019). As we can see from Table 2, ADE obtains the best MMD in most cases, which demonstrates the flexibility of dynamics embedding compared to normalizing flow, and the effectiveness of adversarial training compared to SM and CD.

We also investigate the parameter recovery of ADE on multivariate Gaussians of different dimensions, where we know the potential functions. The empirical results can be found in Table 5 in Appendix G. In this simple task, SM is proven to be consistent and to yield the same estimator as MLE (Hyvärinen, 2005). The objective of ADE can be non-convex due to the learning of the sampler parametrization, so it loses these theoretical guarantees and incurs extra cost. Nevertheless, ADE still achieves comparable performance.

5.2 Real-world Image Datasets

We apply ADE to MNIST and CIFAR-10 data. In both cases, we use a CNN architecture for the discriminator, following Miyato et al. (2018), with spectral normalization added to the discriminator layers. In particular, for the discriminator in the CIFAR-10 experiments, we replace all downsampling operations by average pooling, as in Du and Mordatch (2019). We parametrize the initial distribution $p_0(x, v)$ with a deep Gaussian latent variable model (Deep LVM), specified in Appendix C. The output sample is clipped to $[0, 1]$ after each HMC step and after the Deep LVM initialization. The detailed architectures and experimental configurations are described in Appendix F.2.

Table 3: Inception scores of different models on CIFAR-10 (unconditional).
| Model | Inception Score |
|---|---|
| WGAN-GP (Gulrajani et al., 2017) | 6.50 |
| Spectral GAN (Miyato et al., 2018) | 7.42 |
| Langevin PCD (Du and Mordatch, 2019) | 6.02 |
| Langevin PCD (10 ensemble) (Du and Mordatch, 2019) | 6.78 |
| ADE: Deep LVM init w/o HMC | 7.26 |
| ADE: Deep LVM init w/ HMC | 7.55 |

Figure 3: The generated images on MNIST and CIFAR-10 and the comparison between energies of generated samples and real images. Panels: (a) Samples on MNIST, (b) Histogram on MNIST, (c) Samples on CIFAR-10, (d) Histogram on CIFAR-10. The blue histogram illustrates the distribution of $f(x)$ on generated samples, and the orange histogram is generated by $f(x)$ on testing samples. As we can see, the learned potential function $f(x)$ matches the empirical dataset well.

We report the inception scores in Table 3. For ADE, we train with Deep LVM as the initial $q^0$, with and without HMC steps, as an ablation study. The HMC embedding improves on the samples generated by the initial $q^0$ alone, raising the inception score from 7.26 to 7.55. The proposed ADE not only achieves better performance than the fixed Langevin PCD for energy-based models reported in Du and Mordatch (2019), but also enables the generator to outperform the Spectral GAN.

We show some of the generated images in Figure 3(a) and (c); additional sampled images can be found in Figures 6 and 7 in Appendix G. We also plot the potential distribution (unnormalized) of the generated samples and that of the real images for MNIST and CIFAR-10 (using 1000 data points for each) in Figure 3(b) and (d). The energy distributions of the generated and real images overlap significantly, suggesting that the obtained energy functions have learned the desired distributions.

Figure 4: Image completion with the ADE learned model and sampler on MNIST.

Since ADE learns an energy-based model, the learned model and sampler can also be used for image completion. To further illustrate the versatility of ADE, we provide several image completions on MNIST in Figure 4.
In this experiment, we estimate the model with ADE on fully observed images. For the input images, we mask the lower half with uniform noise. To complete the corrupted images, we run the learned dual sampler steps to update the lower half of each image while keeping the upper half fixed. We visualize the output from each of the 20 HMC runs in Figure 4. Further details are given in Appendix F.2.

6 Conclusion

We proposed Adversarial Dynamics Embedding (ADE) to efficiently perform MLE with general exponential families. In particular, by utilizing the primal-dual formulation of the MLE for an augmented distribution with auxiliary kinetic variables, we incorporate the parametrization of the dual sampler into the estimation process in a fully differentiable way. This approach allows parameters to be shared between the primal and the dual, achieving better estimation quality and inference efficiency. We also established the connection between ADE and existing estimators. Our empirical results on both synthetic and real data illustrate the advantages of the proposed approach.

Acknowledgments

We thank David Duvenaud, Arnaud Doucet, George Tucker and the Google Brain team, as well as the anonymous reviewers, for their insightful comments and suggestions. L.S. was supported in part by NSF grants CDS&E-1900017 D3SC, CCF-1836936 FMitF, IIS-1841351, CAREER IIS-1350983.

References

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. In International Conference on Machine Learning, 2017.

Alessandro Barp, Francois-Xavier Briol, Andrew B. Duncan, Mark Girolami, and Lester Mackey. Minimum Stein discrepancy estimators. arXiv preprint arXiv:1906.08283, 2019.

Dimitri Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.

Julian Besag. Statistical analysis of non-lattice data. The Statistician, 24:179–195, 1975.

Christos Boutsidis, Petros Drineas, Prabhanjan Kambadur, Eugenia-Maria Kontopoulou, and Anastasios Zouzias.
A randomized algorithm for approximating the log determinant of a symmetric positive definite matrix. Linear Algebra and its Applications, 533:95–117, 2017.

Lawrence D. Brown. Fundamentals of Statistical Exponential Families, volume 9 of Lecture Notes-Monograph Series. Institute of Mathematical Statistics, Hayward, Calif, 1986.

Anthony L Caterini, Arnaud Doucet, and Dino Sejdinovic. Hamiltonian variational auto-encoder. In Advances in Neural Information Processing Systems, 2018.

Bo Dai, Niao He, Hanjun Dai, and Le Song. Provable Bayesian inference via particle mirror descent. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 985–994, 2016.

Bo Dai, Hanjun Dai, Niao He, Weiyang Liu, Zhen Liu, Jianshu Chen, Lin Xiao, and Le Song. Coupled variational Bayes via optimization embedding. In Advances in Neural Information Processing Systems, 2018.

Bo Dai, Hanjun Dai, Arthur Gretton, Le Song, Dale Schuurmans, and Niao He. Kernel exponential family estimation via doubly dual embedding. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 2321–2330, 2019.

Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard Hovy, and Aaron Courville. Calibrating energy-based generative adversarial networks. In International Conference on Learning Representations, 2017.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In International Conference on Learning Representations, 2017.

Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.

Yihao Feng, Dilin Wang, and Qiang Liu. Learning to draw samples with amortized Stein variational gradient descent. In Conference on Uncertainty in Artificial Intelligence, 2017.

Wenbo Gong, Yingzhen Li, and José Miguel Hernández-Lobato. Meta-learning for stochastic gradient MCMC. In International Conference on Learning Representations, 2019.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

Will Grathwohl, Ricky TQ Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2019.

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.

Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.

Insu Han, Dmitry Malioutov, and Jinwoo Shin. Large-scale log-determinant computation through stochastic Chebyshev expansions. In International Conference on Machine Learning, pages 908–917, 2015.

Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Aapo Hyvärinen. Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6:695–709, 2005.

Aapo Hyvärinen. Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, 18(5):1529–1531, 2007.

Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.

Ross Kindermann and J. Laurie Snell. Markov Random Fields and Their Applications. Amer. Math. Soc., Providence, RI, 1980.
Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1×1 convolutions. In Advances in Neural Information Processing Systems, 2018.

Diederik P Kingma and Yann LeCun. Regularized estimation of image statistics by score matching. In Advances in Neural Information Processing Systems, 2010.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.

John Lafferty, Andrew McCallum, and Fernando C.N. Pereira. Conditional random fields: Probabilistic modeling for segmenting and labeling sequence data. In Proceedings of International Conference on Machine Learning, volume 18, pages 282–289, San Francisco, CA, 2001. Morgan Kaufmann.

Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.

Daniel Levy, Matthew D Hoffman, and Jascha Sohl-Dickstein. Generalizing Hamiltonian Monte Carlo with neural networks. In International Conference on Learning Representations, 2018.

Bruce G. Lindsay. Composite likelihood methods. Contemporary Mathematics, 80(1):221–239,

Qiang Liu and Dilin Wang. Learning deep energy models: Contrastive divergence vs. amortized MLE. arXiv preprint arXiv:1707.00797, 2017.

Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.

Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models.
In Proceedings of the 29th International Conference on Machine Learning, 2012.

Radford M Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.

Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pages 1278–1286, 2014.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.

R. Tyrrell Rockafellar. Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, Princeton, NJ, 1970.

Jascha Sohl-Dickstein, Peter Battaglino, and Michael R DeWeese. Minimum probability flow learning. In Proceedings of the 28th International Conference on Machine Learning, pages 905–912, 2011.

Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-NICE-MC: Adversarial training for MCMC. In Advances in Neural Information Processing Systems, pages 5140–5150, 2017.

Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, and Revant Kumar. Density estimation in infinite dimensional exponential families. The Journal of Machine Learning Research, 18(1):1830–1888, 2017.

Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the International Conference on Machine Learning, 2008.

Tijmen Tieleman and Geoffrey Hinton. Using fast weights to improve persistent contrastive divergence. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1033–1040. ACM, 2009.

David Vickrey, Cliff Chiung-Yu Lin, and Daphne Koller. Non-local contrastive objectives. In Proceedings of the International Conference on Machine Learning, 2010.

Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference.
Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

Li Wenliang, Dougal Sutherland, Heiko Strathmann, and Arthur Gretton. Learning deep kernels for exponential family densities. In International Conference on Machine Learning, 2019.

Ying Nian Wu, Jianwen Xie, Yang Lu, and Song-Chun Zhu. Sparse and deep generalizations of the FRAME model. Annals of Mathematical Sciences and Applications, 3(1):211–254, 2018.

Linfeng Zhang, Weinan E, and Lei Wang. Monge-Ampère flow for generative modeling. arXiv preprint arXiv:1809.10188, 2018.

Song Chun Zhu and David Mumford. GRADE: Gibbs reaction and diffusion equations. In Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), pages 847–854. IEEE, 1998.