# Autoregressive Energy Machines

Charlie Nash\*, Conor Durkan\* (\*equal contribution; School of Informatics, University of Edinburgh, United Kingdom)

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019.

## Abstract

Neural density estimators are flexible families of parametric models which have seen widespread use in unsupervised machine learning in recent years. Maximum-likelihood training typically dictates that these models be constrained to specify an explicit density. However, this limitation can be overcome by instead using a neural network to specify an energy function, or unnormalized density, which can subsequently be normalized to obtain a valid distribution. The challenge with this approach lies in accurately estimating the normalizing constant of the high-dimensional energy function. We propose the Autoregressive Energy Machine, an energy-based model which simultaneously learns an unnormalized density and computes an importance-sampling estimate of the normalizing constant for each conditional in an autoregressive decomposition. The Autoregressive Energy Machine achieves state-of-the-art performance on a suite of density-estimation tasks.

## 1. Introduction

Modeling the joint distribution of high-dimensional random variables is a key task in unsupervised machine learning. In contrast to other unsupervised approaches such as variational autoencoders (Kingma & Welling, 2013; Rezende et al., 2014) or generative adversarial networks (Goodfellow et al., 2014), neural density estimators allow for exact density evaluation, and have enjoyed success in modeling natural images (van den Oord et al., 2016b; Dinh et al., 2017; Salimans et al., 2017; Kingma & Dhariwal, 2018), audio data (van den Oord et al., 2016a; Prenger et al., 2018; Kim et al., 2018), and also in variational inference (Rezende & Mohamed, 2015; Kingma et al., 2016). Neural density estimators are particularly useful where the focus is on accurate density estimation rather than sampling, and these models have seen use as surrogate likelihoods (Papamakarios et al., 2019) and approximate posterior distributions (Papamakarios & Murray, 2016; Lueckmann et al., 2017) for likelihood-free inference.

Figure 1: Accurately modeling a distribution with sharp transitions and high-frequency components, such as the distribution of light in an image (a), is a challenging task. We find that an autoregressive energy-based model (c) is able to preserve fine detail lost by an alternative model with explicit conditionals (b, ResMADE).

Neural networks are flexible function approximators, and are promising candidates for learning a probability density function. Typically, neural density models are normalized a priori, but this can hinder flexibility and expressiveness. For instance, many flow-based density estimators (Dinh et al., 2017; Papamakarios et al., 2017; Huang et al., 2018) rely on invertible transformations with tractable Jacobian determinant which map data to a simple base density, so that the log probability of an input point can be evaluated using a change of variables. Autoregressive density estimators (Uria et al., 2013; Germain et al., 2015) often rely on mixtures of parametric distributions to model each conditional. Such families can make it difficult to model the low-density regions or sharp transitions characteristic of multi-modal or discontinuous densities, respectively.
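For reference (our notation, not reproduced from the paper), a flow-based estimator built from an invertible map $f$ with base density $p_z$ evaluates log-density via the standard change-of-variables formula, while an autoregressive estimator with explicit conditionals typically uses, for example, a mixture of Gaussians whose parameters are output by the network:

$$
\log p(\mathbf{x}) = \log p_z\bigl(f(\mathbf{x})\bigr) + \log \left| \det \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} \right|,
\qquad
p(x_d \mid \mathbf{x}_{<d}) = \sum_{k=1}^{K} \pi_k(\mathbf{x}_{<d}) \, \mathcal{N}\!\bigl(x_d ;\, \mu_k(\mathbf{x}_{<d}),\, \sigma_k^2(\mathbf{x}_{<d})\bigr).
$$

Both forms are normalized by construction, which is precisely the a-priori constraint that working with unnormalized conditionals relaxes.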
The contributions of this work are shaped by two main observations. First, an energy function, or unnormalized density, fully characterizes a probability distribution, and neural networks may be better suited to learning such an energy function than an explicit density. Second, decomposing the density-estimation task in an autoregressive manner makes it possible to train such an energy-based model by maximum likelihood, since it is easier to obtain reliable estimates of normalizing constants in low dimensions. Based on these observations, we present a scalable and efficient learning algorithm for an autoregressive energy-based model, which we term the Autoregressive Energy Machine (AEM). Figure 3 provides a condensed overview of how an AEM approximates the density of an input point.

Figure 2: Importance-sampling estimates of log normalizing constants deteriorate with increasing dimension. The target and proposal distributions are spherical Gaussians with σ = 1 and σ = 1.25, respectively. The true log normalizing constant is log Z = 0. We plot the distribution of estimates over 50 trials, with each trial using 20 importance samples.

## 2. Background

### 2.1. Autoregressive neural density estimation

A probability density function assigns a non-negative scalar value $p(\mathbf{x})$ to each vector-valued input $\mathbf{x}$, with the property that $\int p(\mathbf{x}) \, \mathrm{d}\mathbf{x} = 1$ over its support. Given a dataset $\mathcal{D} = \{\mathbf{x}^{(n)}\}_{n=1}^{N}$ of $N$ i.i.d. samples drawn from some unknown $D$-dimensional distribution $p^*(\mathbf{x})$, the density-estimation task is to determine a model $p(\mathbf{x})$ such that $p(\mathbf{x}) \approx p^*(\mathbf{x})$. Neural density estimators are parametric models that make use of neural-network components to increase their capacity to fit complex distributions, and autoregressive neural models are among the best performing. The product rule of probability allows us to decompose any joint distribution $p(\mathbf{x})$ into a product of conditional distributions:

$$
p(\mathbf{x}) = \prod_{d=1}^{D} p(x_d \mid \mathbf{x}_{<d}),
$$

where $\mathbf{x}_{<d} = (x_1, \ldots, x_{d-1})$.
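To make the motivating observation of Figure 2 concrete, the following minimal numpy sketch (our illustration, not the authors' code) reproduces the setup described in the caption: a spherical Gaussian target with σ = 1, treated as an unnormalized density so that the true log Z is 0, a spherical Gaussian proposal with σ = 1.25, and 50 trials of 20 importance samples each; the dimensions swept below are illustrative. The spread of the estimates widens sharply with dimension, which is why restricting importance sampling to the low-dimensional conditionals of the autoregressive decomposition is attractive.

```python
import numpy as np

def spherical_gaussian_logpdf(x, sigma):
    """Log density of a zero-mean spherical Gaussian with scale sigma."""
    dim = x.shape[-1]
    return -0.5 * np.sum((x / sigma) ** 2, axis=-1) - dim * (0.5 * np.log(2 * np.pi) + np.log(sigma))

def log_z_estimate(dim, n_samples=20, sigma_target=1.0, sigma_proposal=1.25):
    """Importance-sampling estimate log Z_hat = log (1/S) sum_s p_tilde(x_s) / q(x_s), x_s ~ q."""
    x = sigma_proposal * np.random.randn(n_samples, dim)       # samples from the proposal q
    log_w = (spherical_gaussian_logpdf(x, sigma_target)
             - spherical_gaussian_logpdf(x, sigma_proposal))   # log importance weights
    return np.logaddexp.reduce(log_w) - np.log(n_samples)      # log-mean-exp of the weights

for dim in [1, 2, 8, 32, 128]:
    estimates = np.array([log_z_estimate(dim) for _ in range(50)])   # 50 trials, as in Figure 2
    print(f"dim={dim:4d}  mean log Z_hat = {estimates.mean():+.3f}  std = {estimates.std():.3f}")
```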