Journal of Machine Learning Research 17 (2016) 1-37. Submitted 5/16; Published 9/16.

# Neural Autoregressive Distribution Estimation

Benigno Uria (benigno.uria@gmail.com), Google Deep Mind, London, UK
Marc-Alexandre Côté (marc-alexandre.cote@usherbrooke.ca), Department of Computer Science, Université de Sherbrooke, Sherbrooke, J1K 2R1, QC, Canada
Karol Gregor (karol.gregor@gmail.com), Google Deep Mind, London, UK
Iain Murray (i.murray@ed.ac.uk), School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK
Hugo Larochelle (hlarochelle@twitter.com), Twitter, 141 Portland St, Floor 6, Cambridge MA 02139, USA

Editor: Ruslan Salakhutdinov

## Abstract

We present Neural Autoregressive Distribution Estimation (NADE) models, which are neural network architectures applied to the problem of unsupervised distribution and density estimation. They leverage the probability product rule and a weight-sharing scheme inspired by restricted Boltzmann machines to yield an estimator that is tractable and has good generalization performance. We discuss how NADE models achieve competitive performance in modeling both binary and real-valued observations. We also present how deep NADE models can be trained to be agnostic to the ordering of input dimensions used by the autoregressive product rule decomposition. Finally, we show how to exploit the topological structure of pixels in images using a deep convolutional architecture for NADE.

Keywords: deep learning, neural networks, density modeling, unsupervised learning

## 1. Introduction

Distribution estimation is one of the most general problems addressed by machine learning. From a good and flexible distribution estimator, it is in principle possible to solve a variety of inference problems, such as classification, regression, missing value imputation, and many other predictive tasks. Currently, one of the most common forms of distribution estimation is based on directed graphical models.
In general, these models describe the data generation process as sampling a latent state $h$ from some prior $p(h)$, followed by sampling the observed data $x$ from some conditional $p(x \mid h)$. Unfortunately, this approach quickly becomes intractable and requires approximations when the latent state $h$ increases in complexity. Specifically, computing the marginal probability of the data, $p(x) = \sum_h p(x \mid h)\, p(h)$, is only tractable under fairly constraining assumptions on $p(x \mid h)$ and $p(h)$.

Another popular approach, based on undirected graphical models, gives probabilities of the form $p(x) = \exp\{\phi(x)\}/Z$, where $\phi$ is a tractable function and $Z$ is a normalizing constant. A popular choice for such a model is the restricted Boltzmann machine (RBM), which substantially outperforms mixture models on a variety of binary data sets (Salakhutdinov and Murray, 2008). Unfortunately, we often cannot compute probabilities $p(x)$ exactly in undirected models either, due to the normalizing constant $Z$.

In this paper, we advocate a third approach to distribution estimation, based on autoregressive models and feed-forward neural networks. We refer to our particular approach as Neural Autoregressive Distribution Estimation (NADE). Its main distinguishing property is that $p(x)$ under a NADE model is tractable and can be computed efficiently, given an arbitrary ordering of the dimensions of $x$. The NADE framework was first introduced for binary variables by Larochelle and Murray (2011), and in concurrent work by Gregor and Le Cun (2011). The framework was then generalized to real-valued observations (Uria et al., 2013), and to versions based on deep neural networks that can model the observations in any order (Uria et al., 2014). This paper pulls together an extended treatment of these papers, with more experimental results, including some by Uria (2015).
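To make the tractability contrast concrete, the following is a minimal NumPy sketch (not from the paper; all names and parameters are illustrative) of exact marginalization in a directed latent-variable model. It is only feasible here because the latent state takes a small number of discrete values; with a high-dimensional or continuous $h$, the sum becomes an intractable integral, which is the difficulty described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy directed model: a discrete latent h in {1..K} and binary x in {0,1}^D,
# with p(x) = sum_h p(h) * prod_d Bernoulli(x_d; mu[h, d]).
K, D = 8, 5
prior = rng.dirichlet(np.ones(K))          # p(h), a point on the simplex
mu = rng.uniform(0.1, 0.9, size=(K, D))    # p(x_d = 1 | h)

def marginal(x):
    # Exact marginal p(x) = sum_h p(x | h) p(h): an O(K * D) sum,
    # feasible only because h ranges over K discrete values.
    lik = np.prod(np.where(x, mu, 1 - mu), axis=1)  # p(x | h) for each h
    return float(prior @ lik)

# Sanity check: the marginals of all 2^D binary vectors sum to 1.
xs = np.array([[int(b) for b in np.binary_repr(i, D)] for i in range(2 ** D)])
total = sum(marginal(x) for x in xs)
```

The same enumeration strategy fails as soon as $K$ is replaced by, say, a binary latent vector of length 100, since the sum would then have $2^{100}$ terms.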
We also report new work on modeling 2D images by incorporating convolutional neural networks into the NADE framework. For each type of data, we are able to reach competitive results compared to popular directed and undirected graphical model alternatives.

We consider the problem of modeling the distribution $p(x)$ of input vector observations $x$. For now, we will assume that the dimensions of $x$ are binary, that is, $x_d \in \{0, 1\}$ for each $d$. The model generalizes to other data types, which is explored later (Section 3) and in other work (Section 8). NADE begins with the observation that any $D$-dimensional distribution $p(x)$ can be factored into a product of one-dimensional distributions, in any order $o$ (a permutation of the integers $1, \ldots, D$):

$$p(x) = \prod_{d=1}^{D} p(x_{o_d} \mid x_{o_{<d}})$$
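The factorization above can be sketched in a few lines of NumPy. This is a minimal, untrained illustration (random parameters, identity ordering $o = (1, \ldots, D)$) of evaluating $\log p(x)$ as a sum of Bernoulli conditionals, using the NADE-style shared hidden layer whose pre-activation is updated incrementally as each dimension is folded in; the specific parameterization here anticipates details given later in the paper and should be read as a sketch, not the reference implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
D, H = 5, 16
W = rng.normal(0, 0.1, (H, D))   # shared input-to-hidden weights
V = rng.normal(0, 0.1, (D, H))   # hidden-to-output weights
b = np.zeros(D)                  # output (visible) biases
c = np.zeros(H)                  # hidden biases

def log_prob(x):
    # Accumulate log p(x_d | x_<d) term by term. The pre-activation
    # a = c + W[:, :d] @ x[:d] is updated incrementally with one
    # rank-1 step per dimension, so evaluating p(x) costs O(D * H).
    a = c.copy()
    logp = 0.0
    for d in range(D):
        p_d = sigmoid(b[d] + V[d] @ sigmoid(a))  # p(x_d = 1 | x_<d)
        logp += np.log(p_d if x[d] == 1 else 1.0 - p_d)
        a += W[:, d] * x[d]                      # fold x_d into the state
    return logp

# Because every conditional is a valid Bernoulli, the model is
# normalized by construction: probabilities of all 2^D vectors sum to 1.
xs = np.array([[int(t) for t in np.binary_repr(i, D)] for i in range(2 ** D)])
total = sum(np.exp(log_prob(x)) for x in xs)
```

Note that tractability comes for free here: no partition function $Z$ and no sum over latent states is needed, since each factor is an explicitly normalized one-dimensional distribution.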