A Deep and Tractable Density Estimator

Benigno Uria B.URIA@ED.AC.UK
Iain Murray I.MURRAY@ED.AC.UK
School of Informatics, University of Edinburgh

Hugo Larochelle HUGO.LAROCHELLE@USHERBROOKE.CA
Département d'informatique, Université de Sherbrooke

Abstract

The Neural Autoregressive Distribution Estimator (NADE) and its real-valued version, RNADE, are competitive density models of multidimensional data across a variety of domains. These models use a fixed, arbitrary ordering of the data dimensions. One can easily condition on variables at the beginning of the ordering and marginalize out variables at the end of the ordering; other inference tasks, however, require approximate inference. In this work we introduce an efficient procedure to simultaneously train a NADE model for each possible ordering of the variables, by sharing parameters across all these models. We can thus use the most convenient model for each inference task at hand, and ensembles of such models with different orderings are immediately available. Moreover, unlike the original NADE, our training procedure scales to deep models. Empirically, ensembles of Deep NADE models obtain state-of-the-art density estimation performance.

1. Introduction

In probabilistic approaches to machine learning, large collections of variables are described by a joint probability distribution. There is considerable interest in flexible model distributions that can fit and generalize from training data in a variety of applications. To draw inferences from these models, we often condition on a subset of observed variables and report the probabilities of settings of another subset of variables, marginalizing out any unobserved nuisance variables.
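The claim that prefix conditioning and suffix marginalization are easy in a fixed-ordering autoregressive model can be checked on a toy example. The three conditional distributions below use illustrative numbers, not anything from the paper:

```python
# Toy conditionals for three binary variables in the fixed ordering x1, x2, x3,
# so that p(x) = p(x1) p(x2 | x1) p(x3 | x1, x2). Numbers are illustrative.
def p1(x1):         return 0.7 if x1 else 0.3
def p2(x2, x1):     q = 0.8 if x1 else 0.4; return q if x2 else 1 - q
def p3(x3, x1, x2): q = 0.9 if (x1 ^ x2) else 0.2; return q if x3 else 1 - q

def joint(x1, x2, x3):
    return p1(x1) * p2(x2, x1) * p3(x3, x1, x2)

# Marginalizing the trailing variable x3 by brute-force summation ...
brute = {(a, b): sum(joint(a, b, c) for c in (0, 1))
         for a in (0, 1) for b in (0, 1)}
# ... gives the same answer as simply dropping the p3 factor, i.e.,
# "ignoring" the variables at the end of the ordering:
dropped = {(a, b): p1(a) * p2(b, a) for a in (0, 1) for b in (0, 1)}
assert all(abs(brute[k] - dropped[k]) < 1e-12 for k in brute)

# Conditioning on a prefix is equally easy: p(x2 | x1 = 1) is just p2(x2, 1),
# with no renormalization required.
print(dropped)
```

Marginalizing a variable at the *beginning* of the ordering, by contrast, would require summing the remaining factors over all of its values, which is what makes arbitrary marginals hard for a single fixed ordering.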
The solutions to these inference tasks often cannot be computed exactly, and require iterative approximations such as Monte Carlo or variational methods (e.g., Bishop, 2006). Models for which inference is tractable would be preferable.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

NADE (Larochelle & Murray, 2011) and its real-valued variant RNADE (Uria et al., 2013) have been shown to be state-of-the-art joint density models for a variety of real-world datasets, as measured by their predictive likelihood. These models predict each variable sequentially, in an arbitrary order that is fixed at training time. Variables at the beginning of the order can be set to observed values, i.e., conditioned on. Variables at the end of the ordering are not required to make predictions; marginalizing them requires simply ignoring them. However, marginalizing over and conditioning on arbitrary subsets of variables is not easy in general.

In this work, we present a procedure for training a factorial number of NADE models simultaneously: one for each possible ordering of the variables. The parameters of these models are shared, and we optimize the mean cost over all orderings using a stochastic gradient technique. After fitting the shared parameters, we can extract, in constant time, the NADE model whose variable ordering is most convenient for any given inference task. While the different NADE models might not be consistent in their probability estimates, this property is actually something we can leverage to our advantage, by generating ensembles of NADE models on the fly (i.e., without explicitly training any such ensemble) which are even better estimators than any single NADE. In addition, our procedure can train a deep version of NADE at an extra computational expense that is only linear in the number of layers.
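One update of this order-agnostic training can be sketched as follows: sample an ordering, then accumulate the negative log-likelihood of each dimension given the dimensions that precede it under that ordering. The toy one-hidden-layer binary model below, which feeds the masked input and the mask itself to a shared network, is an illustrative assumption, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8

# Shared parameters used by every ordering (toy sizes, illustrative only).
W1 = rng.normal(0, 0.1, size=(H, 2 * D))
b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, size=(D, H))
b2 = np.zeros(D)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ordering_nll(x, order):
    """Negative log-likelihood of binary vector x under one ordering.

    Each dimension order[d] is predicted from the dimensions preceding it
    in `order`; unobserved inputs are masked to zero, and the mask is fed
    to the network so it knows which inputs are missing.
    """
    nll = 0.0
    mask = np.zeros(D)
    for d in order:
        h = np.tanh(W1 @ np.concatenate([x * mask, mask]) + b1)
        p = sigmoid(W2 @ h + b2)[d]   # predictive probability for dimension d
        nll -= x[d] * np.log(p) + (1 - x[d]) * np.log(1 - p)
        mask[d] = 1.0                 # dimension d is now observed
    return nll

x = rng.integers(0, 2, size=D).astype(float)
order = rng.permutation(D)            # one ordering sampled per update
print(ordering_nll(x, order))
```

A stochastic gradient step on this per-ordering loss, with a fresh ordering sampled each update, is an unbiased estimate of descending the mean cost over all D! orderings, since the parameters are shared.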
2. Background: NADE and RNADE

Autoregressive methods use the product rule to factorize the probability density function of a D-dimensional vector-valued random variable x as a product of one-dimensional conditional distributions:

p(\mathbf{x}) = \prod_{d=1}^{D} p(x_{o_d} \mid x_{o_{<d}})
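The factorization above yields a normalized joint for any ordering o of the dimensions: summing the product of conditionals over all configurations telescopes to 1. This can be checked numerically with an arbitrary conditional model; the memoized random table below stands in for the conditionals purely for illustration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
D = 3

def cond_prob(d, x, mask):
    """p(x_d = 1 | observed values): an arbitrary illustrative conditional.

    Any function of the observed dimensions works here; we use a fixed
    random table keyed by the observed configuration (an assumption for
    the demo, not NADE's actual neural parameterization).
    """
    key = (d,) + tuple(int(x[j]) if mask[j] else -1 for j in range(D))
    if key not in cond_prob.table:
        cond_prob.table[key] = rng.uniform(0.1, 0.9)
    return cond_prob.table[key]
cond_prob.table = {}

def log_density(x, order):
    # log p(x) = sum_d log p(x_{o_d} | x_{o_{<d}})
    logp, mask = 0.0, [False] * D
    for d in order:
        p = cond_prob(d, x, mask)
        logp += np.log(p) if x[d] else np.log(1 - p)
        mask[d] = True
    return logp

# Different orderings define different (generally inconsistent) models,
# but each one is a valid, normalized density.
for order in [(0, 1, 2), (2, 0, 1)]:
    total = sum(np.exp(log_density(x, order))
                for x in itertools.product((0, 1), repeat=D))
    assert abs(total - 1.0) < 1e-9
print("both orderings define valid densities")
```

This normalization-for-free property is what makes autoregressive models tractable density estimators: evaluating p(x) requires only D conditional evaluations and no partition function.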