# MADE: Masked Autoencoder for Distribution Estimation

Mathieu Germain MATHIEU.GERMAIN2@USHERBROOKE.CA
Université de Sherbrooke, Canada

Karol Gregor KAROL.GREGOR@GMAIL.COM
Google DeepMind

Iain Murray I.MURRAY@ED.AC.UK
University of Edinburgh, United Kingdom

Hugo Larochelle HUGO.LAROCHELLE@USHERBROOKE.CA
Université de Sherbrooke, Canada

## Abstract

There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.

## 1. Introduction

Distribution estimation is the task of estimating a joint distribution $p(\mathbf{x})$ from a set of examples $\{\mathbf{x}^{(t)}\}_{t=1}^{T}$, which is by definition a general problem. Many tasks in machine learning can be formulated as learning only specific properties of a joint distribution. Thus a good distribution estimator can be used in many scenarios, including classification (Schmah et al., 2009), denoising or missing input imputation (Poon & Domingos, 2011; Dinh et al., 2014), data (e.g. speech) synthesis (Uria et al., 2015) and many others. The very nature of distribution estimation also makes it a particular challenge for machine learning. In essence, the curse of dimensionality has a distinct impact because, as the number of dimensions of the input space of $\mathbf{x}$ grows, the volume of space in which the model must provide a good answer for $p(\mathbf{x})$ exponentially increases.

Fortunately, recent research has made substantial progress on this task. Specifically, learning algorithms for a variety of neural network models have been proposed (Bengio & Bengio, 2000; Larochelle & Murray, 2011; Gregor & LeCun, 2011; Uria et al., 2013; 2014; Kingma & Welling, 2014; Rezende et al., 2014; Bengio et al., 2014; Gregor et al., 2014; Goodfellow et al., 2014; Dinh et al., 2014). These algorithms are showing great potential in scaling to high-dimensional distribution estimation problems. In this work, we focus our attention on autoregressive models (Section 3). Computing $p(\mathbf{x})$ exactly for a test example $\mathbf{x}$ is tractable with these models. However, the computational cost of this operation is still larger than typical neural network predictions for a $D$-dimensional input. For previous deep autoregressive models, evaluating $p(\mathbf{x})$ costs $O(D)$ times more than a simple neural network point predictor.
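To make the autoregressive constraint described in the abstract concrete before the detailed development, the following is a minimal NumPy sketch of the masking idea in its simplest possible form: a direct input-to-output connection matrix multiplied element-wise by a strictly lower-triangular binary mask, so that output $d$ is computed only from inputs $1, \ldots, d-1$. This is our own illustration, not the paper's construction; the variable names, the absence of a hidden layer, and the identity ordering are simplifying assumptions, and MADE's actual mask design for hidden layers is developed later in the paper.

```python
import numpy as np

# Illustration of the autoregressive masking principle: a strictly
# lower-triangular mask on a direct input-to-output weight matrix means
# output d is a function of inputs 1..d-1 only (no hidden layer here).
rng = np.random.default_rng(0)
D = 5
W = rng.normal(size=(D, D))            # unconstrained weights
mask = np.tril(np.ones((D, D)), k=-1)  # mask[d, j] = 1 iff j < d
b = np.zeros(D)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def outputs(x):
    # x_hat[d] can be interpreted as p(x_d = 1 | x_{<d})
    return sigmoid(b + (W * mask) @ x)

# Check the autoregressive property: flipping the third input leaves the
# first three outputs (which depend only on earlier inputs) unchanged.
x = rng.integers(0, 2, size=D).astype(float)
x_flipped = x.copy()
x_flipped[2] = 1.0 - x_flipped[2]
print(np.allclose(outputs(x)[:3], outputs(x_flipped)[:3]))  # True
```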
This paper's contribution is to describe and explore a simple way of adapting autoencoder neural networks that makes them competitive tractable distribution estimators that are faster than existing alternatives. We show how to mask the weighted connections of a standard autoencoder to convert it into a distribution estimator. The key is to use masks that are designed in such a way that the output is autoregressive for a given ordering of the inputs, i.e. that each input dimension is reconstructed solely from the dimensions preceding it in the ordering. The resulting Masked Autoencoder Distribution Estimator (MADE) preserves the efficiency of a single pass through a regular autoencoder. Implementation on a GPU is straightforward, making the method scalable.

The single hidden layer version of MADE corresponds to the previously proposed autoregressive neural network of Bengio & Bengio (2000). Here, we go further by exploring deep variants of the model. We also explore training MADE to work simultaneously with multiple orderings of the input observations and hidden layer connectivity structures. We test these extensions across a range of binary datasets with hundreds of dimensions, and compare its statistical performance and scaling to comparable methods.

## 2. Autoencoders

A brief description of the basic autoencoder, on which this work builds, is required to clearly grasp what follows. In this paper, we assume that we are given a training set of examples $\{\mathbf{x}^{(t)}\}_{t=1}^{T}$. We concentrate on the case of binary observations, where for every $D$-dimensional input $\mathbf{x}$, each input dimension $x_d$ belongs to $\{0, 1\}$. The motivation is to learn hidden representations of the inputs that reveal the statistical structure of the distribution that generated them.

An autoencoder attempts to learn a feed-forward, hidden representation $\mathbf{h}(\mathbf{x})$ of its input $\mathbf{x}$ such that, from it, we can obtain a reconstruction $\widehat{\mathbf{x}}$ which is as close as possible to $\mathbf{x}$. Specifically, we have

$$\mathbf{h}(\mathbf{x}) = \mathbf{g}(\mathbf{b} + \mathbf{W}\mathbf{x}) \quad (1)$$
$$\widehat{\mathbf{x}} = \mathrm{sigm}(\mathbf{c} + \mathbf{V}\mathbf{h}(\mathbf{x})), \quad (2)$$

where $\mathbf{W}$ and $\mathbf{V}$ are matrices, $\mathbf{b}$ and $\mathbf{c}$ are vectors, $\mathbf{g}$ is a nonlinear activation function and $\mathrm{sigm}(a) = 1/(1+\exp(-a))$. Thus, $\mathbf{W}$ represents the connections from the input to the hidden layer, and $\mathbf{V}$ represents the connections from the hidden to the output layer.

To train the autoencoder, we must first specify a training loss function. For binary observations, a natural choice is the cross-entropy loss:

$$\ell(\mathbf{x}) = \sum_{d=1}^{D} -x_d \log \widehat{x}_d - (1 - x_d)\log(1 - \widehat{x}_d). \quad (3)$$

By treating $\widehat{x}_d$ as the model's probability that $x_d$ is 1, the cross-entropy can be understood as taking the form of a negative log-likelihood function. Training the autoencoder corresponds to optimizing the parameters $\{\mathbf{W}, \mathbf{V}, \mathbf{b}, \mathbf{c}\}$ to reduce the average loss on the training examples, usually with (mini-batch) stochastic gradient descent.

One advantage of the autoencoder paradigm is its flexibility. In particular, it is straightforward to obtain a deep autoencoder by inserting more hidden layers between the input and output layers. Its main disadvantage is that the representation it learns can be trivial. For instance, if the hidden layer is at least as large as the input, hidden units can each learn to copy a single input dimension, so as to reconstruct all inputs perfectly at the output layer. One obvious consequence of this observation is that the loss function of Equation 3 isn't in fact a proper log-likelihood function.
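As a reference point for Equations 1–3, here is a minimal NumPy sketch of the single-hidden-layer autoencoder forward pass and its cross-entropy loss. It is our own illustration rather than the paper's code; the choice of sigmoid for the hidden activation $\mathbf{g}$, the layer sizes, and the random parameter values are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_loss(x, W, V, b, c):
    """Cross-entropy reconstruction loss (Eq. 3) for a single binary input x.

    W, b parameterize the encoder (Eq. 1); V, c the decoder (Eq. 2).
    The hidden activation g is taken to be a sigmoid here, an illustrative choice.
    """
    h = sigmoid(b + W @ x)       # hidden representation h(x), Eq. (1)
    x_hat = sigmoid(c + V @ h)   # reconstruction \hat{x}, Eq. (2)
    # Eq. (3): negative log-likelihood of x under per-dimension Bernoullis
    return float(np.sum(-x * np.log(x_hat) - (1 - x) * np.log(1 - x_hat)))

# Example with random parameters on a 4-dimensional binary input.
rng = np.random.default_rng(0)
D, H = 4, 8
W = rng.normal(scale=0.1, size=(H, D))
V = rng.normal(scale=0.1, size=(D, H))
b, c = np.zeros(H), np.zeros(D)
print(autoencoder_loss(np.array([1.0, 0.0, 0.0, 1.0]), W, V, b, c))
```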
Indeed, since perfect reconstruction could be achieved, the implied data distribution $q(\mathbf{x}) = \prod_{d} \widehat{x}_d^{\,x_d} (1 - \widehat{x}_d)^{1 - x_d}$ could be learned to be 1 for any $\mathbf{x}$ and thus not be properly normalized ($\sum_{\mathbf{x}} q(\mathbf{x}) \neq 1$).

## 3. Distribution Estimation as Autoregression

An interesting question is what property we could impose on the autoencoder, such that its output can be used to obtain valid probabilities. Specifically, we'd like to be able to write $p(\mathbf{x})$ in such a way that it could be computed based on the output of a properly corrected autoencoder. First, we can use the fact that, for any distribution, the probability product rule implies that we can always decompose it into the product of its nested conditionals:

$$p(\mathbf{x}) = \prod_{d=1}^{D} p(x_d \,|\, \mathbf{x}_{<d}),$$

where $\mathbf{x}_{<d} = [x_1, \ldots, x_{d-1}]$ denotes the dimensions preceding $x_d$ in the given ordering.
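Under this decomposition, if the $d$-th output of a network is interpreted as $p(x_d = 1 \,|\, \mathbf{x}_{<d})$, the joint probability of a binary vector is obtained by multiplying the corresponding conditional Bernoulli probabilities. A small sketch of this computation, with made-up conditional probabilities for illustration:

```python
import numpy as np

def log_joint_from_conditionals(x, x_hat):
    """log p(x) = sum_d log p(x_d | x_{<d}) for a binary vector x, where
    x_hat[d] is the model's probability that x_d = 1 given the preceding
    dimensions in the chosen ordering."""
    return float(np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat)))

# Toy example: with D = 2, p(x_1 = 1) = 0.9 and p(x_2 = 1 | x_1 = 1) = 0.25,
# the joint probability of x = (1, 0) is 0.9 * 0.75 = 0.675.
x = np.array([1.0, 0.0])
x_hat = np.array([0.9, 0.25])
print(np.exp(log_joint_from_conditionals(x, x_hat)))  # ~0.675
```

Note that $-\log p(\mathbf{x})$ computed this way has exactly the form of the cross-entropy loss of Equation 3; the difference is that each $\widehat{x}_d$ must now depend only on $\mathbf{x}_{<d}$, which is the constraint that turns the loss into a proper negative log-likelihood.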