# Pixel Recurrent Neural Networks

Aäron van den Oord (AVDNOORD@GOOGLE.COM), Nal Kalchbrenner (NALK@GOOGLE.COM), Koray Kavukcuoglu (KORAYK@GOOGLE.COM), Google DeepMind

## Abstract

Modeling the distribution of natural images is a landmark problem in unsupervised learning. This task requires an image model that is at once expressive, tractable and scalable. We present a deep neural network that sequentially predicts the pixels in an image along the two spatial dimensions. Our method models the discrete probability of the raw pixel values and encodes the complete set of dependencies in the image. Architectural novelties include fast two-dimensional recurrent layers and an effective use of residual connections in deep recurrent networks. We achieve log-likelihood scores on natural images that are considerably better than the previous state of the art. Our main results also provide benchmarks on the diverse ImageNet dataset. Samples generated from the model appear crisp, varied and globally coherent.

## 1. Introduction

Generative image modeling is a central problem in unsupervised learning. Probabilistic density models can be used for a wide variety of tasks that range from image compression and forms of reconstruction, such as image inpainting (e.g., see Figure 1) and deblurring, to the generation of new images. When the model is conditioned on external information, possible applications also include creating images based on text descriptions or simulating future frames in a planning task. One of the great advantages of generative modeling is that there are practically endless amounts of image data available to learn from. However, because images are high-dimensional and highly structured, estimating the distribution of natural images is extremely challenging.

*Figure 1. Image completions sampled from a PixelRNN (panels: occluded, completions, original).*

One of the most important obstacles in generative modeling is building complex and expressive models that are also tractable and scalable. This trade-off has resulted in a large variety of generative models, each with its own advantages. Most work focuses on stochastic latent variable models such as VAEs (Rezende et al., 2014; Kingma & Welling, 2013) that aim to extract meaningful representations, but these often come with an intractable inference step that can hinder their performance.

One effective approach to tractably model a joint distribution of the pixels in an image is to cast it as a product of conditional distributions; this approach has been adopted in autoregressive models such as NADE (Larochelle & Murray, 2011) and fully visible sigmoid belief networks (Neal, 1992). The factorization turns the joint modeling problem into a sequence problem, where one learns to predict the next pixel given all the previously generated pixels. But to model the highly nonlinear and long-range correlations between pixels and the complex conditional distributions that result, a highly expressive sequence model is necessary. Recurrent Neural Networks (RNNs) are powerful models that offer a compact, shared parametrization of a series of conditional distributions. RNNs have been shown to excel at hard sequence problems ranging from handwriting generation (Graves, 2013), to character prediction (Sutskever et al., 2011), to machine translation (Kalchbrenner & Blunsom, 2013).
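To make the factorization idea concrete, consider as a small worked example (ours, not from the paper) a 2×2 grayscale image read in raster order as $x_1, \ldots, x_4$; its joint distribution decomposes as

$$p(\mathbf{x}) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2)\, p(x_4 \mid x_1, x_2, x_3),$$

so that a single sequence model, applied at each position, suffices to represent the whole joint distribution.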
A two-dimensional RNN has produced very promising results in modeling grayscale images and textures (Theis & Bethge, 2015). In this paper we advance two-dimensional RNNs and apply them to large-scale modeling of natural images. The resulting PixelRNNs are composed of up to twelve fast two-dimensional Long Short-Term Memory (LSTM) layers. These layers use LSTM units in their state (Hochreiter & Schmidhuber, 1997; Graves & Schmidhuber, 2009) and adopt a convolution to compute at once all the states along one of the spatial dimensions of the data. We design two types of these layers. The first type is the Row LSTM layer, where the convolution is applied along each row; a similar technique is described in (Stollenga et al., 2015). The second type is the Diagonal BiLSTM layer, where the convolution is applied in a novel fashion along the diagonals of the image. The networks also incorporate residual connections (He et al., 2015) around LSTM layers; we observe that this helps with training of the PixelRNN for up to twelve layers of depth.

*Figure 2. Left: To generate pixel x_i one conditions on all the previously generated pixels left of and above x_i. Center: To generate a pixel in the multi-scale case we can also condition on the subsampled image pixels (in light blue). Right: Diagram of the connectivity inside a masked convolution. In the first layer, each of the RGB channels is connected to previous channels and to the context, but is not connected to itself. In subsequent layers, the channels are also connected to themselves.*

We also consider a second, simplified architecture which shares the same core components as the PixelRNN. We observe that Convolutional Neural Networks (CNNs) can also be used as a sequence model with a fixed dependency range, by using masked convolutions (a sketch of the masking is given at the end of this section). The PixelCNN architecture is a fully convolutional network of fifteen layers that preserves the spatial resolution of its input throughout the layers and outputs a conditional distribution at each location.

Both PixelRNN and PixelCNN capture the full generality of pixel inter-dependencies without introducing independence assumptions as in, e.g., latent variable models. The dependencies are also maintained between the RGB color values within each individual pixel. Furthermore, in contrast to previous approaches that model the pixels as continuous values (e.g., Theis & Bethge (2015); Gregor et al. (2014)), we model the pixels as discrete values using a multinomial distribution implemented with a simple softmax layer. We observe that this approach gives both representational and training advantages for our models.

The contributions of the paper are as follows. In Section 3 we design two types of PixelRNNs corresponding to the two types of LSTM layers; we describe the purely convolutional PixelCNN that is our fastest architecture; and we design a Multi-Scale version of the PixelRNN. In Section 5 we show the relative benefits of using the discrete softmax distribution in our models and of adopting residual connections for the LSTM layers. Next we test the models on MNIST and on CIFAR-10 and show that they obtain log-likelihood scores that are considerably better than previous results. We also provide results for the large-scale ImageNet dataset resized to both 32×32 and 64×64 pixels; to our knowledge, likelihood values from generative models have not previously been reported on this dataset. Finally, we give a qualitative evaluation of the samples generated from the PixelRNNs.
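As an illustration of the masked convolution described above, here is a minimal PyTorch sketch. It is our own illustration, not the authors' code: the `MaskedConv2d` name and the mask-"A"/mask-"B" labels for the first versus subsequent layers are conventions assumed for this example, and for brevity it masks only spatial positions, omitting the per-channel R/G/B grouping of feature maps at the center pixel shown in Figure 2 (Right).

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Spatial-masking sketch of a PixelCNN-style convolution.

    Mask type 'A' (first layer) hides the current pixel itself;
    mask type 'B' (subsequent layers) allows it. The full scheme in
    the paper additionally splits feature maps into R/G/B groups so
    each color channel sees only the permitted channels of the
    center pixel; that refinement is omitted here.
    """
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        kH, kW = self.kernel_size
        mask = torch.ones_like(self.weight)  # (out, in, kH, kW)
        # Zero out weights to the right of the center in the center row...
        mask[:, :, kH // 2, kW // 2 + (mask_type == 'B'):] = 0
        # ...and all weights in the rows below the center.
        mask[:, :, kH // 2 + 1:, :] = 0
        self.register_buffer('mask', mask)

    def forward(self, x):
        # Re-apply the mask so gradients never un-mask the weights.
        self.weight.data *= self.mask
        return super().forward(x)
```

A first layer would use, e.g., `MaskedConv2d('A', 3, 64, kernel_size=7, padding=3)`, so that the prediction at each position never conditions on the pixel it is predicting.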
## 2. Model

Our aim is to estimate a distribution over natural images that can be used to tractably compute the likelihood of images and to generate new ones. The network scans the image one row at a time and one pixel at a time within each row. For each pixel it predicts the conditional distribution over the possible pixel values given the scanned context. Figure 2 illustrates this process. The joint distribution over the image pixels is factorized into a product of conditional distributions. The parameters used in the predictions are shared across all pixel positions in the image.

To capture the generation process, Theis & Bethge (2015) propose to use a two-dimensional LSTM network (Graves & Schmidhuber, 2009) that starts at the top-left pixel and proceeds towards the bottom-right pixel. The advantage of the LSTM network is that it effectively handles long-range dependencies that are central to object and scene understanding. The two-dimensional structure ensures that the signals propagate well in both the left-to-right and top-to-bottom directions.

In this section we first focus on the form of the distribution, whereas the next section will be devoted to describing the architectural innovations inside the PixelRNN.

### 2.1. Generating an Image Pixel by Pixel

The goal is to assign a probability $p(\mathbf{x})$ to each image $\mathbf{x}$ formed of $n \times n$ pixels. We can write the image $\mathbf{x}$ as a one-dimensional sequence $x_1, \ldots, x_{n^2}$ where pixels are taken from the image row by row. To estimate the joint distribution $p(\mathbf{x})$ we write it as the product of the conditional distributions over the pixels:

$$p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1}) \tag{1}$$

The value $p(x_i \mid x_1, \ldots, x_{i-1})$ is the probability of the $i$-th pixel $x_i$ given all the previous pixels $x_1, \ldots, x_{i-1}$. The generation proceeds row by row and pixel by pixel. Figure 2 (Left) illustrates the conditioning scheme.

*Figure 3. In the Diagonal BiLSTM, to allow for parallelization along the diagonals, the input map is skewed by offsetting each row by one position with respect to the previous row. When the spatial layer is computed left to right and column by column, the output map is shifted back into the original size. The convolution uses a kernel of size 2 × 1.*

Each pixel $x_i$ is in turn jointly determined by three values, one for each of the color channels Red, Green and Blue (RGB). We rewrite the distribution $p(x_i \mid \mathbf{x}_{<i})$ as the following product:

$$p(x_{i,R} \mid \mathbf{x}_{<i})\, p(x_{i,G} \mid \mathbf{x}_{<i}, x_{i,R})\, p(x_{i,B} \mid \mathbf{x}_{<i}, x_{i,R}, x_{i,G}) \tag{2}$$
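To make Eq. (1) and the channel ordering above concrete, here is a minimal PyTorch sketch, an illustration rather than the paper's code, of how the log-likelihood of an image decomposes into a sum of per-pixel, per-channel log-probabilities. The tensor layout, with a 256-way softmax axis, is an assumption made for this example.

```python
import torch
import torch.nn.functional as F

def image_log_likelihood(logits, image):
    """Log-likelihood of a batch of images under the factorization of Eq. (1).

    `logits`: network outputs, shape (batch, 256, channels, height, width),
        unnormalized scores over the 256 discrete pixel values.
    `image`: integer pixel values, shape (batch, channels, height, width).

    Because the network is masked, the score at each position depends only
    on x_1..x_{i-1} (and the preceding channels of x_i), so summing the
    log-probabilities of the observed values gives log p(x).
    """
    log_probs = F.log_softmax(logits, dim=1)                  # normalize over 256 values
    picked = log_probs.gather(1, image.long().unsqueeze(1))   # log-prob of each true value
    return picked.squeeze(1).sum(dim=(1, 2, 3))               # one scalar per image
```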
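Generation then follows the same raster-scan order. The sketch below is again illustrative: `model` is a hypothetical masked network with the output layout used above, and the loop naively re-runs the full network for every channel, whereas a real implementation would cache intermediate computation.

```python
import torch

def sample_image(model, height, width, channels=3):
    """Sample an image pixel by pixel in raster-scan order.

    `model` maps the partial image (1, channels, height, width) to
    logits of shape (1, 256, channels, height, width); masking ensures
    each output position sees only already-sampled values.
    """
    image = torch.zeros(1, channels, height, width, dtype=torch.long)
    for r in range(height):                # row by row
        for c in range(width):             # pixel by pixel within each row
            for ch in range(channels):     # R, then G, then B
                logits = model(image.float())
                probs = torch.softmax(logits[0, :, ch, r, c], dim=0)
                image[0, ch, r, c] = torch.multinomial(probs, 1).item()
    return image
```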
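Finally, the skewing operation described in the Figure 3 caption can be sketched as follows; the function names and the loop-based implementation are our own simplifications, written for clarity rather than speed.

```python
import torch

def skew(x):
    """Skew an input map as in Figure 3: row r is shifted right by r.

    `x` has shape (batch, channels, height, width); the result has width
    width + height - 1, so each diagonal of the original map becomes a
    column of the skewed map and can be processed in one parallel step.
    """
    b, c, h, w = x.shape
    out = x.new_zeros(b, c, h, w + h - 1)
    for r in range(h):
        out[:, :, r, r:r + w] = x[:, :, r, :]
    return out

def unskew(x, width):
    """Shift a skewed map back to the original (height, width) size."""
    b, c, h, _ = x.shape
    return torch.stack([x[:, :, r, r:r + width] for r in range(h)], dim=2)
```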