# PixelSNAIL: An Improved Autoregressive Generative Model

Xi Chen ¹ ², Nikhil Mishra ¹ ², Mostafa Rohaninejad ¹ ², Pieter Abbeel ¹ ²

¹ covariant.ai  ² UC Berkeley, EECS Dept. Correspondence to: Xi Chen.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Autoregressive generative models achieve the best results in density estimation tasks involving high-dimensional data, such as images or audio. They pose density estimation as a sequence-modeling task, where a recurrent neural network (RNN) models the conditional distribution over the next element conditioned on all previous elements. In this paradigm, the bottleneck is the extent to which the RNN can model long-range dependencies, and the most successful approaches rely on causal convolutions. Taking inspiration from recent work in meta reinforcement learning, where dealing with long-range dependencies is also essential, we introduce a new generative model architecture that combines causal convolutions with self-attention. In this paper, we describe the resulting model and present state-of-the-art log-likelihood results on heavily benchmarked datasets: CIFAR-10 (2.85 bits per dim), 32×32 ImageNet (3.80 bits per dim) and 64×64 ImageNet (3.52 bits per dim). Our implementation will be made available at anonymized.

## 1. Introduction

Autoregressive generative models over high-dimensional data x = (x_1, …, x_n) factor the joint distribution as a product of conditionals:

$$p(x) = p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})$$

A recurrent neural network (RNN) is then trained to model p(x_i | x_{1:i−1}). Optionally, the model can be conditioned on additional global information h (such as a class label, when applied to images), in which case it models p(x_i | x_{1:i−1}, h). Such methods are highly expressive and allow modeling complex dependencies.
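To make the factorization concrete, here is a toy numpy sketch (our own illustration, not the paper's model): a hypothetical `cond_prob` stands in for the network that maps a prefix to the next conditional, and the joint log-likelihood is the sum of conditional log-probabilities. Because the factorization is exact, the probabilities of all possible sequences sum to one.

```python
import numpy as np
from itertools import product

# Hypothetical stand-in for the learned conditional p(x_i = 1 | x_1..x_{i-1});
# here just a fixed rule biased by the running mean of the prefix.
def cond_prob(x_prev):
    return 0.25 + 0.5 * (np.mean(x_prev) if len(x_prev) else 0.5)

def log_likelihood(x):
    # log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1})
    total = 0.0
    for i, xi in enumerate(x):
        p1 = cond_prob(x[:i])
        total += np.log(p1 if xi == 1 else 1.0 - p1)
    return total

# Sanity check: summing p(x) over all 2^4 binary sequences gives 1,
# since the conditionals telescope into a valid joint distribution.
Z = sum(np.exp(log_likelihood(list(s))) for s in product([0, 1], repeat=4))
```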
Compared to GANs (Goodfellow et al., 2014), neural autoregressive models offer tractable likelihood computation and ease of training, and have been shown to outperform latent variable models (van den Oord et al., 2016c;b; Salimans et al., 2017). The main design consideration is the neural network architecture used to implement the RNN, as it must be able to easily refer to earlier parts of the sequence. A number of possibilities exist:

- Traditional RNNs, such as GRUs or LSTMs: these propagate information by keeping it in their hidden state from one timestep to the next. This temporally linear dependency significantly inhibits the extent to which they can model long-range relationships in the data.
- Causal convolutions (van den Oord et al., 2016b; Salimans et al., 2017): these apply convolutions over the sequence (masked or shifted so that the current prediction is only influenced by previous elements). They offer high-bandwidth access to the earlier parts of the sequence, but their receptive field has a finite size, and they still experience noticeable attenuation for elements far away in the sequence.
- Self-attention (Vaswani et al., 2017): these models turn the sequence into an unordered key-value store that can be queried based on content. They feature an unbounded receptive field and allow undeteriorated access to information far away in the sequence. However, they offer only pinpoint access to small amounts of information, and require an additional mechanism to incorporate positional information.

Causal convolutions and self-attention demonstrate complementary strengths and weaknesses: the former allow high-bandwidth access over a finite context size, while the latter allow access over an infinitely large context. Interleaving the two thus offers the best of both worlds, where the model can have high-bandwidth access without constraints on the amount of information it can effectively use.
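The causality constraint in the second bullet can be sketched in one dimension (a minimal numpy illustration, assuming "causal" means output t sees only inputs at positions ≤ t, implemented here by left-padding; the function name is our own):

```python
import numpy as np

def causal_conv1d(x, w):
    """1-D causal convolution: y[t] depends only on x[t], x[t-1], ..."""
    k = len(w)
    xp = np.concatenate([np.zeros(k - 1), x])  # left pad so no future leaks in
    return np.array([np.dot(xp[t:t + k], w[::-1]) for t in range(len(x))])

x = np.arange(6, dtype=float)       # [0, 1, 2, 3, 4, 5]
w = np.array([1.0, 1.0, 1.0])       # running sum over the last 3 elements
y = causal_conv1d(x, w)             # y[t] = x[t] + x[t-1] + x[t-2]
```

Perturbing a future input leaves earlier outputs unchanged, which is exactly the property masked/shifted convolutions enforce; the finite kernel, however, means each layer only reaches k − 1 steps into the past.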
The convolutions can be seen as aggregating information to build the context over which to perform an attentive lookup. Using this approach (dubbed SNAIL), Mishra et al. (2017) demonstrated significant performance improvements on a number of tasks in a meta-learning setup, where the challenge of long-term temporal dependencies is also prevalent, as an agent should be able to adapt its behavior based on past experience. In this paper, we consider the task of autoregressive generative modeling by taking inspiration from SNAIL, as the fundamental bottleneck of access to past information is the same. Building off the current state of the art in generative models, a class of convolution-based architectures known as PixelCNNs (van den Oord et al. (2016b) and Salimans et al. (2017)), we present a new architecture, PixelSNAIL, that incorporates ideas from Mishra et al. (2017) to obtain state-of-the-art results on the heavily benchmarked CIFAR-10, 32×32 ImageNet and 64×64 ImageNet datasets.

## 2. Methodology

For self-containedness, we first review the formulation of modeling high-dimensional natural images with neural autoregressive models and describe prior works' strengths and weaknesses. Next, we elaborate on the design principles behind PixelSNAIL and introduce a family of architectures that achieves good performance.

### 2.1. Neural Autoregressive Image Modeling

Natural images are usually represented as 3-dimensional random variables of shape Height × Width × 3, where 3 color channels (RGB) are recorded at each spatial location. To model such a random variable autoregressively, one can first impose an ordering and then factor the joint distribution as a product of conditionals over that ordering:

$$p(x) = p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})$$

For natural images, most prior works have chosen to use the raster scan ordering (Oord et al., 2016b; van den Oord et al., 2016b; Salimans et al., 2017), where along each row left pixels come before right pixels, and top rows come before bottom rows.

Figure 1. Raster Scan Ordering

State-of-the-art neural autoregressive models employ causal convolution models to represent the conditional distributions (van den Oord et al., 2016b; Salimans et al., 2017). In this type of architecture, the input image x is processed through a series of causal convolutions, and the outcome is a 3D tensor of shape Height × Width × Channels, where at each spatial location (x, y) a vector of length Channels describes the sufficient statistics for the conditional p(x_i | x_{<i}) with i = x · Width + y. In order for the probability model to be valid (and causal), the conditional distribution for x_i should only depend on pixel values before i. Such constraints are enforced via either masked convolution (Oord et al., 2016b) or shift-based convolution (van den Oord et al., 2016b). In masked convolution (illustrated in Figure 2), a normal convolution is applied, but the filter is masked in such a way that it cannot depend on values at the current or later pixel locations.

Figure 2. An example masked 5×5 filter

van den Oord et al. (2016b) pointed out that masked convolutions, though causal, are limited in terms of expressiveness since they create blind spots in the receptive field. To address this problem, they introduced shift-based convolutions: at each layer, ordinary convolutions are applied, and then the whole feature map is shifted to maintain causality. One of the benefits of using causal convolution architectures is that, given a single image x, all the conditional distributions can be calculated in just one forward pass.
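A raster-scan filter mask of the kind shown in Figure 2 can be constructed as follows (a small numpy sketch under our own naming; the real PixelCNN masks additionally distinguish channel orderings, which we omit here):

```python
import numpy as np

def raster_mask(k):
    """k x k mask that zeroes the current pixel and everything after it
    in raster order: keep rows strictly above, plus the pixels strictly
    to the left in the center row."""
    m = np.zeros((k, k))
    c = k // 2            # filter center = current pixel
    m[:c, :] = 1.0        # all rows above the current one
    m[c, :c] = 1.0        # same row, strictly to the left
    return m

mask = raster_mask(5)     # multiply an ordinary 5x5 filter by this mask
```

Multiplying an ordinary convolution filter elementwise by `mask` before applying it yields a causal convolution under the raster ordering; stacking such layers is what produces the blind spot mentioned below.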
Since all conditionals are calculated in parallel through highly optimized convolution operations, causal convolution architectures are efficient and scalable to high-dimensional density modeling problems. However, convolution operations, by nature, only aggregate information locally; in order to model long-range dependencies, the receptive field must grow by repeatedly applying convolutions. Noticing this problem, van den Oord et al. (2016a) and Salimans et al. (2017) respectively proposed to use dilated convolutions and strided convolutions (followed by corresponding upsampling) to achieve faster receptive-field growth. The resulting improvements in density-estimation performance suggest that improving the model's ability to capture long-range dependencies is essential.

One should note that, even with dilated or strided convolutions, access to information at remote pixel locations is still limited: the information needs to be relayed through a series of intermediate locations, since each convolution operation only operates in a limited context. We will explore architectural decisions that offer better information access to pixels far away from any conditional distribution, and show that the improved ability to model long-range statistics leads to better density modeling performance.

Even though all prior works use raster scan ordering (to the best of our knowledge), it's worth noting that any ordering is equivalent in expressiveness: for any arbitrary ordering, the joint distribution over x can be expressed as a product of conditionals. However, for particular ordering choices, the conditional p(x_i | x_1, …, x_{i−1}) might be a complex distribution that our current modeling tools, like convolutional networks, are incapable of expressing. As such, it could be beneficial to explore other orderings that give rise to conditional distributions that are easier to learn.
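The receptive-field arithmetic behind this argument can be sketched directly (our own back-of-the-envelope calculation, not figures from the paper): for stacked 1-D causal convolutions of kernel size k, each layer with dilation d extends the receptive field by (k − 1) · d, so doubling dilations gives exponential rather than linear growth in depth.

```python
def receptive_field(kernel, dilations):
    """Receptive field of stacked dilated convolutions: each layer with
    dilation d adds (kernel - 1) * d positions of context."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

plain = receptive_field(3, [1, 1, 1, 1])     # 4 ordinary layers -> 9
dilated = receptive_field(3, [1, 2, 4, 8])   # 4 dilated layers  -> 31
```

With the same four layers, dilation more than triples the context, which is the "faster receptive field growth" the text refers to; the information from a distant pixel must still pass through intermediate activations, however.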
We know that the conditional distribution of a pixel location is mostly influenced by the values of its neighboring pixels (Salimans et al., 2017), but the widely used raster scan ordering only makes a small number of neighboring pixels available in the conditioning context x_1, …, x_{i−1}: only those to the left and above. Most of the context is wasted on regions that might have little correlation with the current pixel, like the far top-right corner. One possible alternative is zigzag ordering, which allows each conditional distribution to depend on pixels both to the left and above:

Figure 3. Zigzag Ordering

However, such an ordering introduces blind spots when combined with masked or shift-based convolutional architectures. Motivated by these issues, we introduce the PixelSNAIL model family, which generalizes the causal convolution architectures discussed thus far by allowing a much larger and more flexible receptive field. As a result, PixelSNAIL models achieve superior modeling performance.

### 2.2. PixelSNAIL

The key idea behind PixelSNAIL is to introduce attention blocks, in a style similar to the self-attention of Vaswani et al. (2017) and Mishra et al. (2017), into neural autoregressive modeling. As explained previously, the ability to model long-range dependencies is crucial to performance, so it's natural to use attention blocks to equip all conditionals with the ability to refer to all of their available context. An attention block applies one key-value lookup for the feature vector at every spatial location, and the lookups for all spatial locations are done in parallel to exploit GPU parallelism. Concretely, an attention block of type H × W × C_1 → H × W × C_2 defines 3 functions that operate on feature vectors:

$$f_{\text{key}} : \mathbb{R}^{C_1} \to \mathbb{R}^{\text{Dim}_{\text{key}}}, \quad f_{\text{query}} : \mathbb{R}^{C_1} \to \mathbb{R}^{\text{Dim}_{\text{key}}}, \quad f_{\text{value}} : \mathbb{R}^{C_1} \to \mathbb{R}^{C_2}$$

According to some autoregressive ordering, we can name the feature vectors of a 2D feature map y as y_1, y_2, …, y_N.
Then, for z = attention(y), the mapping is defined as:

$$z_i = \sum_{j < i} \operatorname{softmax}\big( f_{\text{query}}(y_i)^\top f_{\text{key}}(y_j) \big)\, f_{\text{value}}(y_j)$$
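A minimal numpy sketch of such an attention block follows (our own illustrative rendering, not the paper's implementation: f_key, f_query, f_value are assumed to be per-position linear maps, and each position attends only to strictly earlier positions in the chosen ordering):

```python
import numpy as np

def attention_block(y, Wq, Wk, Wv):
    """Causal key-value attention over N ordered feature vectors.
    y: (N, C1); Wq, Wk: (C1, Dim_key); Wv: (C1, C2)."""
    Q, K, V = y @ Wq, y @ Wk, y @ Wv
    N = y.shape[0]
    logits = Q @ K.T                                   # pairwise query-key scores
    causal = np.tril(np.ones((N, N), dtype=bool), k=-1)  # keep only j < i
    z = np.zeros((N, Wv.shape[1]))
    for i in range(1, N):                              # position 0 has no context
        s = np.where(causal[i], logits[i], -np.inf)
        w = np.exp(s - s[causal[i]].max())             # softmax over j < i
        z[i] = (w / w.sum()) @ V
    return z

rng = np.random.default_rng(0)
y = rng.normal(size=(6, 4))                            # N = 6 positions, C1 = 4
Wq, Wk = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
Wv = rng.normal(size=(4, 2))                           # C2 = 2
z = attention_block(y, Wq, Wk, Wv)
```

Unlike the convolutions above, the lookup reaches every earlier position in one step regardless of distance: perturbing a later feature vector cannot change any earlier output, but an arbitrarily distant earlier one can influence the current prediction directly.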