# PixelSNAIL: An Improved Autoregressive Generative Model

Xi Chen ¹ ², Nikhil Mishra ¹ ², Mostafa Rohaninejad ¹ ², Pieter Abbeel ¹ ²

¹ covariant.ai  ² UC Berkeley, EECS Dept. Correspondence to: Xi Chen.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Autoregressive generative models achieve the best results in density estimation tasks involving high-dimensional data, such as images or audio. They pose density estimation as a sequence-modeling task, where a recurrent neural network (RNN) models the conditional distribution over the next element conditioned on all previous elements. In this paradigm, the bottleneck is the extent to which the RNN can model long-range dependencies, and the most successful approaches rely on causal convolutions. Taking inspiration from recent work in meta reinforcement learning, where dealing with long-range dependencies is also essential, we introduce a new generative model architecture that combines causal convolutions with self-attention. In this paper, we describe the resulting model and present state-of-the-art log-likelihood results on heavily benchmarked datasets: CIFAR-10 (2.85 bits per dim), 32×32 ImageNet (3.80 bits per dim) and 64×64 ImageNet (3.52 bits per dim). Our implementation will be made available at anonymized.

## 1. Introduction

Autoregressive generative models over high-dimensional data x = (x_1, …, x_n) factor the joint distribution as a product of conditionals:

$$p(x) = p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})$$

A recurrent neural network (RNN) is then trained to model p(x_i | x_{1:i−1}). Optionally, the model can be conditioned on additional global information h (such as a class label, when applied to images), in which case it models p(x_i | x_{1:i−1}, h). Such methods are highly expressive and allow modeling complex dependencies.
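To make the factorization concrete, here is a toy numpy sketch (our own illustration, not the paper's model): a hypothetical `cond_prob` stands in for the network that maps a prefix to the next conditional, and the joint log-likelihood is the sum of conditional log-probabilities. Because the factorization is exact, the probabilities of all possible sequences sum to one.

```python
import numpy as np
from itertools import product

# Hypothetical stand-in for the learned conditional p(x_i = 1 | x_1..x_{i-1});
# here just a fixed rule biased by the running mean of the prefix.
def cond_prob(x_prev):
    return 0.25 + 0.5 * (np.mean(x_prev) if len(x_prev) else 0.5)

def log_likelihood(x):
    # log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1})
    total = 0.0
    for i, xi in enumerate(x):
        p1 = cond_prob(x[:i])
        total += np.log(p1 if xi == 1 else 1.0 - p1)
    return total

# Sanity check: summing p(x) over all 2^4 binary sequences gives 1,
# since the conditionals telescope into a valid joint distribution.
Z = sum(np.exp(log_likelihood(list(s))) for s in product([0, 1], repeat=4))
```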
Compared to GANs (Goodfellow et al., 2014), neural autoregressive models offer tractable likelihood computation and ease of training, and have been shown to outperform latent variable models (van den Oord et al., 2016c;b; Salimans et al., 2017). The main design consideration is the neural network architecture used to implement the RNN, as it must be able to easily refer to earlier parts of the sequence. A number of possibilities exist:

- Traditional RNNs, such as GRUs or LSTMs: these propagate information by keeping it in their hidden state from one timestep to the next. This temporally linear dependency significantly inhibits the extent to which they can model long-range relationships in the data.
- Causal convolutions (van den Oord et al., 2016b; Salimans et al., 2017): these apply convolutions over the sequence (masked or shifted so that the current prediction is only influenced by previous elements). They offer high-bandwidth access to the earlier parts of the sequence, but their receptive field has a finite size, and they still experience noticeable attenuation for elements far away in the sequence.
- Self-attention (Vaswani et al., 2017): these models turn the sequence into an unordered key-value store that can be queried based on content. They feature an unbounded receptive field and allow undeteriorated access to information far away in the sequence. However, they offer only pinpoint access to small amounts of information, and require an additional mechanism to incorporate positional information.

Causal convolutions and self-attention demonstrate complementary strengths and weaknesses: the former allow high-bandwidth access over a finite context size, while the latter allow access over an infinitely large context. Interleaving the two thus offers the best of both worlds, where the model can have high-bandwidth access without constraints on the amount of information it can effectively use.
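The causality constraint in the second bullet can be sketched in one dimension (a minimal numpy illustration, assuming "causal" means output t sees only inputs at positions ≤ t, implemented here by left-padding; the function name is our own):

```python
import numpy as np

def causal_conv1d(x, w):
    """1-D causal convolution: y[t] depends only on x[t], x[t-1], ..."""
    k = len(w)
    xp = np.concatenate([np.zeros(k - 1), x])  # left pad so no future leaks in
    return np.array([np.dot(xp[t:t + k], w[::-1]) for t in range(len(x))])

x = np.arange(6, dtype=float)       # [0, 1, 2, 3, 4, 5]
w = np.array([1.0, 1.0, 1.0])       # running sum over the last 3 elements
y = causal_conv1d(x, w)             # y[t] = x[t] + x[t-1] + x[t-2]
```

Perturbing a future input leaves earlier outputs unchanged, which is exactly the property masked/shifted convolutions enforce; the finite kernel, however, means each layer only reaches k − 1 steps into the past.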
The convolutions can be seen as aggregating information to build the context over which to perform an attentive lookup. Using this approach (dubbed SNAIL), Mishra et al. (2017) demonstrated significant performance improvements on a number of tasks in a meta-learning setup, where the challenge of long-term temporal dependencies is also prevalent, as an agent should be able to adapt its behavior based on past experience. In this paper, we consider the task of autoregressive generative modeling by taking inspiration from SNAIL, as the fundamental bottleneck of access to past information is the same. Building off the current state of the art in generative models, a class of convolution-based architectures known as PixelCNNs (van den Oord et al. (2016b) and Salimans et al. (2017)), we present a new architecture, PixelSNAIL, that incorporates ideas from Mishra et al. (2017) to obtain state-of-the-art results on the heavily benchmarked CIFAR-10, 32×32 ImageNet and 64×64 ImageNet datasets.

## 2. Methodology

For self-containedness, we first review the formulation of modeling high-dimensional natural images with neural autoregressive models and describe prior works' strengths and weaknesses. Next, we elaborate on the design principles behind PixelSNAIL and introduce a family of architectures that achieves good performance.

### 2.1. Neural Autoregressive Image Modeling

Natural images are usually represented as 3-dimensional random variables of shape Height × Width × 3, where 3 color channels (RGB) are recorded at each spatial location. To model such a random variable autoregressively, one can first impose an ordering and then factor the joint distribution as a product of conditionals over that ordering:

$$p(x) = p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})$$

For natural images, most prior works have chosen to use the raster scan ordering (Oord et al., 2016b; van den Oord et al., 2016b; Salimans et al., 2017), where along each row left pixels come before right pixels, and top rows come before bottom rows.

Figure 1. Raster Scan Ordering

State-of-the-art neural autoregressive models employ causal convolution models to represent the conditional distributions (van den Oord et al., 2016b; Salimans et al., 2017). In this type of architecture, the input image x is processed through a series of causal convolutions, and the outcome is a 3D tensor of shape Height × Width × Channels, where at each spatial location (x, y) a vector of length Channels describes the sufficient statistics for the conditional p(x_i | x_{<i}) with i = x · Width + y. In order for the probability model to be valid (and causal), the conditional distribution for x_i should only depend on pixel values before i. Such constraints are enforced via either masked convolution (Oord et al., 2016b) or shift-based convolution (van den Oord et al., 2016b). In masked convolution (illustrated in Figure 2), a normal convolution is applied, but the filter is masked in such a way that it cannot depend on values at the current or later pixel locations.

Figure 2. An example masked 5×5 filter

van den Oord et al. (2016b) pointed out that masked convolutions, though causal, are limited in terms of expressiveness since they create blind spots in the receptive field. To address this problem, they introduced shift-based convolutions: at each layer, ordinary convolutions are applied, and then the whole feature map is shifted to maintain causality. One of the benefits of using causal convolution architectures is that, given a single image x, all the conditional distributions can be calculated in just one forward pass.
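A raster-scan filter mask of the kind shown in Figure 2 can be constructed as follows (a small numpy sketch under our own naming; the real PixelCNN masks additionally distinguish channel orderings, which we omit here):

```python
import numpy as np

def raster_mask(k):
    """k x k mask that zeroes the current pixel and everything after it
    in raster order: keep rows strictly above, plus the pixels strictly
    to the left in the center row."""
    m = np.zeros((k, k))
    c = k // 2            # filter center = current pixel
    m[:c, :] = 1.0        # all rows above the current one
    m[c, :c] = 1.0        # same row, strictly to the left
    return m

mask = raster_mask(5)     # multiply an ordinary 5x5 filter by this mask
```

Multiplying an ordinary convolution filter elementwise by `mask` before applying it yields a causal convolution under the raster ordering; stacking such layers is what produces the blind spot mentioned below.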
Since all conditionals are calculated in parallel through highly optimized convolution operations, causal convolution architectures are efficient and scalable to high-dimensional density modeling problems. However, convolution operations, by nature, only aggregate information locally; in order to model long-range dependencies, the receptive field must grow by repeatedly applying convolutions. Noticing this problem, van den Oord et al. (2016a) and Salimans et al. (2017) respectively proposed to use dilated convolutions and strided convolutions (followed by corresponding upsampling) to achieve faster receptive-field growth. The resulting improvements in density-estimation performance suggest that improving the model's ability to capture long-range dependencies is essential.

One should note that, even with dilated or strided convolutions, access to information at remote pixel locations is still limited: the information needs to be relayed through a series of intermediate locations, since each convolution operation only operates in a limited context. We will explore architectural decisions that offer better information access to pixels far away from any conditional distribution, and show that the improved ability to model long-range statistics leads to better density modeling performance.

Even though all prior works use raster scan ordering (to the best of our knowledge), it's worth noting that any ordering is equivalent in expressiveness: for any arbitrary ordering, the joint distribution over x can be expressed as a product of conditionals. However, for particular ordering choices, the conditional p(x_i | x_1, …, x_{i−1}) might be a complex distribution that our current modeling tools, like convolutional networks, are incapable of expressing. As such, it could be beneficial to explore other orderings that give rise to conditional distributions that are easier to learn.
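The receptive-field arithmetic behind this argument can be sketched directly (our own back-of-the-envelope calculation, not figures from the paper): for stacked 1-D causal convolutions of kernel size k, each layer with dilation d extends the receptive field by (k − 1) · d, so doubling dilations gives exponential rather than linear growth in depth.

```python
def receptive_field(kernel, dilations):
    """Receptive field of stacked dilated convolutions: each layer with
    dilation d adds (kernel - 1) * d positions of context."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

plain = receptive_field(3, [1, 1, 1, 1])     # 4 ordinary layers -> 9
dilated = receptive_field(3, [1, 2, 4, 8])   # 4 dilated layers  -> 31
```

With the same four layers, dilation more than triples the context, which is the "faster receptive field growth" the text refers to; the information from a distant pixel must still pass through intermediate activations, however.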
We know that the conditional distribution of a pixel location is mostly influenced by the values of its neighboring pixels (Salimans et al., 2017), but the widely used raster scan ordering only makes a small number of neighboring pixels available in the conditioning context x_1, …, x_{i−1}: only those to the left and above. Most of the context is wasted on regions that might have little correlation with the current pixel, like the far top-right corner. One possible alternative is zigzag ordering, which allows each conditional distribution to depend on pixels both to the left and above:

Figure 3. Zigzag Ordering

However, such an ordering introduces blind spots when combined with masked or shift-based convolutional architectures. Motivated by these issues, we introduce the PixelSNAIL model family, which generalizes the causal convolution architectures discussed thus far by allowing a much larger and more flexible receptive field. As a result, PixelSNAIL models achieve superior modeling performance.

### 2.2. PixelSNAIL

The key idea behind PixelSNAIL is to introduce attention blocks, in a style similar to the self-attention of Vaswani et al. (2017) and Mishra et al. (2017), into neural autoregressive modeling. As explained previously, the ability to model long-range dependencies is crucial to performance, so it's natural to use attention blocks to equip all conditionals with the ability to refer to all of their available context. An attention block applies one key-value lookup for the feature vector at every spatial location, and the lookups for all spatial locations are done in parallel to exploit GPU parallelism. Concretely, an attention block of type H × W × C_1 → H × W × C_2 defines 3 functions that operate on feature vectors:

$$f_{\text{key}} : \mathbb{R}^{C_1} \to \mathbb{R}^{\text{Dim}_{\text{key}}}, \quad f_{\text{query}} : \mathbb{R}^{C_1} \to \mathbb{R}^{\text{Dim}_{\text{key}}}, \quad f_{\text{value}} : \mathbb{R}^{C_1} \to \mathbb{R}^{C_2}$$

According to some autoregressive ordering, we can name the feature vectors of a 2D feature map y as y_1, y_2, …, y_N.
Then, for z = attention(y), the mapping is defined as:

$$z_i = \sum_{j < i} \operatorname{softmax}\big( f_{\text{query}}(y_i)^\top f_{\text{key}}(y_j) \big)\, f_{\text{value}}(y_j)$$
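A minimal numpy sketch of such an attention block follows (our own illustrative rendering, not the paper's implementation: f_key, f_query, f_value are assumed to be per-position linear maps, and each position attends only to strictly earlier positions in the chosen ordering):

```python
import numpy as np

def attention_block(y, Wq, Wk, Wv):
    """Causal key-value attention over N ordered feature vectors.
    y: (N, C1); Wq, Wk: (C1, Dim_key); Wv: (C1, C2)."""
    Q, K, V = y @ Wq, y @ Wk, y @ Wv
    N = y.shape[0]
    logits = Q @ K.T                                   # pairwise query-key scores
    causal = np.tril(np.ones((N, N), dtype=bool), k=-1)  # keep only j < i
    z = np.zeros((N, Wv.shape[1]))
    for i in range(1, N):                              # position 0 has no context
        s = np.where(causal[i], logits[i], -np.inf)
        w = np.exp(s - s[causal[i]].max())             # softmax over j < i
        z[i] = (w / w.sum()) @ V
    return z

rng = np.random.default_rng(0)
y = rng.normal(size=(6, 4))                            # N = 6 positions, C1 = 4
Wq, Wk = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
Wv = rng.normal(size=(4, 2))                           # C2 = 2
z = attention_block(y, Wq, Wk, Wv)
```

Unlike the convolutions above, the lookup reaches every earlier position in one step regardless of distance: perturbing a later feature vector cannot change any earlier output, but an arbitrarily distant earlier one can influence the current prediction directly.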