# Scaling Autoregressive Video Models

Published as a conference paper at ICLR 2020

Dirk Weissenborn (Google Research, diwe@google.com), Oscar Täckström (Sana Labs, oscar@sanalabs.com), Jakob Uszkoreit (Google Research, usz@google.com)
Equal contribution. Work done while at Google Research.

ABSTRACT

Due to the statistical complexity of video, the high degree of inherent stochasticity, and the sheer amount of data, generating natural video remains a challenging task. State-of-the-art video generation models often attempt to address these issues by combining sometimes complex, usually video-specific neural network architectures, latent variable models, adversarial training and a range of other methods. Despite their often high complexity, these approaches still fall short of generating high quality video continuations outside of narrow domains and often struggle with fidelity. In contrast, we show that conceptually simple autoregressive video generation models based on a three-dimensional self-attention mechanism achieve competitive results across multiple metrics on popular benchmark datasets, for which they produce continuations of high fidelity and realism. We also present results from training our models on Kinetics, a large-scale action recognition dataset comprising YouTube videos that exhibit phenomena such as camera movement, complex object interactions and diverse human movement. While modeling these phenomena consistently remains elusive, we hope that our results, which include occasional realistic continuations, encourage further research on comparatively complex, large-scale datasets such as Kinetics.

1 INTRODUCTION

Generative modeling of video holds promise for applications such as content creation, forecasting, transfer learning and model-based reinforcement learning (Srivastava et al., 2015; Vondrick et al., 2016; Oh et al., 2015; Kaiser et al., 2019). While there has recently been a lot of progress on generative models for text, audio and images, video generation remains challenging. To some extent this is simply due to the large amount of data that needs to be produced. Autoregressive models suffer from this particularly in terms of generation speed. On the other hand, they have a number of desirable attributes, such as their conceptual simplicity and tractable likelihood, which enables straightforward evaluation of their ability to model the entire data distribution. Moreover, recent results on image generation by Menick & Kalchbrenner (2019) show that pixel-level autoregressive models are capable of generating images with high fidelity. These findings motivate the question of how far one can push such autoregressive models in the more general task of video generation when scaling recent advances in neural architectures to modern hardware accelerators.

In this work, we introduce a generalization of the Transformer architecture of Vaswani et al. (2017) using three-dimensional, block-local self-attention. In contrast to the block-local attention mechanism of Parmar et al. (2018), our formulation can be implemented efficiently on Tensor Processing Units, or TPUs (Jouppi et al., 2017). To further reduce the memory footprint, we combine this with a three-dimensional generalization of methods from Menick & Kalchbrenner (2019), who generate images as sequences of smaller, sub-scaled image slices.
Together, these techniques allow us to efficiently model videos as 3D volumes instead of sequences of still image frames, with direct interactions between representations of pixels across the spatial and temporal dimensions.

We obtain strong results on popular benchmarks (Section 4.2, Appendix A) and produce high-fidelity video continuations on the BAIR robot pushing dataset (Ebert et al., 2017) exhibiting plausible object interactions. Furthermore, our model achieves an almost 50% reduction in perplexity compared to prior work on autoregressive models on another robot pushing dataset. Finally, we apply our models to down-sampled videos from the Kinetics-600 dataset (Carreira et al., 2018) (Section 4.3). While modeling the full range of Kinetics-600 videos still poses a major challenge, we see encouraging video continuations for a more limited subset, namely cooking videos. These still feature camera movement and complex object interactions while covering diverse subjects. We hope that these initial results will encourage future video generation work to evaluate models on more challenging datasets such as Kinetics.

2 RELATED WORK

Our setup is closely related to that of Kalchbrenner et al. (2016), who extend work on pixel-level autoregressive image generation (van den Oord et al., 2016b;a) to videos. However, whereas they model the temporal and spatial dimensions separately with dilated convolutions and convolutional LSTMs, respectively, our model is conceptually simpler in that we do not make any distinction between temporal and spatial dimensions and instead rely almost entirely on multi-head self-attention (Vaswani et al., 2017) within the 3D video volume. For comparability, we provide results on Moving MNIST and another robot pushing dataset (Finn et al., 2016a), on which our model achieves an almost 50% reduction in perplexity (see Appendix A). One major drawback of autoregressive models is their notoriously slow generation speed. However, we believe that further research into (partially) parallelizing sampling (Stern et al., 2018), together with future hardware accelerators, will help alleviate this issue and eventually make autoregressive modeling a viable solution even for extremely high-dimensional data such as videos.

To reduce the generally quadratic space complexity of the self-attention mechanism, we use block-local self-attention, generalizing the image generation approaches of Parmar et al. (2018) and Chen et al. (2018) to 3D volumes. In concurrent work, Child et al. (2019) instead use sparse attention after linearizing images to a sequence of pixels. To further reduce memory requirements, we generalize sub-scaling (Menick & Kalchbrenner, 2019) to video. An alternative approach is (optionally hierarchical) multi-scale generation, which has recently been explored both for image generation (Reed et al., 2017; De Fauw et al., 2019) and for video generation (Mathieu et al., 2016).

Earlier work on video generation mostly focused on deterministic approaches (Srivastava et al., 2015; Vondrick et al., 2016; Xingjian et al., 2015; Liu et al., 2017; Jia et al., 2016), which fail to capture the high degree of stochasticity inherent in video. In response, a popular research direction has been that of generative latent-variable video models. In contrast to pixel-level autoregressive models, these posit an underlying latent process in tandem with the observed pixel values.
Work in this category includes variants of variational autoencoders (Babaeizadeh et al., 2018; Denton & Fergus, 2018). To address the issues inherent in these models, most notably the tendency to generate blurry outputs, possibly due to restricted modeling power, inadequate prior distributions, or optimization of a lower bound in place of the true likelihood, various directions have been explored, including the use of adversarial objectives (Mathieu et al., 2016; Vondrick et al., 2016; Lee et al., 2018), hierarchical latent variables (Castrejón et al., 2019), or flow-based models (Kumar et al., 2019). All of these approaches admit significantly faster generation. However, in the adversarial case, they tend to only focus on a subset of the modes in the empirical distribution, while flow-based models struggle with limited modeling power even when using a large number of layers and parameters. A large fraction of earlier work on video generation has encoded specific intuitions about videos, such as explicit modeling of motion (Finn et al., 2016b; Denton & Fergus, 2018) or generation of optical flow (Pătrăucean et al., 2016). The conceptual simplicity of our model, however, is more in line with recent approaches to video classification that process videos by means of 3D convolutions (Carreira & Zisserman, 2017; Xie et al., 2018) or, similar to this work, spatiotemporal self-attention (Girdhar et al., 2018).

Figure 1: Top: Illustration of the subscale video transformer architecture and process flow. We incrementally generate s = 4 × 2 × 2 = 16 video slices. The video slices and their respective generation order are derived from subscaling. In each iteration, we first process the partially padded video (illustrated for slice index (1, 0, 1); black means padding and gray means already generated or visible) by an encoder, the output of which is used as conditioning for decoding the current video slice. After generating a slice, we replace the respective padding in the video with the generated output and repeat the process for the next slice. Bottom: Subscaling in 3D (best viewed in color). The 3D volume is evenly divided by a given subscale factor, here s = (4, 2, 2), and the respective slices are extracted. The whole volume is generated by incrementally predicting the individual, much smaller slices, starting at slice x(0,0,0) (yellow), followed by x(0,0,1) (green), x(0,1,0) (red), etc., in raster-scan order.

3 VIDEO TRANSFORMER

We generalize the one-dimensional Transformer (Vaswani et al., 2017) to explicitly model videos represented as three-dimensional spatiotemporal volumes, without resorting to sequential linearization of the positions in the volume (Child et al., 2019). This allows us to maintain spatial neighborhoods around positions, which is important because the large number of individual positions to be predicted in a video requires limiting the receptive field of the self-attention mechanism to a neighborhood around every position; otherwise, the memory consumption of naive, fully-connected attention blows up quadratically.
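To make the block-local idea concrete, below is a minimal NumPy sketch, not the paper's TPU implementation: the feature volume is partitioned into non-overlapping 3D blocks and plain scaled dot-product attention is applied within each block, so attention cost scales with the block size rather than with the total number of positions. The function name, the block shape, the single attention head, and the omission of learned query/key/value projections and causal masking are all simplifications of ours for illustration.

```python
import numpy as np

def block_local_attention(x, block_shape=(4, 8, 8)):
    """Self-attention restricted to non-overlapping 3D blocks (illustrative sketch).

    x: array of shape (T, H, W, D) holding one feature vector per position.
    block_shape: (t, h, w) size of each attention block; assumed here to
        divide (T, H, W) evenly (a real model would pad the volume first).
    Returns an array of the same shape as x.
    """
    T, H, W, D = x.shape
    bt, bh, bw = block_shape
    # Rearrange the volume into (num_blocks, block_size, D).
    blocks = (x.reshape(T // bt, bt, H // bh, bh, W // bw, bw, D)
                .transpose(0, 2, 4, 1, 3, 5, 6)
                .reshape(-1, bt * bh * bw, D))
    # Scaled dot-product attention within each block only.
    scores = blocks @ blocks.transpose(0, 2, 1) / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ blocks
    # Undo the blocking to recover the original (T, H, W, D) layout.
    return (out.reshape(T // bt, H // bh, W // bw, bt, bh, bw, D)
               .transpose(0, 3, 1, 4, 2, 5, 6)
               .reshape(T, H, W, D))
```

For a 16 × 64 × 64 feature volume with 4 × 8 × 8 blocks, each position attends to 256 positions inside its block rather than to all 65,536 positions of the volume, which is the source of the memory savings described above.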
We model the distribution p(x) over videos x ∈ R^{T×H×W×N_c}, with time, height, width and channel dimensions, respectively, by means of a pixel-channel-level autoregressive factorization. That is, the joint distribution over pixels is factorized into a product of channel intensities for all N_c channels, for each of the N_p = T·H·W pixels, with respect to an ordering π over pixels:

$$p(x) = \prod_{i=0}^{N_p-1} \prod_{k=0}^{N_c-1} p\big(x_{\pi(i)}^{k} \mid x_{\pi(<i)},\, x_{\pi(i)}^{<k}\big),$$

where x_{\pi(<i)} denotes all pixels preceding pixel π(i) in the ordering π, and x_{\pi(i)}^{<k} denotes the previously generated channels of the current pixel.
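As a concrete illustration of the subscaling scheme from Figure 1, which induces the ordering over slices used during generation, the following sketch extracts the s_t · s_h · s_w interleaved slices of a video with strided slicing and returns them in the raster-scan order of their offsets. The function name and the assumption that the subscale factor divides each dimension evenly are ours, not the paper's.

```python
import itertools
import numpy as np

def subscale_slices(video, s=(4, 2, 2)):
    """Split a (T, H, W) video into s_t * s_h * s_w interleaved slices.

    Slice (a, b, c) keeps every s_t-th frame starting at frame a, every
    s_h-th row starting at row b and every s_w-th column starting at
    column c. Slices are returned in raster-scan order of their offsets,
    matching the generation order described in the Figure 1 caption.
    """
    st, sh, sw = s
    return [((a, b, c), video[a::st, b::sh, c::sw])
            for a, b, c in itertools.product(range(st), range(sh), range(sw))]

# Example: a 16x64x64 video with subscale factor (4, 2, 2) yields
# 16 slices of shape (4, 32, 32), starting with slice (0, 0, 0).
video = np.zeros((16, 64, 64))
slices = subscale_slices(video)
assert len(slices) == 16 and slices[0][1].shape == (4, 32, 32)
```

Each slice is much smaller than the full volume, so it can be decoded with the block-local attention sketched above while conditioning on the slices generated before it, as in the encoder-decoder loop of Figure 1.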