# Video Pixel Networks

Nal Kalchbrenner, Aäron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, Koray Kavukcuoglu
Google DeepMind, London, UK. Correspondence to: Nal Kalchbrenner.
Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017.

## Abstract

We propose a probabilistic video model, the Video Pixel Network (VPN), that estimates the discrete joint distribution of the raw pixel values in a video. The model and the neural architecture reflect the time, space and color structure of video tensors and encode it as a four-dimensional dependency chain. The VPN approaches the best possible performance on the Moving MNIST benchmark, a leap over the previous state of the art, and the generated videos show only minor deviations from the ground truth. The VPN also produces detailed samples on the action-conditional Robotic Pushing benchmark and generalizes to the motion of novel objects.

## 1. Introduction

Video modelling has remained a challenging problem due to the complexity and ambiguity inherent in video data. Current approaches range from mean squared error models based on deep neural networks (Srivastava et al., 2015a; Oh et al., 2015), to models that predict quantized image patches (Ranzato et al., 2014), incorporate motion priors (Patraucean et al., 2015; Finn et al., 2016) or use adversarial losses (Mathieu et al., 2015; Vondrick et al., 2016). Despite the wealth of approaches, future frame predictions that are free of systematic artifacts (e.g. blurring) have been out of reach even on relatively simple benchmarks like Moving MNIST (Srivastava et al., 2015a).

We propose the Video Pixel Network (VPN), a generative video model based on deep neural networks, that reflects the factorization of the joint distribution of the pixel values in a video. The model encodes the four-dimensional structure of video tensors and captures dependencies in the time dimension of the data, in the two space dimensions of each frame and in the color channels of a pixel. This makes it possible to model the stochastic transitions locally from one pixel to the next and more globally from one frame to the next without introducing independence assumptions in the conditional factors. The factorization further ensures that the model stays fully tractable; the likelihood that the model assigns to a video can be computed exactly. The model operates on pixels without preprocessing and predicts discrete multinomial distributions over raw pixel intensities, allowing the model to estimate distributions of any shape.

The architecture of the VPN consists of two parts: resolution-preserving CNN encoders and PixelCNN decoders (van den Oord et al., 2016b). The CNN encoders preserve at all layers the spatial resolution of the input frames in order to maximize representational capacity. The outputs of the encoders are combined over time with a convolutional LSTM that also preserves the resolution (Hochreiter & Schmidhuber, 1997; Shi et al., 2015). The PixelCNN decoders use masked convolutions to efficiently capture space and color dependencies and use a softmax layer to model the multinomial distributions over raw pixel values. The network uses dilated convolutions in the encoders to achieve larger receptive fields and better capture global motion. The network also utilizes newly defined multiplicative units and corresponding residual blocks.
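As an illustration of the masked convolutions mentioned above, the sketch below shows how a convolution kernel can be masked so that the prediction for a pixel depends only on pixels above it and to its left. This is a minimal PyTorch sketch, not the authors' implementation: the class name `MaskedConv2d`, the `mask_type` argument, and the restriction to spatial masking (the additional R→G→B color masking within a pixel is omitted) are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Convolution whose kernel is masked so that output position (i, j) only
    sees inputs above it, to its left, and (for mask 'B') the position itself.
    Hypothetical helper illustrating the spatial masking; the color-channel
    masking used in PixelCNN decoders would be layered on top of this."""

    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')  # 'A' excludes the centre pixel, 'B' includes it
        kh, kw = self.kernel_size
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2 + (mask_type == 'B'):] = 0  # zero out centre/right of centre
        mask[kh // 2 + 1:, :] = 0                          # zero out rows below centre
        self.register_buffer('mask', mask[None, None])     # broadcast over channel dims

    def forward(self, x):
        self.weight.data *= self.mask  # keep "future" positions zeroed before convolving
        return super().forward(x)

# Usage sketch: a first layer with mask 'A' so a pixel never sees its own value;
# deeper layers would use mask 'B'.
layer = MaskedConv2d('A', in_channels=3, out_channels=64, kernel_size=5, padding=2)
frame = torch.rand(1, 3, 64, 64)   # a single RGB frame with intensities in [0, 1]
features = layer(frame)            # shape (1, 64, 64, 64), causal in space
```

In a full PixelCNN decoder the first layer uses mask 'A' and subsequent layers use mask 'B'; an analogous mask over the color planes gives the R→G→B dependency within a pixel described in the model below.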
We evaluate VPNs on two benchmarks. The first is the Moving MNIST dataset (Srivastava et al., 2015a) where, given 10 frames of two moving digits, the task is to predict the following 10 frames. In Sect. 5 we show that the VPN achieves 87.6 nats/frame, a score that is near the lower bound on the loss (calculated to be 86.3 nats/frame); this constitutes a significant improvement over the previous best result of 179.8 nats/frame (Patraucean et al., 2015). The second benchmark is the Robotic Pushing dataset (Finn et al., 2016) where, given two natural video frames showing a robotic arm pushing objects, the task is to predict the following 18 frames. We show that the VPN not only generalizes to new action sequences with objects seen during training, but also to new action sequences involving novel objects not seen during training. Random samples from the VPN preserve remarkable detail throughout the generated sequence. We also define a baseline model that lacks the space and color dependencies. Through evaluation we confirm that these dependencies are crucial for avoiding systematic artifacts in generated videos.

*Figure 1. Dependency map (top) and neural network structure (bottom) for the VPN (left) and the baseline model (right). F̂_t denotes the estimated distribution over frame F_t, from which F_t is sampled. Dashed lines denote masked convolutional layers.*

## 2. Model

In this section we define the probabilistic model implemented by Video Pixel Networks. Let a video x be a four-dimensional tensor of pixel values x_{t,i,j,c}, where the first (temporal) dimension t ∈ {0, ..., T} corresponds to one of the frames in the video, the next two (spatial) dimensions i, j ∈ {0, ..., N} index the pixel at row i and column j in frame t, and the last dimension c ∈ {R, G, B} denotes one of the three RGB channels of the pixel. We let each x_{t,i,j,c} be a random variable that takes values from the RGB color intensities of the pixel. By applying the chain rule to factorize the video likelihood p(x) as a product of conditional probabilities, we can model it in a tractable manner and without introducing independence assumptions:

$$
p(\mathbf{x}) \;=\; \prod_{t=0}^{T} \prod_{i=0}^{N} \prod_{j=0}^{N}
p(x_{t,i,j,B} \mid \mathbf{x}_{<},\, x_{t,i,j,R},\, x_{t,i,j,G})\;
p(x_{t,i,j,G} \mid \mathbf{x}_{<},\, x_{t,i,j,R})\;
p(x_{t,i,j,R} \mid \mathbf{x}_{<})
\tag{1}
$$

Here x_< denotes the context of the pixel at position (i, j) in frame t: the pixel values of all preceding frames together with the values of the pixels above and to the left of position (i, j) in frame t itself.
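To make the tractable-likelihood claim concrete, the sketch below computes the exact negative log-likelihood of a video as the sum of the per-factor categorical log-probabilities in Eq. (1), and reports it in nats per frame, the unit used for the Moving MNIST scores above. This is a minimal NumPy illustration, not the paper's evaluation code: the array shapes, the function name `video_nll_nats`, and the random inputs are assumptions of this sketch.

```python
import numpy as np

def video_nll_nats(logits, video):
    """Exact negative log-likelihood of a video under the factorized model.

    logits: float array (T, H, W, 3, 256) -- unnormalized scores for each
            pixel/channel factor, assumed to be produced in the ordering of
            Eq. (1) (top-to-bottom, left-to-right, R then G then B) so that
            each factor conditions only on previously generated values.
    video:  uint8 array (T, H, W, 3) -- the observed pixel intensities.
    """
    # Stable log-softmax over the 256 possible intensity values.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Pick out log p(observed value) for every (t, i, j, c) factor and sum.
    picked = np.take_along_axis(log_probs, video[..., None].astype(np.int64), axis=-1)
    total_nll = -picked.sum()                        # exact -log p(x), in nats
    return total_nll, total_nll / video.shape[0]     # total and nats per frame

# Example with random predictions on a tiny 10-frame, 64x64 RGB clip.
T, H, W = 10, 64, 64
logits = np.random.randn(T, H, W, 3, 256)
video = np.random.randint(0, 256, size=(T, H, W, 3), dtype=np.uint8)
total, per_frame = video_nll_nats(logits, video)
```

Up to the exact evaluation protocol (e.g. grayscale frames and scoring only the predicted frames), the nats/frame figures quoted above for Moving MNIST are of this form: the exact negative log-likelihood of the target frames divided by their number.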