# Deep Generative Video Compression

Jun Han, Dartmouth College (junhan@cs.dartmouth.edu)
Salvator Lombardo, Disney Research LA (salvator.d.lombardo@disney.com)
Christopher Schroers, Disney Research|Studios (christopher.schroers@disney.com)
Stephan Mandt, University of California, Irvine (mandt@uci.edu)

Shared first authorship. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

## Abstract

The usage of deep generative models for image compression has led to impressive performance gains over classical codecs, while neural video compression is still in its infancy. Here, we propose an end-to-end, deep generative modeling approach to compress temporal sequences with a focus on video. Our approach builds upon variational autoencoder (VAE) models for sequential data and combines them with recent work on neural image compression. The approach jointly learns to transform the original sequence into a lower-dimensional representation as well as to discretize and entropy code this representation according to predictions of the sequential VAE. Rate-distortion evaluations on small videos from public data sets with varying complexity and diversity show that our model yields competitive results when trained on generic video content. Extreme compression performance is achieved when training the model on specialized content.

## 1 Introduction

The transmission of video content is responsible for up to 80% of consumer internet traffic, and both the overall internet traffic and the share of video data are expected to increase even further in the future (Cisco, 2017). Improving compression efficiency is more crucial than ever. The most commonly used standard is H.264 (Wiegand et al., 2003); more recent codecs include H.265 (Sullivan et al., 2012) and VP9 (Mukherjee et al., 2015). All of these existing codecs follow the same block-based hybrid structure (Musmann et al., 1985), which essentially emerged from engineering and refining this concept over decades. From a high-level perspective, they differ in a huge number of smaller design choices and have grown to become more and more complex systems. While there is room for improving the block-based hybrid approach even further (Fraunhofer, 2018), the question remains as to how much longer significant improvements can be obtained while following the same paradigm.

In the context of image compression, deep learning approaches that are fundamentally different from existing codecs have already shown promising results (Ballé et al., 2018, 2016; Theis et al., 2017; Agustsson et al., 2017; Minnen et al., 2018). Motivated by these successes for images, we propose a first step towards innovating beyond block-based hybrid codecs by framing video compression in a deep generative modeling context. To this end, we propose an unsupervised deep learning approach to encoding video. The approach simultaneously learns the optimal transformation of the video to a lower-dimensional representation and a powerful predictive model that assigns probabilities to video segments, allowing us to efficiently entropy-code the discretized latent representation into a short code length.
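To make the role of the predictive model concrete, recall that an ideal entropy coder spends roughly the negative log-probability (in bits) on each quantized symbol. The toy sketch below, with an entirely hypothetical probability table of our own, illustrates how sharper predictions translate into shorter code lengths; it is not code from the paper.

```python
import numpy as np

# Hypothetical probability table over quantized latent symbols, e.g. as
# predicted by a (sequential) probabilistic model. Values are illustrative only.
probs = {-1: 0.1, 0: 0.8, 1: 0.1}

# A short sequence of quantized latent symbols to be entropy coded.
symbols = [0, 0, 1, 0, -1, 0, 0]

# An ideal entropy coder (e.g. an arithmetic coder) spends about
# -log2 p(symbol) bits per symbol, so likely symbols cost fewer bits.
ideal_bits = -sum(np.log2(probs[s]) for s in symbols)
print(f"estimated code length: {ideal_bits:.2f} bits for {len(symbols)} symbols")
```

The better the model predicts the quantized latents, the fewer bits the entropy coder needs to spend on them, which is why the predictive model is learned jointly with the representation.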
Figure 1: Reconstructed video frames using the established codecs H.265 (left, 21.1 dB @ 0.86 bpp), VP9 (middle, 26.0 dB @ 0.57 bpp), and ours (right, 44.6 dB @ 0.06 bpp), with videos taken from the Sprites data set (Section 4). On specialized content as shown here, higher PSNR values in dB (corresponding to lower distortion) can be achieved at almost an order of magnitude smaller bits-per-pixel (bpp) rates. Compared to the classical codecs, fewer geometrical artifacts are apparent in our approach.

Our end-to-end neural video compression scheme is based on sequential variational autoencoders (Bayer & Osendorfer, 2014; Chung et al., 2015; Li & Mandt, 2018). The transformations to and from the latent representation (the encoder and decoder) are parametrized by deep neural networks and are learned by unsupervised training on videos. These latent states have to be discretized before they can be compressed into binary. Ballé et al. (2016) address this problem by using a box-shaped variational distribution with a fixed width, forcing the VAE to forget all information stored on smaller length scales due to the insertion of noise during training. This paper follows the same paradigm for temporally-conditioned distributions.

A sequence of quantized latent representations still contains redundant information, as the latents are highly correlated. (Lossless) entropy coding exploits this fact to further reduce the expected file size by expressing likely data in fewer bits and unlikely data in more bits. This requires knowledge of the probability distribution over the discretized data that is to be compressed, which our approach obtains from the sequential prior.

Among the many architectural choices that our approach enables, we empirically investigate a model that is well suited for the regime of extreme compression. This model uses a combination of local latent variables, which are inferred from a single frame, and a global state, inferred from a multi-frame segment, to efficiently store a video sequence. The dynamics of the local latent variables are modeled stochastically by a deep generative model. After training, the context-dependent predictive model is used to entropy code the latent variables into binary with an arithmetic coder.

In this paper, we focus on low-resolution video (64×64) as the first step towards deep generative video compression. Figure 1 shows a test example of the possible performance improvements using our approach if the model is trained on restricted content (video game characters). The plots show two frames of a video, compressed and reconstructed by our approach and by classical video codecs. One sees that fine-grained details, such as the hands of the cartoon character, are lost in the classical approach due to artifacts from block motion estimation (low-bitrate regime), whereas our deep learning approach successfully captures these details with less than 10% of the file length.
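For reference, the quality and rate numbers used throughout (PSNR in dB and bits per pixel) can be computed as in the minimal sketch below. It uses synthetic data and hypothetical function names, not the paper's evaluation code, and assumes 8-bit RGB frames with a known compressed file size.

```python
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two uint8 image/video arrays."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def bits_per_pixel(compressed_size_bytes: int, num_frames: int, height: int, width: int) -> float:
    """Average number of bits spent per pixel of the video."""
    return 8.0 * compressed_size_bytes / (num_frames * height * width)

# Toy example: a 10-frame 64x64 RGB clip and a hypothetical 3 kB bitstream.
video = np.random.randint(0, 256, size=(10, 64, 64, 3), dtype=np.uint8)
noisy = np.clip(video.astype(np.int16) + np.random.randint(-3, 4, video.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(video, noisy):.1f} dB, rate: {bits_per_pixel(3072, 10, 64, 64):.3f} bpp")
```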
Our contributions are as follows:

1) **A general paradigm for generative compression of sequential data.** We propose a general framework for compressing sequential data by employing a sequential variational autoencoder (VAE) in conjunction with discretization and entropy coding to build an end-to-end trainable codec.
2) **A new neural codec for video compression.** We employ the above paradigm to build an end-to-end trainable video codec. To the best of our knowledge, this is the first work to utilize a deep generative video model together with discretization and entropy coding to perform video compression.
3) **High compression ratios.** We perform experiments on three public data sets of varying complexity and diversity. Performance is evaluated in terms of rate-distortion curves. For the low-resolution videos considered in this paper, our method is competitive with traditional codecs after training and testing on a diverse set of videos. Extreme compression performance can be achieved on a restricted set of videos containing specialized content if the model is trained on similar videos.
4) **Efficient compression from a global state.** While a deep latent time series model takes temporal redundancies in the video into account, one optional variation of our model architecture tries to compress static information into a separate global variable (Li & Mandt, 2018), which acts similarly to a key frame in traditional methods. We show that this decomposition can be beneficial.

Our paper is organized as follows. In Section 2, we summarize related work before describing our method in Section 3. Section 4 discusses our experimental results. We give our conclusions in Section 5.

## 2 Related Work

The approaches related to our method fall into three categories: deep generative video models, neural image compression, and neural video compression.

**Deep generative video models.** Several works have applied the variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) to stochastically model sequences (Bayer & Osendorfer, 2014; Chung et al., 2015). Babaeizadeh et al. (2018) and Xu et al. (2020) use a VAE for stochastic video generation. He et al. (2018) and Denton & Fergus (2018) apply a long short-term memory (LSTM) in conjunction with a sequential VAE to model the evolution of the latent space across many video frames. Li & Mandt (2018) separate the latent variables of a sequential VAE into local and global variables in order to learn a disentangled representation for video generation. Vondrick et al. (2016) generate realistic videos by using a generative adversarial network (Goodfellow et al., 2014) to learn to separate foreground and background, and Lee et al. (2018) combine variational and adversarial methods to generate realistic videos. This paper also employs a deep generative model to capture the sequential probability distribution of frames from a video source. In contrast to other work, our method learns a continuous latent representation that can be discretized with minimal information loss, as required for further compression into binary. Furthermore, our objective is to convert the original video into a short binary description rather than to generate new videos.

**Neural image compression.** There has been significant work on applying deep learning to image compression. In Toderici et al. (2016, 2017) and Johnston et al. (2018), an LSTM-based codec is used to model spatial correlations of pixel values and can achieve different bit rates without having to retrain the model. Ballé et al. (2016) perform image compression with a VAE and demonstrate how to approximately discretize the VAE latent space by introducing noise during training. This work is refined by Ballé et al. (2018), who improve the prior model (used for entropy coding) beyond the mean-field approximation by transmitting side information in the form of a hierarchical model. Minnen et al. (2018) consider an autoregressive model to achieve a similar effect. Santurkar et al. (2018) study the performance of generative compression on images and suggest that it may be more resilient to bit error rates. These image codecs encode each image independently, and therefore their probabilistic models are stationary with respect to time. In contrast, our method performs compression according to a non-stationary, time-dependent probability model, which typically has lower entropy per pixel.
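To make the noise-based discretization idea concrete, the following sketch (our own illustration, not code from any of the cited works) contrasts the training-time and test-time treatment of a continuous latent: additive uniform noise of unit width stands in for rounding during training, while actual rounding is used at test time.

```python
import torch

def quantize(latents: torch.Tensor, training: bool) -> torch.Tensor:
    """Approximate quantization as commonly used when training neural codecs.

    During training, adding uniform noise of width 1 mimics rounding while
    keeping the objective differentiable (a box-shaped posterior of fixed
    width); at test time the latents are actually rounded to integers.
    """
    if training:
        noise = torch.rand_like(latents) - 0.5  # U(-0.5, 0.5)
        return latents + noise
    return torch.round(latents)

z = torch.randn(4, 16) * 3.0           # hypothetical continuous latents
z_train = quantize(z, training=True)   # noisy surrogate used in the training loss
z_test = quantize(z, training=False)   # integers that are entropy coded into binary
print(z_train[0, :4], z_test[0, :4])
```

At test time, the rounded integers are what actually get entropy coded into the bitstream.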
**Neural video compression.** The use of deep neural networks for video compression is relatively new. Wu et al. (2018) perform video compression through image interpolation between reference frames using a predictive model based on a deep neural network. Chen et al. (2017) and Chen et al. (2019) use a deep neural architecture to predict the most likely frame with a modified form of block motion prediction and store residuals in a lossy representation. Since these works are based on motion estimation and residuals, they are somewhat similar in function and performance to existing codecs. Lu et al. (2019) and Djelouah et al. (2019) also follow a pipeline based on motion estimation and residual computation, as in existing codecs. In contrast, our method is not based on motion estimation, and the full inferred probability distribution over the space of plausible subsequent frames is used for entropy coding the frame sequence (rather than residuals). In a concurrent publication, Habibian et al. (2019) perform video compression by utilizing a 3D variational autoencoder. In this case, the 3D encoder removes temporal redundancy by decorrelating latents, whereas our method uses entropy coding (with time-dependent probabilities) to remove temporal redundancy.

## 3 Deep Generative Video Compression

Our end-to-end approach simultaneously learns to transform a video into a lower-dimensional latent representation and to remove the remaining redundancy in the latents through model-based entropy coding. Section 3.1 gives an overview of the deep generative video coding approach as a whole.

[Figure: pipeline overview; labeled components include the input video, quantization, encoding, binarization, the reconstructed video, and the prior over the latents used for entropy coding after training.]
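The training signal implied by this pipeline can be summarized as a rate-distortion trade-off: a reconstruction (distortion) term plus the code length implied by the model's own prior over the quantized latents. The sketch below is a simplified, single-frame illustration under our own assumptions (a tiny linear encoder/decoder, a fixed Gaussian stand-in for the learned prior, and mean-squared-error distortion), not the exact objective of the paper; all names are hypothetical.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

torch.manual_seed(0)

# Hypothetical tiny encoder/decoder over flattened 64x64x3 frames.
encoder = torch.nn.Linear(64 * 64 * 3, 128)
decoder = torch.nn.Linear(128, 64 * 64 * 3)
prior = Normal(loc=0.0, scale=4.0)      # stand-in for a learned (sequential) prior
beta = 0.01                             # rate-distortion trade-off weight

x = torch.rand(8, 64 * 64 * 3)          # a batch of frames with values in [0, 1]
z = encoder(x)
z_noisy = z + torch.rand_like(z) - 0.5  # training-time surrogate for rounding

# Rate: estimated bits under the prior, using the probability mass of each
# unit-width quantization bin, -log2 [CDF(z + 0.5) - CDF(z - 0.5)].
bin_mass = prior.cdf(z_noisy + 0.5) - prior.cdf(z_noisy - 0.5)
rate_bits = -torch.log2(bin_mass.clamp_min(1e-9)).sum(dim=1).mean()

# Distortion: mean squared error of the reconstruction.
distortion = F.mse_loss(decoder(z_noisy), x)

loss = distortion + beta * rate_bits    # minimized end to end by gradient descent
print(f"distortion={distortion.item():.4f}, rate={rate_bits.item():.1f} bits/frame")
```

In the full model described in this paper, the prior over the latents at each time step is conditioned on previously transmitted latents (and optionally a global state), which is what makes the rate term time-dependent rather than stationary.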