# VCT: A Video Compression Transformer

Fabian Mentzer, George Toderici, David Minnen, Sung Jin Hwang, Sergi Caelles, Mario Lucic, Eirikur Agustsson
Google Research
{mentzer, gtoderici, dminnen, sjhwang, scaelles, lucic, eirikur}@google.com

**Abstract.** We show how transformers can be used to vastly simplify neural video compression. Previous methods have relied on an increasing number of architectural biases and priors, including motion prediction and warping operations, resulting in complex models. Instead, we independently map input frames to representations and use a transformer to model their dependencies, letting it predict the distribution of future representations given the past. The resulting video compression transformer outperforms previous methods on standard video compression data sets. Experiments on synthetic data show that our model learns to handle complex motion patterns such as panning, blurring and fading purely from data. Our approach is easy to implement, and we release code to facilitate future research.

## 1 Introduction

Neural network based video compression techniques have recently emerged to rival their non-neural counterparts in rate-distortion performance [e.g., 1, 17, 30, 42]. These novel methods tend to incorporate various architectural biases and priors inspired by the classic, non-neural approaches. While many authors tend to draw a line between hand-crafted classical codecs and neural approaches, the neural approaches themselves are increasingly hand-crafted, with authors introducing complex connections between the many sub-components. The resulting methods are complicated, challenging to implement, and constrain themselves to work well only on data that matches the architectural biases. In particular, many methods rely on some form of motion prediction followed by a warping operation [e.g., 1, 17, 19, 23, 42, 41]. These methods warp previous reconstructions with the predicted flow, and calculate a residual.

In this paper, we replace flow prediction, warping, and residual compensation with an elegantly simple but powerful transformer-based temporal entropy model. The resulting video compression transformer (VCT) outperforms previous methods on standard video compression data sets, while being free from their architectural biases and priors. Furthermore, we create synthetic data to explore the effect of architectural biases, and show that we compare favourably to previous approaches on the types of videos that the architectural components were designed for (panning on static frames, or blurring), despite our transformer not relying on any of these components. More crucially, we outperform previous approaches on videos that have no obvious matching architectural component (sharpening, fading between scenes), showing the benefit of removing hand-crafted elements and letting a transformer learn everything from data.

We use transformers to compress videos in two steps (see Fig. 1): First, using lossy transform coding [3], we map frames x_i from image space to quantized representations y_i, independently for each frame. From y_i we can recover a reconstruction x̂_i. Second, we let a transformer leverage temporal redundancy by predicting the distribution of the current representation given previously transmitted ones, which we use for entropy coding.
Figure 2: We split each representation y_i into blocks (block width w_c for the current, w_p for the previous representations). We flatten blocks spatially (raster-scan order) to obtain tokens for the transformer, which remain d_C-dimensional since they are just a different view of y_i. The figure shows w_c=3, w_p=5, d_C=5, but we use w_c=4, w_p=8, d_C=192 in practice.

## 2 Related Work

Many neural video compression methods build on the classical flow-based approach: for example, Agustsson et al. [1] introduced a flow predictor that also supports blurring, called Scale Space Flow (SSF), which became a building block for other approaches [42, 30]. Rippel et al. [30] achieved state-of-the-art results by using SSF and more context to predict flow. RNNs and ConvLSTMs were used to build recurrent decoders [13] or entropy models [41]. Some work does not rely on pixel-space flow: Habibian et al. [14] used a 3D autoregressive entropy model, FVC [17] predicted flow in a 2× downscaled feature space, and Liu et al. [20] used a ConvLSTM to predict representations which are transmitted using an iterative quantization scheme. DCVC [19] estimated motion in pixel space but performed residual compensation in a feature space. Liu et al. [21] also losslessly encoded frame-level representations, but relied on CNNs for temporal modelling. Finally, recent work employed GAN losses to increase realism [24, 40].

### 3.1 Overview and Background

**Frame encoding and decoding** A high-level overview of our approach is shown in Fig. 1. We split video coding into two parts. First, we independently encode each frame x_i into a quantized representation y_i = E(x_i) using a CNN-based image encoder E followed by quantization to an integer grid. The encoder downscales spatially and increases the channel dimension, resulting in y_i being a (H, W, d_C)-dimensional feature map, where H, W are 16× smaller than the input image resolution. From y_i, we can recover a reconstruction x̂_i using the decoder D. We train E, D using standard neural image compression techniques to be lossy transforms reaching nearly any desired distortion d(x_i, x̂_i) by varying how large the range of each element in y_i is. For now, let us assume we have a pair E, D reaching a fixed distortion.

**Naive approach** After having lossily converted the sequence of input frames x_i to a sequence of representations y_i = E(x_i), one can naively store all y_i to disk losslessly. To see why this is suboptimal, let each element y_{i,j} of y_i be a symbol in S = {-L, ..., L}. Assuming that all |S| symbols appear with equal probability, i.e., P(y_{i,j}) = 1/|S|, one can transmit y_i using H·W·d_C·log2|S| bits. Using a realistic L=32, this implies that we would need 9 MB, or 2 Gbps at 30 fps, to encode a single HD frame (where H·W·d_C ≈ 1.6M, see Introduction). While arguably inefficient, this is a valid compression scheme which will result in the desired distortion. The aim of this work is to improve this scheme by approximately two orders of magnitude.

**An efficient coding scheme** Given a probability mass function (PMF) P estimating the true distribution Q of symbols in y_i, we can use entropy coding (EC) to transmit y_i with H·W·d_C · E_{y∼Q}[-log2 P(y)] bits.² By using EC, we can encode more frequently occurring values with fewer bits, and hence improve the efficiency. Note that the expectation term representing the average bit count corresponds to the cross-entropy of Q with respect to P. Our main idea is to parameterize P as a conditional distribution using very flexible transformer models, and to minimize the cross-entropy and thus maximize coding efficiency.

²Consistent with neural compression literature but in contrast to information theory, we use P for the model.
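As a rough illustration of these bit counts, here is a minimal sketch of the arithmetic; the 1080p feature-map shape below is our assumption (16× downscaling, d_C = 192, as in the paper), and the peaky PMF is a made-up example:

```python
import jax.numpy as jnp

# Illustrative bit-count arithmetic for the naive scheme.
H, W, d_C = 68, 120, 192                 # ~ ceil(1080/16) x (1920/16) x channels
L = 32
num_symbols = 2 * L + 1                  # S = {-L, ..., L}
print(H * W * d_C)                       # ~1.57M symbols per frame (cf. ~1.6M)
print(float(jnp.log2(num_symbols)))      # ~6.02 bits/symbol under the uniform PMF

# Entropy coding with a model P pays the cross-entropy of Q w.r.t. P, so a
# PMF that concentrates mass on the true symbol is much cheaper per symbol:
P = jnp.full((num_symbols,), 0.002).at[L].set(0.872)   # peaky PMF, sums to 1
print(float(-jnp.log2(P[L])))            # ~0.2 bits for a well-predicted symbol
```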
Figure 3: The transformer operates on the pink set of blocks/tokens b_{i-2}, b_{i-1}, b_i (obtained as shown in Fig. 2). We first extract temporal information z_joint from already transmitted blocks. Tcur is shown predicting P(t_3 | t_S, t_1, t_2, z_joint), where t_S is a learned start token.

We emphasize that we use P for lossless EC; we do not sample from the model to transmit data. Even if the resulting model of P is sub-optimal, y_i can still be stored losslessly, albeit inefficiently. Why would one hope to do better than the uniform distribution over y_i? In principle, the model should be able to exploit the temporal redundancy across frames, and the spatial consistency within frames.

### 3.2 Transformer-based Temporal Entropy Model

To transmit a video of F frames, x_1, ..., x_F, we first map E over each frame, obtaining quantized representations y_1, ..., y_F. Let us assume we already transmitted y_1, ..., y_{i-1}. To transmit y_i, we use the transformer to predict P(y_i | y_{i-2}, y_{i-1}). Using this distribution, we entropy code y_i to create a compressed, binary representation that can be transmitted. To compress the full video, we simply apply this procedure iteratively, letting the transformer predict P(y_j | y_{j-2}, y_{j-1}) for j ∈ {1, ..., F}, padding with zeros when predicting distributions for y_1, y_2. The receiver follows the same procedure to recover all y_j, i.e., it iteratively calculates P(y_j | y_{j-2}, y_{j-1}) to entropy decode each y_j. After obtaining each representation, y_1, y_2, ..., y_F, the receiver generates reconstructions.

**Tokens** When processing the current representation y_i, we split it spatially into non-overlapping blocks of size w_c × w_c, as shown in Fig. 2. For the previous representations y_{i-2}, y_{i-1}, we use corresponding overlapping w_p × w_p blocks (where w_p > w_c) to provide both temporal and spatial context for predicting P(y_i | y_{i-2}, y_{i-1}). Intuitively, the larger spatial extent provides useful context to predict the distribution of the current block. Note that all blocks span a relatively large spatial region in image space due to the downscaling convolutional encoder E. We flatten each block spatially (see Fig. 2) to obtain tokens for the transformers. The transformers then run independently on corresponding blocks/tokens, i.e., tokens of the same color in Fig. 2 get processed together, trading reduced spatial context for parallel execution.³ This independence assumption allows us to focus on a single set of blocks, e.g., the pink blocks in Fig. 2. In the following text and in Fig. 3, we thus show how we predict distributions for the w_c^2 = 16 tokens t_1, t_2, ..., t_16 in block b_i, given the 2·w_p^2 = 128 tokens from the previous blocks b_{i-2}, b_{i-1}.

**Step 1: Temporal Mixer** We use two transformers to extract temporal information from b_{i-2}, b_{i-1}. A first transformer, Tsep, operates separately on each previous block. Then, we concatenate the outputs in the token dimension and run the second transformer, Tjoint, on the result to mix information across time. The output z_joint consists of 2·w_p^2 features, containing everything the model knows about the past.
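A minimal sketch of this block/token extraction under the stated sizes; the zero-padding used for the overlapping context blocks is our assumption (the text does not pin down the padding):

```python
import jax.numpy as jnp

def current_blocks(y, w_c=4):
    """Split y (H, W, d_C) into non-overlapping w_c x w_c blocks and flatten
    each block into w_c**2 tokens of dimension d_C (raster-scan order).
    Assumes H and W are divisible by w_c."""
    H, W, d_C = y.shape
    b = y.reshape(H // w_c, w_c, W // w_c, w_c, d_C)
    b = b.transpose(0, 2, 1, 3, 4)             # (nH, nW, w_c, w_c, d_C)
    return b.reshape(-1, w_c * w_c, d_C)       # (num_blocks, 16, d_C)

def previous_blocks(y_prev, w_c=4, w_p=8):
    """Overlapping w_p x w_p context blocks, one per current block, obtained
    with symmetric zero-padding of (w_p - w_c) // 2 on each side."""
    H, W, d_C = y_prev.shape
    pad = (w_p - w_c) // 2
    yp = jnp.pad(y_prev, ((pad, pad), (pad, pad), (0, 0)))
    blocks = [yp[i:i + w_p, j:j + w_p].reshape(w_p * w_p, d_C)
              for i in range(0, H, w_c) for j in range(0, W, w_c)]
    return jnp.stack(blocks)                   # (num_blocks, 64, d_C)
```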
**Step 2: Within-Block Autoregression** The second part of our method is the masked transformer Tcur, which predicts PMFs for each token using autoregression within the block. We obtain a powerful model by conditioning Tcur on z_joint as well as on already transmitted tokens within the block.

³As a side benefit, the number of tokens for the transformers is not a function of image resolution, unlike ViT-based approaches [10].

Figure 4: Comparing rate-distortion on MCL-JCV (~27 FPS) and UVG (120 FPS). We report bits per pixel (bpp) and megabits per second (mbps), for PSNR and MS-SSIM. For MS-SSIM, we only show methods optimized for it (using tune ssim for HEVC/H.264). Methods compared: VCT (Ours), ELF-VC (ICCV'21), DCVC (NeurIPS'21), FVC (CVPR'21), SSF (CVPR'20), RLVC (J-STSP'21), Liu et al. (ECCV'20), DVC (CVPR'19), HEVC veryslow, HEVC medium, H.264 medium. App. A.5 shows a large version of these plots.

For entropy coding, both the sender and the receiver must be able to obtain exactly the same PMFs, i.e., Tcur must be causal and start from a known initialization point. For the latter, we learn a start token t_S. To send the tokens, we first obtain z_joint. After that, we feed [t_S] to Tcur, obtain P(t_1 | t_S; z_joint), and use entropy coding to store the d_C symbols in token t_1 into a bitstream using P(t_1 | t_S; z_joint). Then, we feed [t_S, t_1], obtain P(t_2 | t_1, t_S; z_joint), store t_2 in the bitstream, and so on. The receiver gets the resulting bitstream and can obtain the same distributions, and thereby the tokens, by first feeding [t_S] to Tcur, obtaining P(t_1 | t_S; z_joint), entropy decoding t_1 from the bitstream, then feeding [t_S, t_1] to obtain P(t_2 | t_1, t_S; z_joint), and so on. Fig. 3 visualizes this for P(t_3 | ...). We run this procedure in parallel over all blocks, and thereby send/receive y_i by running Tcur w_c^2 = 16 times. Each run produces H/w_c · W/w_c · d_C distributions. To ensure causality of Tcur during training, we mask the self-attention blocks similar to [35].

**Independence** Apart from assuming blocks in y_i are independent, we emphasize that each token is a vector and that we assume the symbols within each token are conditionally independent given previous tokens, i.e., Tcur predicts the d_C distributions required for a token at once. One could instead predict a joint distribution over all possible |S|^{d_C} realisations, use channel-autoregression [27], or use vector quantization on tokens. The latter two are interesting directions for future work. Finally, we note that we do not rely on additional side information, in contrast to, e.g., autoregressive image compression entropy models [26, 27].

### 3.3 Architectures

**Transformers** As visualized in Fig. 3, all of our transformers are based on standard architectures [35, 10]. We start by projecting the d_C-dimensional tokens to a d_T-dimensional space (d_T = 768 in our model) using a single fully connected layer, and adding a learned positional embedding. While both Tsep and Tjoint are stacks of multi-head self-attention (MHSA) layers, Tcur uses masked conditional transformer layers, similar to Vaswani et al. [35]: these alternate between masked MHSA layers and MHSA layers that use z_joint as keys (K) and values (V), as shown in Fig. 3.
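A hedged sketch of one such masked conditional layer in Flax; only the alternation of masked self-attention and cross-attention into z_joint follows the text, while the pre-LayerNorm residual structure and MLP width are our assumptions:

```python
import flax.linen as nn

class MaskedConditionalLayer(nn.Module):
    """One Tcur-style layer: causal (masked) self-attention over the current
    block's tokens, cross-attention with z_joint as keys/values, then an MLP."""
    d_t: int = 768
    num_heads: int = 16

    @nn.compact
    def __call__(self, tokens, z_joint):
        # tokens:  (num_blocks, T, d_t)        already-projected current-block tokens
        # z_joint: (num_blocks, 2*w_p**2, d_t) temporal context from Tsep/Tjoint
        causal_mask = nn.make_causal_mask(tokens[..., 0])    # (num_blocks, 1, T, T)
        x = tokens + nn.SelfAttention(num_heads=self.num_heads)(
            nn.LayerNorm()(tokens), mask=causal_mask)
        # Conditioning: queries from the block, keys/values from z_joint.
        x = x + nn.MultiHeadDotProductAttention(num_heads=self.num_heads)(
            nn.LayerNorm()(x), nn.LayerNorm()(z_joint))
        x = x + nn.Dense(self.d_t)(nn.gelu(nn.Dense(4 * self.d_t)(nn.LayerNorm()(x))))
        return x
```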
We use 6 transformer layers for Tsep, 4 for Tjoint, and 5 masked transformer layers for Tcur. We use 16 attention heads everywhere. We learn a separate temporal positional embedding to add to the input of Tjoint.

Figure 5: To understand what types of temporal patterns our transformer has learned to exploit, we synthesize videos representing commonly seen patterns: a) shift, b) sharpen or blur, c) fade (x-axes: amount of motion, sharpen/blur strength, fade speed; y-axis: average evaluation loss). We compare to HEVC, which has built-in support for motion, and SSF, which has built-in support for motion and blurring. VCT learns to handle all patterns purely from data. We refer to the text for a discussion.

**Image encoder E, decoder D** The image encoder and decoder E, D are not the focus of this paper, so we use architectures based on standard image compression approaches [26, 27]. For the encoder, we use 4 strided convolutional layers, downscaling by a factor of 16 in total. For the decoder, we use transposed convolutions and additionally add residual blocks at the low resolutions. We use d_ED = 192 filters for all layers. See App. A.1 for details and an exploration of architecture variants.

### 3.4 Loss and Training Process

The modeling choices introduced in the previous section allow for an efficient training procedure, where we decompose the training into three stages, which enables rapid experimentation (Tab. 1).

| Stage | Components trained | Loss | B | N_F | LR | Steps | steps/s |
|-------|--------------------|------|---|-----|------|-------|---------|
| I | E, D | r + λd | 16 | 1 | 1E-4 | 2M | 100 |
| II | Tsep, Tjoint, Tcur | r | 32 | 3 | 1E-4 | 1M | 10 |
| III | Tsep, Tjoint, Tcur, E, D | r + λd | 32 | 3 | 2.5E-5 | 250k | 5 |

Table 1: We split training into three stages for training efficiency (note the steps/s column). λ controls the rate-distortion trade-off, r is bitrate, d is distortion, B is batch size, N_F the number of frames.

In Stage I we train the per-frame encoder E and decoder D by minimizing the rate-distortion trade-off [43, Sec. 3.1.1]. Let U denote a uniform distribution on [-0.5, 0.5]. We minimize

$$
L_I = \mathbb{E}_{x \sim p_X,\, u \sim U}\Big[\underbrace{-\log p(\tilde{y} + u)}_{\text{bit-rate } r} \;+\; \lambda\, \underbrace{\mathrm{MSE}(x, \hat{x})}_{\text{distortion } d}\Big], \qquad \tilde{y} = E(x),\quad \hat{x} = D(\mathrm{round}_{\mathrm{STE}}(\tilde{y})), \tag{1}
$$

using ỹ to refer to the unquantized representation, where x ∼ p_X are frames drawn from the training set. Intuitively, we want to minimize the reconstruction error under the constraint that we can effectively quantize the encoder output, with λ controlling the trade-off. For Stage I, we thus employ the mean-scale hyperprior [26] approach to estimate p, the de facto standard in neural image compression, which we discard for later stages.⁴ To enable end-to-end training, we also follow [26], adding i.i.d. uniform noise u to ỹ when calculating r, and using straight-through estimation (STE) [33, 27] for gradients when rounding ỹ to feed it to D.

⁴In short, the hyperprior estimates the PMF of y using a VAE [18], by predicting p(y|z), where z is side information transmitted first. We refer to the paper for details [26].

Figure 6: Visualizing the sample mean from the block-autoregressive distribution predicted by the transformer, as we decode more and more tokens per block (0, 2, and 13 tokens, requiring 0 kB, 0.13 kB, and 0.78 kB, respectively; see Sec. 5). We show the kilobytes (kB) required to transmit the decoded (gray) tokens. On the left, we see the two previous reconstructions x̂_{i-2}, x̂_{i-1}. In the middle, we see what the transformer expects at the current frame, before decoding any information (0 kB). The next two images show that as we decode more tokens, the model gets more certain, and the image obtained from the sample mean sharpens. Note that we never sample from the model for actual video coding.
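Returning to Eq. (1), a minimal sketch of the Stage I objective with the noise-based rate term and straight-through rounding; `encoder`, `decoder`, and `neg_log_p` are placeholder callables standing in for E, D and the hyperprior entropy model (their signatures are our assumption):

```python
import jax
import jax.numpy as jnp

def stage1_loss(params, x, lam, rng, encoder, decoder, neg_log_p):
    """Rate-distortion objective of Eq. (1): rate from the noisy y~ + u,
    distortion from the straight-through-rounded y~. Callables are placeholders."""
    y_tilde = encoder(params, x)                                     # unquantized E(x)
    u = jax.random.uniform(rng, y_tilde.shape, minval=-0.5, maxval=0.5)
    rate_bits = neg_log_p(params, y_tilde + u).sum() / jnp.log(2.0)  # nats -> bits
    # Straight-through rounding: forward pass rounds, gradients pass through.
    y_hat = y_tilde + jax.lax.stop_gradient(jnp.round(y_tilde) - y_tilde)
    x_hat = decoder(params, y_hat)
    distortion = jnp.mean((x - x_hat) ** 2)                          # MSE
    return rate_bits + lam * distortion
```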
For Stage II, we train the transformer to obtain p, and only minimize rate:

$$
L_{II} = \mathbb{E}_{(x_1, x_2, x_3) \sim p_{X_{1:3}},\, u \sim U}\big[-\log p(\tilde{y}_3 + u \mid y_1, y_2)\big], \qquad \tilde{y}_i = E(x_i),\quad y_i = \mathrm{round}(\tilde{y}_i), \tag{2}
$$

where (x_1, x_2, x_3) ∼ p_{X_{1:3}} are triplets of adjacent video frames. We assume each of the d_C unquantized elements in each token follows a Gaussian distribution, p = 𝒩, and let the transformer predict d_C means and d_C scales per token. Finally, we finetune everything jointly in Stage III, adding the distortion loss d from Eq. (1) to Eq. (2). We note that it is also possible to train the model from scratch and obtain even better performance, see App. A.2. To obtain a discrete PMF P for the quantized symbols (for entropy coding), we again follow standard practice [4], convolving p with a unit-width box and evaluating it at discrete points, P(y) = ∫_{u∈U} p(y + u) du, y ∈ ℤ [see, e.g., 43, Sec. 3.3.3, for details].

To train, we use random spatio-temporal crops of (B, N_F, 256, 256, 3) pixels, where B is the batch size and N_F the number of frames (values are given in Tab. 1). We use a linearly decaying learning rate (LR) schedule with warmup, where we warm up for 10k steps and then linearly decay from the LR shown in the table to 1E-5. Stage I is trained using λ = 0.01. To navigate the rate-distortion trade-off and obtain results for multiple rates, we fine-tune 9 models in Stage III, using λ = 0.01·2^i, i ∈ {-3, ..., 5}. We train all models on 4 Google Cloud TPUv4 chips.

### 3.5 Latent Residual Predictor (LRP)

To further leverage the powerful representation that the transformer learns, we adapt the latent residual predictor (LRP) from recent work in image compression [27]: the final features z_cur from Tcur have the same spatial dimensions as y_i, and contain everything the transformer knows about the current and previous representations. Since we have to compute them to compute P, they constitute free extra features that are helpful to reconstruct x̂_i. We thus use z_cur by feeding y'_i = y_i + f_LRP(z_cur) to D (we enable this in Stage III), where f_LRP consists of a 1×1 convolution mapping from d_T to d_ED followed by a residual block. We note that this implies that x̂_i = D(y'_i) indirectly depends on y_{i-2}, y_{i-1}, y_i. Since this is a bounded window into the past and y'_i does not depend on previous reconstructions x̂_j, errors cannot accumulate over time.

Given the k already decoded tokens of a block, the transformer predicts a distribution over all unseen (not yet decoded) tokens. Intuitively, if the transformer is certain about the future, this distribution will be concentrated on the actual future tokens we will decode. In Fig. 6, we visualize the sample mean of this distribution by feeding it through D, i.e., we sample N realisations of the unseen tokens in each block, conditioned on the k already decoded ones, for k ∈ {0, 2, 13}. In the middle image in Fig. 6, we show what the transformer expects at the current frame, before decoding any information (k = 0, i.e., 0 bits). We observe that the model is able, to some degree, to learn second-order motion implicitly. The next two images show that as we decode more tokens, the model gets more certain, and the image sharpens.
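The per-element distributions used here (and for entropy coding, Sec. 3.4) are Gaussians discretized by the unit-width box convolution P(y) = ∫ p(y + u) du, i.e., differences of CDFs at y ± 0.5. A minimal sketch; the scale clamping and renormalization are our assumptions:

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

def discrete_gaussian_pmf(mean, scale, L=32):
    """Discretize a predicted Gaussian onto S = {-L, ..., L} by integrating the
    density over unit-width bins: P(y) = c(y + 0.5) - c(y - 0.5)."""
    y = jnp.arange(-L, L + 1, dtype=jnp.float32)      # symbol grid
    scale = jnp.maximum(scale, 1e-6)                  # avoid degenerate scales
    pmf = (norm.cdf(y + 0.5, loc=mean, scale=scale)
           - norm.cdf(y - 0.5, loc=mean, scale=scale))
    return pmf / pmf.sum()                            # renormalize so it sums to 1
```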
### 5.4 Ablations

In Tab. 2, we explore the importance of temporal context from previous frames and latent residual prediction (LRP) on MCL-JCV. We start from a baseline that does not use any previous frames, i.e., an image model used to independently code each frame. Conditioning on one previous frame reduces bitrate by 58%. Using two previous frames yields an additional improvement of 6%. In the final configuration (our model, VCT), which adds LRP, we observe an increase in PSNR of 0.7 dB at the same bitrate. We did not observe further gains from more context.

⁵L = r + λd. To calculate L for HEVC, we find the quality factor q matching our λ via q = argmin_q r(q) + λd(q), which yields q = 25 for λ = 0.01.

### 5.5 Runtime

To obtain runtimes of the transformers (Tsep, Tjoint, Tcur) and the decoder (D), we employ a Google Cloud TPU v4 (single core) using Flax [16], which has an efficient implementation for autoregressive transformers. We use TensorFlow Compression to measure time spent entropy coding (EC), on an Intel Skylake CPU core. In Tab. 3, we report numbers for 1280×720 px, 852×480 px, and 480×360 px. Since this benchmark is not fully end-to-end, we only report an FPS estimate by calculating 1000/(sum of individual runtimes in ms). Note that running Tcur at 720p once only takes 2.8 ms, but we run it w_c^2 = 16 times to decode a frame. To run Tjoint, we only have to run Tsep once per representation, since we can re-use the output of running Tsep on the previous representation.

| Resolution | Tsep and Tjoint | Tcur | EC | D | FPS estimate |
|------------|-----------------|------|----|---|--------------|
| 1080p | 168 ms | 326 ms | 30.5 ms | 168 ms | 1.4 FPS |
| 720p | 37.6 ms | 44.8 ms | 17.0 ms | 49.5 ms | 6.7 FPS |
| 480p | 18.1 ms | 23.1 ms | 9.02 ms | 23.3 ms | 13.6 FPS |
| 360p | 7.3 ms | 14.9 ms | 4.24 ms | 10.1 ms | 27.3 FPS |

Table 3: Runtimes of our components. For ours, we use a Google Cloud TPU v4 to run the transformers and D. Entropy coding (EC) is run on CPU.

Many neural compression methods do not detail inference time and do not have code available, but we copy the results from DCVC [19], FVC [17], and ELF-VC [30] in Table 4.

| Method | Resolution | FPS estimate |
|--------|------------|--------------|
| Ours | 1080p | 1.4 FPS |
| Ours | 720p | 6.7 FPS |
| Ours | 480p | 13.6 FPS |
| Ours | 360p | 27.3 FPS |
| DCVC [19] | 1080p | 1.1 FPS |
| FVC [17] | 1080p | 1.8 FPS |
| ELF-VC [30] | 1080p | 18 FPS |
| ELF-VC [30] | 720p | 35 FPS |

Table 4: Comparing decoding speed to other methods. We directly copy reported results from the respective papers, so platforms are not comparable.

## 6 Conclusion and Future Work

We presented an elegantly simple transformer-based approach to neural video compression, outperforming previous methods without relying on architectural priors such as explicit motion prediction or warping. Notably, our results are achieved by conditioning the transformer only on a 2-frame window into the past. For some types of videos, it would be interesting to scale this up, or to introduce a notion of more long-term memory, possibly leveraging arbitrary reference frames. As mentioned towards the end of Sec. 3.2, various different ways to factorize the distributions could be explored, including vector quantization, channel-autoregression, or changing the independence assumptions around how we split representations into blocks.

**Societal Impact** We hope our method can serve as the foundation for a new generation of video codecs. This could have a net-positive impact on society by reducing the bandwidth needed for video conferencing and video streaming and by better utilizing storage space, therefore increasing the capacity of knowledge preservation.

**Acknowledgements** We thank Basil Mustafa, Ashok Popat, Huiwen Chang, Phil Chou, Johannes Ballé, and Nick Johnston for the insightful discussions and feedback.

## References

[1] Eirikur Agustsson et al. "Scale-space flow for end-to-end optimized video compression". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 8503–8512.
[2] Anurag Arnab et al. "ViViT: A video vision transformer". In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 6836–6846.
[3] Johannes Ballé et al. "Nonlinear transform coding". In: IEEE Journal of Selected Topics in Signal Processing 15.2 (2020), pp. 339–353.
[4] Johannes Ballé et al. "Variational image compression with a scale hyperprior". In: International Conference on Learning Representations (ICLR). 2018.
[5] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. "Is Space-Time Attention All You Need for Video Understanding?" In: International Conference on Machine Learning. PMLR. 2021, pp. 813–824.
[6] Tom Brown et al. "Language models are few-shot learners". In: Advances in Neural Information Processing Systems 33 (2020), pp. 1877–1901.
[7] Aakanksha Chowdhery et al. "PaLM: Scaling language modeling with pathways". In: arXiv preprint arXiv:2204.02311 (2022).
[8] Jacob Devlin et al. "BERT: Pre-training of deep bidirectional transformers for language understanding". In: arXiv preprint arXiv:1810.04805 (2018).
[9] Abdelaziz Djelouah et al. "Neural inter-frame compression for video coding". In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 6421–6429.
[10] Alexey Dosovitskiy et al. "An image is worth 16x16 words: Transformers for image recognition at scale". In: arXiv preprint arXiv:2010.11929 (2020).
[11] Sergey Edunov et al. "Understanding back-translation at scale". In: arXiv preprint arXiv:1808.09381 (2018).
[12] Haoqi Fan et al. "Multiscale vision transformers". In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 6824–6835.
[13] Adam Golinski et al. "Feedback recurrent autoencoder for video compression". In: Proceedings of the Asian Conference on Computer Vision. 2020.
[14] Amirhossein Habibian et al. "Video compression with rate-distortion autoencoders". In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 7033–7042.
[15] Dailan He et al. "ELIC: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, pp. 5718–5727.
[16] Jonathan Heek et al. Flax: A neural network library and ecosystem for JAX. Version 0.4.2. 2020. URL: http://github.com/google/flax.
[17] Zhihao Hu, Guo Lu, and Dong Xu. "FVC: A new framework towards deep video compression in feature space". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 1502–1511.
[18] Diederik P. Kingma and Max Welling. "Auto-encoding variational Bayes". In: arXiv preprint arXiv:1312.6114 (2013).
[19] Jiahao Li, Bin Li, and Yan Lu. "Deep contextual video compression". In: Advances in Neural Information Processing Systems 34 (2021).
[20] Bowen Liu et al. "Deep learning in latent space for video prediction and compression". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 701–710.
[21] Jerry Liu et al. "Conditional entropy coding for efficient video compression". In: European Conference on Computer Vision. Springer. 2020, pp. 453–468.
[22] Ze Liu et al. "Swin Transformer: Hierarchical vision transformer using shifted windows". In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 10012–10022.
[23] Guo Lu et al. "DVC: An end-to-end deep video compression framework". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 11006–11015.
[24] Fabian Mentzer et al. "Neural Video Compression using GANs for Detail Synthesis and Propagation". In: arXiv preprint arXiv:2107.12038 (2021).
[25] Alexandre Mercat, Marko Viitanen, and Jarno Vanne. "UVG dataset: 50/120fps 4K sequences for video codec analysis and development". In: Proceedings of the 11th ACM Multimedia Systems Conference. 2020, pp. 297–302.
[26] David Minnen, Johannes Ballé, and George D. Toderici. "Joint autoregressive and hierarchical priors for learned image compression". In: Advances in Neural Information Processing Systems. 2018, pp. 10771–10780.
[27] David Minnen and Saurabh Singh. "Channel-wise autoregressive entropy models for learned image compression". In: arXiv preprint arXiv:2007.08739 (2020).
[28] Daniel Neimark et al. "Video transformer network". In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 3163–3172.
[29] Yichen Qian et al. "Entroformer: A Transformer-based Entropy Model for Learned Image Compression". In: arXiv preprint arXiv:2202.05492 (2022).
[30] Oren Rippel et al. "ELF-VC: Efficient learned flexible-rate video coding". In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 14479–14488.
[31] Gary J. Sullivan et al. "Overview of the high efficiency video coding (HEVC) standard". In: IEEE Transactions on Circuits and Systems for Video Technology 22.12 (2012), pp. 1649–1668.
[32] Chen Sun et al. "VideoBERT: A joint model for video and language representation learning". In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 7464–7473.
[33] Lucas Theis et al. "Lossy image compression with compressive autoencoders". In: International Conference on Learning Representations (ICLR). 2017.
[34] George Toderici et al. CLIC 2020: Challenge on Learned Image Compression. http://compression.cc. 2020.
[35] Ashish Vaswani et al. "Attention is all you need". In: Advances in Neural Information Processing Systems 30 (2017).
[36] Haiqiang Wang et al. "MCL-JCV: a JND-based H.264/AVC video quality assessment dataset". In: 2016 IEEE International Conference on Image Processing (ICIP). IEEE. 2016, pp. 1509–1513.
[37] Huiyu Wang et al. "MaX-DeepLab: End-to-end panoptic segmentation with mask transformers". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 5463–5474.
[38] Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. "Multiscale structural similarity for image quality assessment". In: The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003. Vol. 2. IEEE. 2003, pp. 1398–1402.
[39] Chao-Yuan Wu, Nayan Singhal, and Philipp Krahenbuhl. "Video compression through image interpolation". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 416–431.
[40] Ren Yang, Luc Van Gool, and Radu Timofte. "Perceptual Learned Video Compression with Recurrent Conditional GAN". In: arXiv preprint arXiv:2109.03082 (2021).
[41] Ren Yang et al. "Learning for Video Compression with Recurrent Auto-Encoder and Recurrent Probability Model". In: IEEE Journal of Selected Topics in Signal Processing 15.2 (2021), pp. 388–401.
[42] Ruihan Yang et al. "Hierarchical autoregressive modeling for neural video compression". In: arXiv preprint arXiv:2010.10258 (2020).
[43] Y. Yang, S. Mandt, and L. Theis. An Introduction to Neural Data Compression. Preprint. 2022. URL: https://arxiv.org/abs/2202.06533.
[44] Richard Zhang. "Making convolutional networks shift-invariant again". In: International Conference on Machine Learning. PMLR. 2019, pp. 7324–7334.
[45] Sixiao Zheng et al. "Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 6881–6890.
[46] Yinhao Zhu, Yang Yang, and Taco Cohen. "Transformer-based Transform Coding". In: International Conference on Learning Representations. 2021.

## NeurIPS Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Sec. 6.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Societal Impact in Sec. 6.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [N/A]
   (b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] We cannot release training data but will release code if the paper is published.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] However, we find that in most experiments, multiple runs end at similar final losses.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] We specify the training platform and training times in Sec. 3.4, as well as how many models we train.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] See Sec. 4.
   (b) Did you mention the license of the assets? [Yes] See Sec. 4.
   (c) Did you include any new assets either in the supplementary material or as a URL? [Yes] We will release a GitHub URL to our code upon publication.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A] We don't release new data.
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] We don't release new data.
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] No crowdsourcing or human subjects.
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] No crowdsourcing or human subjects.
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A] No crowdsourcing or human subjects.