# Deep Hierarchical Video Compression

Ming Lu1,2, Zhihao Duan3, Fengqing Zhu3, and Zhan Ma1*

1School of Electronic Science and Engineering, Nanjing University
2Interdisciplinary Research Center for Future Intelligent Chips (Chip-X), Nanjing University
3Elmore Family School of Electrical and Computer Engineering, Purdue University
minglu@nju.edu.cn, duan90@purdue.edu, zhu0@purdue.edu, mazhan@nju.edu.cn

*Corresponding Author. Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Recently, probabilistic predictive coding, which directly models the conditional distribution of latent features across successive frames for temporal redundancy removal, has yielded promising results. Existing methods using a single-scale Variational Autoencoder (VAE) must devise complex networks for conditional probability estimation in the latent space, neglecting the multiscale characteristics of video frames. Instead, this work proposes hierarchical probabilistic predictive coding, for which hierarchical VAEs are carefully designed to characterize multiscale latent features as a family of flexible priors and posteriors that predict the probabilities of future frames. Under such a hierarchical structure, lightweight networks are sufficient for prediction. The proposed method outperforms representative learned video compression models on common test videos and is computationally friendly, with a much smaller memory footprint and faster encoding/decoding. Extensive experiments on adaptation to temporal patterns also indicate the better generalization of our hierarchical predictive mechanism. Furthermore, our solution is the first to enable progressive decoding, which is favored in networked video applications with packet loss.

## Introduction

Deep learning breathes fresh life into the visual signal (e.g., image and video) compression community, which has been dominated by handcrafted codecs for decades (Wallace 1991; Marcellin et al. 2000; Wiegand et al. 2003; Sullivan et al. 2012; Bross et al. 2021). Instead of manually designing and optimizing individual modules such as transforms, mode selection, and quantization as in traditional codecs, data-driven approaches adopt end-to-end learning of neural networks (Ballé, Laparra, and Simoncelli 2016; Theis et al. 2017). Despite their conceptual simplicity, learned image compression methods have achieved superior rate-distortion performance, surpassing the latest VVC (Versatile Video Coding (Bross et al. 2021)) intra codec (He et al. 2022; Lu et al. 2022).

Figure 1: Interframe coding using (a) hybrid motion & residual coding, (b) single-scale probabilistic predictive coding, and (c) hierarchical probabilistic predictive coding (Ours).

For videos, however, learning-based methods are still not free from the shackles of the traditional hybrid framework. Most existing methods follow the two-stage pipeline shown in Fig. 1a: code motion flows first, and then the residual between the current and motion-warped frame, in either an explicit (Lu et al. 2019) or conditional (Li, Li, and Lu 2021) manner. This framework is usually cumbersome in design (for example, separate models are required for intraframe coding, inter residual coding, motion coding, and motion estimation); thus, extensive hyperparameter tuning is necessary. Furthermore, inaccurate motion-induced warping errors propagate inevitably across temporal frames, gradually degrading the quality of reconstructed frames over time.
As a promising solution to the problems mentioned above, (latent-space) probabilistic predictive coding attempts to reduce temporal redundancy by conditionally predicting future frames in a one-shot manner. Intuitively, if the current frame can be well predicted from past frames, motion (e.g., flow) estimation and compensation can be completely exempted, and the aforementioned error propagation is eliminated accordingly. Recently, Mentzer et al. (2022) proposed a probabilistic predictive video coding framework named the Video Compression Transformer (VCT). Built on the VAE-based image compression framework, VCT models the latent features of the current frame conditioned on the previous-frame latent features using a transformer-based temporal entropy model. Although VCT outperforms many previous video coding methods, its conditional prediction of single-scale latent features, at 1/16 the resolution of the original frame as in Fig. 1b, fundamentally constrains its characterization capacity and ignores the multiscale characteristics of video frames.

This paper proposes hierarchical probabilistic predictive coding, termed DHVC, in which the conditional probabilities of multiscale latent features of future frames are effectively modeled by deliberately designed, powerful hierarchical VAEs. The latent distribution at a given scale of the current frame is predicted from the prior features of the previous scales in the same frame and from the corresponding scale of the previous frames (a minimal sketch of this mechanism follows at the end of this section). Doing so gives us a powerful and efficient ability to characterize arbitrary feature distributions. For instance, Mentzer et al. (2022) relied on complicated block-level autoregressive prediction, which is inefficient. Instead, we perform a multi-stage conditional probability prediction, achieving better performance at lower complexity.

In extensive evaluations on commonly used video sequences, our method outperforms well-known learned models based on hybrid motion and residual coding, as well as the previous state-of-the-art method based on latent probabilistic predictive coding. Extensive studies on adaptation to various temporal patterns also reveal the generalization ability of our hierarchical predictive mechanism. In addition, our method supports temporally progressive decoding, making it, to the best of our knowledge, the first learned progressive video coding method. It can therefore handle, to some extent, packet losses induced by poor network connections.

Our contributions can be summarized as follows:
- We propose a hierarchical probabilistic prediction model for video coding. Our model employs a collection of multiscale latent variables to represent the coarse-to-fine nature of video frames scale by scale.
- We propose spatial-temporal prediction and in-loop decoding fusion modules, which enable better performance, lower memory consumption, and faster encoding/decoding than the previous best probabilistic predictive coding-based method (Mentzer et al. 2022).
- Experiments demonstrate that our method generalizes better to various temporal patterns. Our model is also the first to support progressive decoding.
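To make the hierarchical prediction mechanism referenced above concrete, the following is a minimal, hypothetical sketch of how the prior for one scale of the current frame could be predicted from coarser-scale context of the same frame together with the same-scale latents of the previous frame. All names here (`SpatioTemporalPrior`, `coarser_ctx`, `prev_latent`) are illustrative assumptions, not the paper's actual implementation, and all scales are assumed to share one channel count.

```python
import torch
import torch.nn as nn

class SpatioTemporalPrior(nn.Module):
    """Hypothetical sketch: predicts p(z_t^l | z_t^{<l}, z_{t-1}^l).

    The prior for scale l of frame t fuses (a) features summarizing the
    coarser scales of the current frame with (b) the decoded latents of
    the previous frame at the same scale, then outputs Gaussian
    parameters for the current-frame latents at scale l.
    """
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=3, padding=1),
        )

    def forward(self, coarser_ctx, prev_latent):
        # coarser_ctx: current-frame features from scales < l, already
        #              upsampled to this scale's resolution
        # prev_latent: decoded z_{t-1} at the same scale l
        h = torch.cat([coarser_ctx, prev_latent], dim=1)
        mean, log_scale = self.fuse(h).chunk(2, dim=1)
        return mean, log_scale.exp()  # Gaussian prior for z_t at scale l
```

Because the prediction is purely feed-forward per scale, such a design avoids the block-level autoregression used by VCT, which is one intuition for the lower complexity claimed above.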
## Related Work

We briefly review end-to-end learned video coding methods, including classical hybrid motion and residual coding and the recently emerged probabilistic predictive coding. We also review the hierarchical VAE formalism, as it provides the theoretical basis for our method.

### Learned Video Coding

**Data compression and variational autoencoders (VAEs):** Let $x$ denote data (e.g., an image or video) with an unknown distribution. Traditional image/video coding belongs to transform coding, where one wants to find an encoder $f_e$, a decoder $f_d$, and an entropy model for the transform coefficients such that the rate-distortion cost is minimized:

$$\min\; H(f_e(x)) + \lambda\, d(x, f_d(f_e(x))). \tag{1}$$

Here, the first term is the (cross-)entropy of the compressed coefficients, approximating the rate; $d$ is a distortion function; and $\lambda$ is the Lagrange multiplier that balances the rate-distortion tradeoff. As studied in (Ballé et al. 2018; Duan et al. 2023), transform coding can be equivalently viewed as data distribution modeling using variational autoencoders (VAEs). Specifically, VAEs assume a model of the data:

$$p(x, z) = p(x \mid z)\, p(z), \tag{2}$$

where $z$ denotes latent variables analogous to transform coefficients. In a VAE, a prior $p(z)$ describes the distribution of the latent variables, a decoder $p(x \mid z)$ maps latent-space elements back to the data space, and an approximate posterior $q(z \mid x)$ (i.e., the encoder) encodes data into the latent space. Letting $\hat{x} \sim p(x \mid z)$ denote the reconstruction, the objective can be written as (Yang and Mandt 2022; Duan et al. 2023)

$$\min\; D_{\mathrm{KL}}\big(q(z \mid x)\,\|\,p(z)\big) + \lambda\, d(x, \hat{x}), \tag{3}$$

and if the posterior $q(z \mid x)$ is deterministic and discrete (e.g., when quantization is applied to $z$), this VAE objective equals the rate-distortion optimization in Eq. (1). This connection has inspired many subsequent works applying VAE-based probabilistic methods to compression tasks, such as (Yang, Bamler, and Mandt 2020b; Agustsson and Theis 2020; Yang, Bamler, and Mandt 2020a; Theis and Ahmed 2022; Ryder et al. 2022; Chen et al. 2022).
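To ground the equivalence between Eq. (3) and rate-distortion training, here is a minimal PyTorch sketch of the objective, assuming the common uniform-noise proxy for quantization as in Ballé et al. (2018). The callables `encoder`, `decoder`, and `prior_likelihood`, as well as the function name `rd_loss`, are hypothetical placeholders rather than any published implementation.

```python
import torch
import torch.nn.functional as F

def rd_loss(x, encoder, decoder, prior_likelihood, lmbda):
    """One evaluation of the rate-distortion objective in Eq. (3).

    encoder:          maps x to continuous latents z
    decoder:          maps (noisy/quantized) latents to a reconstruction
    prior_likelihood: per-element likelihoods of the latents under p(z)
    lmbda:            Lagrange multiplier balancing rate and distortion
    """
    z = encoder(x)
    # Additive uniform noise is the usual differentiable training proxy
    # for rounding-based quantization.
    z_hat = z + torch.empty_like(z).uniform_(-0.5, 0.5)
    x_hat = decoder(z_hat)

    # Rate term: cross-entropy of the latents under the prior, in bits
    # per pixel (assumes x has shape [batch, channels, height, width]).
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    rate_bpp = -torch.log2(prior_likelihood(z_hat)).sum() / num_pixels

    # Distortion term: MSE here, but Eq. (3) admits any d(x, x_hat).
    distortion = F.mse_loss(x_hat, x)
    return rate_bpp + lmbda * distortion
```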
Learned video coding methods can be broadly categorized into two groups: hybrid motion & residual coding and probabilistic predictive coding.

**Hybrid Motion & Residual Coding** refers to the classical coding framework with motion and residual processing. Lu et al. (2019) first proposed using two similar VAE-based networks to code the motion and residuals, respectively, which was then enhanced with better motion alignment in (Lu et al. 2020; Liu et al. 2020a). Hu, Lu, and Xu (2021) then migrated the motion alignment to the feature domain and achieved better compression performance. Recently, by converting residual coding into conditional coding of aligned features, Li, Li, and Lu (2021) took learned video coding to a new level of performance. Subsequently, by further integrating multiscale aligned-feature fusion, post-processing, and bitrate allocation, learned video coding algorithms achieved unprecedented compression efficiency, surpassing the latest VVC (Li, Li, and Lu 2022).

**Probabilistic Predictive Coding** is an emerging video coding approach. Liu et al. (2020b) relied on stacked convolutions for latent distribution prediction, while VCT (Mentzer et al. 2022) adopted a Transformer for the same purpose. Both works perform temporally conditional distribution prediction using only single-scale latent variables (i.e., at 1/16 of the original resolution), which greatly constrains the accuracy of probability estimation and leads to sub-optimal predictive performance. Therefore, in this paper, we propose a hierarchical probabilistic predictive coding method, which substantially improves the accuracy and efficiency of temporal prediction by characterizing multiscale latent features for conditional probability estimation in a coarse-to-fine manner.

### Hierarchical VAEs

To improve the flexibility and expressiveness of the single-scale VAE, hierarchical VAEs (Kingma et al. 2016; Child 2020; Vahdat and Kautz 2020) employ multiscale latent variables, denoted by $Z = \{z_1, \ldots, z_L\}$. Accordingly, the latent prior can be factorized as

$$p(Z) = \prod_{l=1}^{L} p(z_l \mid Z_{<l}),$$

where $Z_{<l} = \{z_1, \ldots, z_{l-1}\}$.
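As an illustration of this factorization (a sketch under stated assumptions, not the paper's architecture), each scale's prior parameters can be predicted from a running summary of the coarser latents, and the per-scale rates accumulate into the total. `HierarchicalPrior` and its internals are hypothetical; all scales are assumed to share one channel count, and the continuous log-density stands in for the true discrete rate.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalPrior(nn.Module):
    """Minimal sketch of the factorized prior p(Z) = prod_l p(z_l | Z_{<l}).

    Each scale predicts Gaussian prior parameters for z_l from an
    upsampled running summary of the coarser latents Z_{<l}.
    """
    def __init__(self, channels, num_scales):
        super().__init__()
        self.prior_nets = nn.ModuleList(
            nn.Conv2d(channels, 2 * channels, kernel_size=3, padding=1)
            for _ in range(num_scales)
        )

    def forward(self, latents):
        """latents: list [z_1, ..., z_L], ordered coarse to fine."""
        total_bits = 0.0
        context = torch.zeros_like(latents[0])  # empty context for z_1
        for z_l, net in zip(latents, self.prior_nets):
            context = F.interpolate(context, size=z_l.shape[-2:], mode="nearest")
            mean, log_scale = net(context).chunk(2, dim=1)
            prior = torch.distributions.Normal(mean, log_scale.exp())
            # Differential-entropy proxy for the rate of z_l, in bits;
            # a real codec integrates the density over quantization bins.
            total_bits = total_bits - prior.log_prob(z_l).sum() / math.log(2.0)
            context = context + z_l  # fold z_l into the context Z_{<l+1}
        return total_bits
```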