# Deep Contextual Video Compression

Jiahao Li, Bin Li, Yan Lu
Microsoft Research Asia
{li.jiahao, libin, yanlu}@microsoft.com

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

**Abstract.** Most existing neural video compression methods adopt the predictive coding framework, which first generates the predicted frame and then encodes its residue with respect to the current frame. However, in terms of compression ratio, predictive coding is only a sub-optimal solution, as it uses a simple subtraction operation to remove the redundancy across frames. In this paper, we propose a deep contextual video compression framework to enable a paradigm shift from predictive coding to conditional coding. In particular, we try to answer the following questions: how to define, use, and learn the condition under a deep video compression framework. To tap the potential of conditional coding, we propose using the feature-domain context as the condition. This enables us to leverage the high-dimension context to carry rich information to both the encoder and the decoder, which helps reconstruct the high-frequency contents for higher video quality. Our framework is also extensible, in that the condition can be flexibly designed. Experiments show that our method significantly outperforms the previous state-of-the-art (SOTA) deep video compression methods. When compared with x265 using the veryslow preset, we achieve 26.0% bitrate saving on 1080p standard test videos. The code is available at https://github.com/DeepMC-DCVC/DCVC.

## 1 Introduction

From H.261 [1], developed in 1988, to the just-released H.266 [2] in 2020, all traditional video coding standards are based on the predictive coding paradigm, where the predicted frame is first generated by handcrafted modules and then the residue between the current frame and the predicted frame is encoded and decoded. Recently, many deep learning (DL)-based video compression methods [3-11] also adopt the predictive coding framework to encode the residue, where the handcrafted modules are merely replaced by neural networks.

Encoding the residue is a simple yet efficient manner for video compression, considering the strong temporal correlations among frames. However, residue coding is not optimal for encoding the current frame $x_t$ given the predicted frame $\tilde{x}_t$, because it only uses the handcrafted subtraction operation to remove the redundancy across frames. The entropy of residue coding is greater than or equal to that of conditional coding [12]:
$$H(x_t - \tilde{x}_t) \ge H(x_t \mid \tilde{x}_t),$$
where $H$ represents the Shannon entropy (a toy example of when this inequality is strict is given below). Theoretically, one pixel in frame $x_t$ correlates with all the pixels in the previously decoded frames and the pixels already decoded in $x_t$. For a traditional video codec, it is impossible to use handcrafted rules to explicitly explore this correlation by taking all of them into consideration, due to the huge space. Thus, residue coding is widely adopted as an extremely simplified special case of conditional coding, under the very strong assumption that the current pixel only correlates with the predicted pixel.

DL opens the door to automatically exploring correlations in a huge space. Considering the success of DL in image compression [13, 14], which uses an autoencoder to explore the correlation within an image, why not use a network to build a conditional coding-based autoencoder to explore the correlation in video, rather than restricting our vision to residue coding?
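As a small aside (our own toy example, not taken from the paper), the inequality above can be strict whenever the dependence between $x_t$ and $\tilde{x}_t$ is not purely additive:

```latex
% Toy example (illustrative, not from the paper): residue coding can be strictly
% worse than conditional coding. Let the predicted pixel \tilde{x}_t be uniform
% on {0, 1}, and let the true pixel be a deterministic but non-additive function
% of it, x_t = 2\tilde{x}_t. Then
\begin{align*}
  H(x_t \mid \tilde{x}_t) &= 0, \\
  H(x_t - \tilde{x}_t)    &= H(\tilde{x}_t) = 1 \text{ bit},
\end{align*}
% so a conditional coder needs (in principle) zero bits for x_t, while a residue
% coder still has a full bit of residue to encode.
```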
Figure 1: Paradigm shift from the commonly-used residue coding-based framework to our conditional coding-based framework. $x_t$ is the current frame. $\hat{x}_t$ and $\hat{x}_{t-1}$ are the current and previous decoded frames. $\tilde{x}_t$ is the predicted frame in the RGB domain; $\bar{x}_t$ is the context in the feature domain, produced by the context generation module. The orange dashed line means that the context is also used for entropy modeling.

When we design the conditional coding-based solution, a series of questions naturally comes up: what is the condition? How to use the condition? And how to learn the condition? Technically speaking, the condition can be anything that may be helpful for compressing the current frame. The predicted frame can be used as the condition, but there is no need to restrict it to be the only representation of the condition. Thus, we define the condition as learnable contextual features with arbitrary dimensions. Along this idea, we propose a deep contextual video compression (DCVC) framework to utilize the condition in a unified, simple, yet efficient approach. The diagram of our DCVC framework is shown in Fig. 1. The contextual information is used as part of the input of the contextual encoder, the contextual decoder, as well as the entropy model. In particular, benefiting from the temporal prior provided by the context, the entropy model itself is temporally adaptive, resulting in a richer and more accurate model. As for how to learn the condition, we propose using motion estimation and motion compensation (MEMC) in the feature domain. The MEMC can guide the model on where to extract useful context.

Experimental results demonstrate the effectiveness of the proposed DCVC. For 1080p standard test videos, our DCVC achieves 26.0% bitrate saving over x265 using the veryslow preset, and 16.4% bitrate saving over the previous SOTA DL-based model DVCPro [4].

Actually, the concept of conditional coding has appeared in [15, 16, 12, 17]. However, these works either design conditional coding only for a partial module (e.g., only the entropy model or only the encoder) or need handcrafted operations to decide which content should be conditionally coded. By contrast, our framework is a more comprehensive solution which considers all of encoding, decoding, and entropy modeling. In addition, the proposed DCVC is an extensible conditional coding-based framework, where the condition can be flexibly designed. Although this paper proposes using feature-domain MEMC to generate the contextual features and demonstrates its effectiveness, we still think that how to define, use, and learn the condition remains an open question worth further investigation for a higher compression ratio.

Our main contributions are fourfold:

- We design a deep contextual video compression framework based on conditional coding. The definition, usage, and learning manner of the condition are all innovative. Our method achieves a higher compression ratio than previous residue coding-based methods.
- We propose a simple yet efficient approach that uses the context to help the encoding, the decoding, as well as the entropy modeling. For entropy modeling, we design a model which utilizes the spatial-temporal correlation for a higher compression ratio, or only the temporal correlation for faster speed.
- We define the condition as the context in the feature domain. The context with higher dimensions can provide richer information to help reconstruct the high-frequency contents.
- Our framework is extensible. There exists great potential in boosting the compression ratio by better defining, using, and learning the condition.
## 2 Related works

**Deep image compression.** Recently, many works have studied deep image compression. For example, the compressive autoencoder [18] obtains results comparable with JPEG 2000. Subsequently, many works boost the performance with more advanced entropy models and network structures. For example, Ballé et al. proposed the factorized [19] and hyper prior [13] entropy models. The method based on the hyper prior catches up with H.265 intra coding. The entropy model jointly utilizing the hyper prior and the auto-regressive context outperforms H.265 intra coding. The method with the Gaussian mixture model [20] is comparable with H.266 intra coding. For the network structure, some RNN (recurrent neural network)-based methods [21-23] were proposed in the early development stage, but most recent methods are based on CNNs (convolutional neural networks).

**Deep video compression.** Existing works on deep video compression can be classified into two categories, i.e., non-delay-constrained and delay-constrained. For the first category, there is no restriction on the reference frame location, which means that the reference frame can come from the future. For example, Wu et al. [10] proposed interpolating the predicted frame from previous and future frames, and then the frame residue is encoded. Djelouah et al. [8] also followed this coding structure and introduced an optical flow estimation network to obtain a better predicted frame. Yang et al. [6] designed a recurrent enhancement module for this coding structure. In addition, a 3D autoencoder was proposed to encode a group of pictures in [24, 25]. This is a natural extension of deep image compression obtained by increasing the dimension of the input. It is noted that this coding manner brings a larger delay, and the GPU memory cost is significantly increased.

For the delay-constrained methods, the reference frames only come from previous frames. For example, Lu et al. [3] designed the DVC model, where all modules in the traditional hybrid video codec are replaced by networks. Then the improved model DVCPro, which adopts the more advanced entropy model from [14] and a deeper network, was proposed in [4]. Following a similar framework to DVC, Agustsson et al. designed a more advanced optical flow estimation in scale space. Hu et al. [26] considered the rate-distortion optimization when encoding the motion vector (MV). In [7], the single reference frame is extended to multiple reference frames. Recently, Yang et al. [6] proposed an RNN-based MV/residue encoder and decoder. In [11], the residue is adaptively scaled by a learned parameter.

Our research belongs to the delay-constrained category, as it can be applied in more scenarios, e.g., real-time communication. Different from the above works, we design a conditional coding-based framework rather than following the commonly-used residue coding. Other video tasks show that utilizing temporal information as a condition is helpful [27, 28]. For video compression, the recent works in [15], [16], and [12, 17] have made some investigations into conditional coding. In [15], conditional coding is only designed for entropy modeling. However, due to the lack of MEMC, the compression ratio is not high, and the method in [15] cannot outperform DVC in terms of PSNR. By contrast, our conditional coding, designed for encoding, decoding, and entropy modeling, can significantly outperform DVCPro. In [16], only the encoder adopts conditional coding; the decoder still adopts residue coding. As a latent state is used, the framework in [16] is difficult to train [7].
By contrast, we use explicit MEMC to guide the context learning, which is easier to train. In [12, 17], the video contents need to be explicitly classified into skip and non-skip modes, where only the contents in the non-skip mode use conditional coding. By contrast, our method does not need a handcrafted operation to decompose the video. In addition, the condition in DCVC is the context in the feature domain, which has a much larger capacity. In summary, when compared with [15], [16], and [12, 17], the definition, usage, and learning manner of the condition in DCVC are all innovative.

## 3 Proposed method

In this section, we present the details of the proposed DCVC. We first describe the whole framework of DCVC. Then we introduce the entropy model for compressing the latent codes, followed by the approach for learning the context. At last, we provide the details about training.

### 3.1 The framework of DCVC

In traditional video codecs, inter frame coding adopts residue coding, formulated as:
$$\hat{x}_t = f_{dec}\big(\lfloor f_{enc}(x_t - \tilde{x}_t) \rceil\big) + \tilde{x}_t \quad \text{with} \quad \tilde{x}_t = f_{predict}(\hat{x}_{t-1}). \tag{1}$$

Figure 2: The framework of our DCVC. $x_t$ is the current frame. $\hat{x}_t$ and $\hat{x}_{t-1}$ are the current and previous decoded frames. The previous decoded frame passes through the feature extractor, motion estimation, and context refinement modules to produce the context $\bar{x}_t$, which is fed to the contextual encoder and the contextual decoder.

For simplification, we use a single reference frame in the formulation. $f_{enc}(\cdot)$ and $f_{dec}(\cdot)$ are the residue encoder and decoder. $\lfloor \cdot \rceil$ is the quantization operation. $f_{predict}(\cdot)$ represents the function for generating the predicted frame $\tilde{x}_t$. In traditional video codecs, $f_{predict}(\cdot)$ is implemented in the manner of MEMC, which uses handcrafted coding tools to search for the best MV and then interpolates the predicted frame. For most existing DL-based video codecs [3-9], $f_{predict}(\cdot)$ is an MEMC module totally composed of neural networks.

In this paper, we do not adopt the commonly-used residue coding but try to design a conditional coding-based framework for a higher compression ratio. Actually, one straightforward conditional coding manner is directly using the predicted frame $\tilde{x}_t$ as the condition:
$$\hat{x}_t = f_{dec}\big(\lfloor f_{enc}(x_t \mid \tilde{x}_t) \rceil \;\big|\; \tilde{x}_t\big) \quad \text{with} \quad \tilde{x}_t = f_{predict}(\hat{x}_{t-1}). \tag{2}$$
However, the condition is still restricted to the pixel domain with low channel dimensions, which limits the model capacity. Now that conditional coding is used, why not let the model learn the condition by itself? Thus, this paper proposes a contextual video compression framework, where we use a network to generate the context rather than the predicted frame. Our framework can be formulated as:
$$\hat{x}_t = f_{dec}\big(\lfloor f_{enc}(x_t \mid \bar{x}_t) \rceil \;\big|\; \bar{x}_t\big) \quad \text{with} \quad \bar{x}_t = f_{context}(\hat{x}_{t-1}). \tag{3}$$
$f_{context}(\cdot)$ represents the function for generating the context $\bar{x}_t$. Here $f_{enc}(\cdot)$ and $f_{dec}(\cdot)$ are the contextual encoder and decoder, which are different from the residue encoder and decoder.
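To make Eq. (3) concrete, here is a minimal PyTorch-style sketch of how conditioning on a feature-domain context can be realized by feeding the context to both the encoder and the decoder. The module structure, channel counts, and layer choices are our own illustrative assumptions, not DCVC's exact architecture.

```python
import torch
import torch.nn as nn

class ContextualEncoder(nn.Module):
    """f_enc(x_t | x̄_t): encode the current frame conditioned on the context."""
    def __init__(self, ctx_ch=64, latent_ch=96):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + ctx_ch, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(64, latent_ch, 3, stride=2, padding=1),
        )

    def forward(self, x_t, context):
        # Conditioning is realized by channel-wise concatenation: the network learns
        # how to exploit the context instead of applying a fixed subtraction.
        return self.net(torch.cat([x_t, context], dim=1))


class ContextualDecoder(nn.Module):
    """f_dec(ŷ_t | x̄_t): reconstruct the frame conditioned on the same context."""
    def __init__(self, ctx_ch=64, latent_ch=96):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1),
        )
        self.refine = nn.Sequential(
            nn.Conv2d(64 + ctx_ch, 64, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, y_hat, context):
        feat = self.up(y_hat)  # back to the input resolution
        return self.refine(torch.cat([feat, context], dim=1))


def quantize(y):
    # Rounding with a straight-through gradient, a common training-time surrogate.
    return y + (torch.round(y) - y).detach()


# One coding step following Eq. (3): x̂_t = f_dec(⌊f_enc(x_t | x̄_t)⌉ | x̄_t).
enc, dec = ContextualEncoder(), ContextualDecoder()
x_t = torch.rand(1, 3, 64, 64)
context = torch.rand(1, 64, 64, 64)  # x̄_t = f_context(x̂_{t-1}), assumed already computed
x_hat = dec(quantize(enc(x_t, context)), context)
```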
Our DCVC framework is illustrated in Fig. 2. To provide a richer and more correlated condition for encoding $x_t$, the context is in the feature domain with higher dimensions. In addition, due to the large capacity of the context, different channels therein have the freedom to extract different kinds of information. Here we give an analysis example in Fig. 3. In the figure, the upper right part shows four channel examples in the context. Looking into the four channels, we can find that different channels have different focuses. For example, the third channel seems to put a lot of emphasis on the high-frequency contents when compared with the visualization of high frequency in $x_t$. By contrast, the second and fourth channels look like they extract color information, where the second channel focuses on the green color and the fourth channel emphasizes the red color. Benefiting from these various contextual features, our DCVC can achieve better reconstruction quality, especially for complex textures with lots of high frequencies. The bottom right image in Fig. 3 shows the reconstruction error reduction of DCVC when compared with the residue coding-based framework. From this comparison, DCVC achieves a non-trivial error decrease in the high-frequency regions in both the background and the foreground, which are hard to compress for many video codecs.

Figure 3: Visual examples from video SRC05 in the MCL-JCV [29] dataset, showing the input frame $x_t$, the previous decoded frame $\hat{x}_{t-1}$, the motion vector $m_t$, the high frequency in $x_t$, channel examples in the context $\bar{x}_t$, and the reduction of reconstruction error (with high-frequency regions in the background and foreground and new-content regions marked). The high frequency in $x_t$ is decomposed by the discrete cosine transform. It shows that DCVC improves the reconstruction of high-frequency contents in both the background with small motion and the foreground with large motion. In addition, DCVC is good at encoding the new-content regions caused by motion, where the reconstruction error is significantly decreased compared with the residue coding-based framework DVCPro [4]. The BPP (bits per pixel) of DCVC (0.0306) is smaller than that of DVCPro (0.0359).

As shown in Fig. 2, the encoding and decoding of the current frame are both conditioned on the context $\bar{x}_t$. Through the contextual encoder, $x_t$ is encoded into the latent codes $y_t$, which are then quantized to $\hat{y}_t$ via the rounding operation. Via the contextual decoder, the reconstructed frame $\hat{x}_t$ is finally obtained. In our design, we use the network to automatically learn the correlation between $x_t$ and $\bar{x}_t$ and then remove the redundancy, rather than using the fixed subtraction operation of residue coding.

From another perspective, our method also has the ability to adaptively use the context. Due to the motion in video, new contents often appear in the object boundary regions. These new contents probably cannot find a good reference in the previous decoded frame. In this situation, a DL-based video codec with frame residue coding is still forced to encode the residue. For the new contents, the residue can be very large, and inter coding via the subtraction operation may be worse than intra coding. By contrast, our conditional coding framework has the capacity to adaptively utilize the condition. For the new contents, the model can adaptively tend to learn intra coding. As shown in the reconstruction error reduction in Fig. 3, the reconstruction error of new contents can be significantly reduced. In addition, this paper not only proposes using the context $\bar{x}_t$ to generate the latent codes, but also proposes utilizing it to build the entropy model. More details are introduced in the next subsection.

### 3.2 Entropy model

According to [30], the cross-entropy between the estimated probability distribution and the actual latent code distribution is a tight lower bound of the actual bitrate, namely
$$R(\hat{y}_t) \ge \mathbb{E}_{\hat{y}_t \sim q_{\hat{y}_t}}\big[-\log_2 p_{\hat{y}_t}(\hat{y}_t)\big], \tag{4}$$
where $p_{\hat{y}_t}(\hat{y}_t)$ and $q_{\hat{y}_t}(\hat{y}_t)$ are the estimated and true probability mass functions of the quantized latent codes $\hat{y}_t$, respectively. In practice, arithmetic coding can encode the latent codes at a bitrate very close to this cross-entropy; the gap between the actual bitrate $R(\hat{y}_t)$ and the cross-entropy is negligible.
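As a small illustration of Eq. (4), the following PyTorch-style sketch estimates the bit cost of quantized latents under the Laplace-plus-uniform-quantization-noise model adopted in the following paragraphs. The tensor shapes and the helper name `estimated_bits` are our own illustrative assumptions, not part of the paper.

```python
import torch

def estimated_bits(y_hat, mu, scale, eps=1e-9):
    """Cross-entropy bit estimate for quantized latents ŷ_t, as in Eq. (4).

    Each latent element is modeled as a Laplace distribution (mean mu, scale b)
    convolved with a unit uniform, so the probability mass of an integer bin is
    the CDF difference over [ŷ - 0.5, ŷ + 0.5].
    """
    laplace = torch.distributions.Laplace(mu, scale)
    prob = laplace.cdf(y_hat + 0.5) - laplace.cdf(y_hat - 0.5)
    return -torch.log2(prob.clamp(min=eps)).sum()

# Toy usage: the better the predicted (mu, scale), the fewer estimated bits,
# which is (up to a negligible gap) what the arithmetic coder actually spends.
y_hat = torch.round(torch.randn(1, 96, 4, 4) * 2)
mu = torch.zeros_like(y_hat)
scale = torch.ones_like(y_hat)
print("estimated bits:", estimated_bits(y_hat, mu, scale).item())
```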
So our target is to design an entropy model which can accurately estimate the probability distribution $p_{\hat{y}_t}(\hat{y}_t)$ of the latent codes. The framework of our entropy model is illustrated in Fig. 4. First, we use the hyper prior model [13] to learn the hierarchical prior and use the auto-regressive network [14] to learn the spatial prior. These two priors (the hierarchical prior and the spatial prior) are commonly used in deep image compression. However, for video, the latent codes also have temporal correlation. Thus, we propose using the context $\bar{x}_t$ to generate the temporal prior. As shown in Fig. 4, we design a temporal prior encoder to explore the temporal correlation. The prior fusion network then learns to fuse the three different priors and estimates the mean and scale of the latent code distribution.

Figure 4: Our entropy model used to encode the quantized latent codes $\hat{y}_t$. The hierarchical prior (side information), the spatial prior from the auto-regressive network, and the temporal prior generated from the context $\bar{x}_t$ by the temporal prior encoder are combined by the prior fusion network. HPE and HPD are the hyper prior encoder and decoder. Q means quantization. AE and AD are the arithmetic encoder and decoder.

In this paper, we follow the existing work [31] and assume that $p_{\hat{y}_t}(\hat{y}_t)$ follows the Laplace distribution. The formulation of $p_{\hat{y}_t}(\hat{y}_t)$ is:
$$p_{\hat{y}_t}(\hat{y}_t \mid \bar{x}_t, \hat{z}_t) = \prod_i \Big( \mathcal{L}\big(\mu_{t,i}, \sigma^2_{t,i}\big) * \mathcal{U}\big(-\tfrac{1}{2}, \tfrac{1}{2}\big) \Big)(\hat{y}_{t,i}) \quad \text{with} \quad \mu_{t,i}, \sigma_{t,i} = f_{pf}\big(f_{hpd}(\hat{z}_t), f_{ar}(\hat{y}_{t,<i}), f_{tpe}(\bar{x}_t)\big),$$
where $\hat{z}_t$ is the quantized hyper prior, and $f_{pf}$, $f_{hpd}$, $f_{ar}$, and $f_{tpe}$ denote the prior fusion network, the hyper prior decoder, the auto-regressive network, and the temporal prior encoder, respectively.
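To ground the three-prior fusion in something concrete, here is a minimal PyTorch-style sketch. The convolutional stand-ins for the temporal prior encoder and the prior fusion network, the channel counts, and the softplus used to keep the scale positive are all our own illustrative assumptions, not the paper's exact layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalPriorEncoder(nn.Module):
    """Maps the full-resolution context x̄_t down to the latent resolution (f_tpe)."""
    def __init__(self, ctx_ch=64, out_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ctx_ch, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(64, out_ch, 3, stride=2, padding=1),
        )

    def forward(self, context):
        return self.net(context)


class PriorFusion(nn.Module):
    """Fuses the hierarchical, spatial, and temporal priors into Laplace mean/scale (f_pf)."""
    def __init__(self, prior_ch=64, latent_ch=96):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * prior_ch, 128, 1), nn.LeakyReLU(0.1),
            nn.Conv2d(128, 128, 1), nn.LeakyReLU(0.1),
            nn.Conv2d(128, 2 * latent_ch, 1),
        )

    def forward(self, hierarchical, spatial, temporal):
        mu, scale = self.net(torch.cat([hierarchical, spatial, temporal], dim=1)).chunk(2, dim=1)
        return mu, F.softplus(scale) + 1e-6  # keep the scale strictly positive


# Toy usage: fuse the three priors for a 4x4 latent grid with 96 channels.
tpe, fusion = TemporalPriorEncoder(), PriorFusion()
context = torch.rand(1, 64, 64, 64)     # x̄_t at frame resolution
hierarchical = torch.rand(1, 64, 4, 4)  # from the hyper prior decoder f_hpd(ẑ_t)
spatial = torch.rand(1, 64, 4, 4)       # from the auto-regressive network f_ar(ŷ_{t,<i})
mu, scale = fusion(hierarchical, spatial, tpe(context))
# (mu, scale) parameterize the per-element Laplace model; bits can then be
# estimated with a CDF difference as in the earlier estimated_bits sketch.
```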