# Transformer-Based Transform Coding

Published as a conference paper at ICLR 2022

Yinhao Zhu*, Yang Yang*, Taco Cohen
Qualcomm AI Research
{yinhaoz, yyangy, tacos}@qti.qualcomm.com

*Equal contribution. Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.

## Abstract

Neural data compression based on nonlinear transform coding has made great progress over the last few years, mainly due to improvements in prior models, quantization methods, and nonlinear transforms. A general trend in many recent works pushing the limit of rate-distortion performance is to use ever more expensive prior models that can lead to prohibitively slow decoding. Instead, we focus on more expressive transforms that result in a better rate-distortion-computation trade-off. Specifically, we show that nonlinear transforms built on Swin Transformers can achieve better compression efficiency than transforms built on convolutional neural networks (ConvNets), while requiring fewer parameters and shorter decoding time. Paired with a compute-efficient channel-wise autoregressive model (ChARM) prior, our SwinT-ChARM model outperforms VTM-12.1 by 3.68% in BD-rate on Kodak with comparable decoding speed. In the P-frame video compression setting, we outperform the popular ConvNet-based scale-space-flow model by 12.35% in BD-rate on UVG. We provide model scaling studies to verify the computational efficiency of the proposed solutions, and conduct several analyses to reveal the source of the coding gain of transformers over ConvNets, including better spatial decorrelation, a flexible effective receptive field, and more localized responses of latent pixels during progressive decoding.

## 1 Introduction

Transform coding (Goyal, 2001) is the dominant paradigm for the compression of multimedia signals, and serves as the technical foundation for many successful coding standards such as JPEG, AAC, and HEVC/VVC. Codecs based on transform coding divide the task of lossy compression into three modularized components: transform, quantization, and entropy coding. All three components can be enhanced by deep neural networks: autoencoder networks are adopted as flexible nonlinear transforms, deep generative models are used as powerful learnable entropy models, and various differentiable quantization schemes have been proposed to aid end-to-end training. Thanks to these advancements, we have seen rapid progress in the domain of image and video compression. In particular, the hyperprior line of work (Ballé et al., 2018; Minnen et al., 2018; Lee et al., 2019; Agustsson et al., 2020; Minnen & Singh, 2020) has led to steady improvements in neural compression performance over the past two years, reaching or even surpassing state-of-the-art traditional codecs. For example, in image compression, BPG444 was surpassed by a neural codec in 2018 (Minnen et al., 2018), and several works (Cheng et al., 2020; Xie et al., 2021; Ma et al., 2021; Guo et al., 2021; Wu et al., 2020) have claimed on-par or better performance than VTM (a test model of the state-of-the-art non-learned VVC standard).

One general trend in the advancement of neural image compression schemes is to develop ever more expressive yet expensive prior models based on spatial context. However, the rate-distortion improvement from context-based prior modeling often comes with a hefty price tag¹ in terms of decoding complexity. Notably, all existing works that have claimed on-par or better performance than VTM (Cheng et al., 2020; Xie et al., 2021; Ma et al., 2021; Guo et al., 2021; Wu et al., 2020) rely on slow and expensive spatial-context-based prior models.

¹In the extreme case when a latent-pixel-level spatial autoregressive prior is used, decoding a single 512x768 image requires no less than 1536 interleaved executions of prior model inference and entropy decoding (assuming the latent is downsampled by a factor of 16x16).
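To make the decoding-complexity contrast concrete, the short Python sketch below (illustrative only, not code from the paper) counts prior-model invocations during decoding: a hyperprior predicts the entropy parameters of the whole latent in one pass, whereas a latent-pixel-level spatial autoregressive prior must interleave one prior inference with entropy decoding per latent position, reproducing the 1536 figure from the footnote. The stand-ins `dummy_prior` and `dummy_entropy_decode` are placeholders, not real codec components.

```python
import numpy as np

H, W, DOWN = 512, 768, 16            # image size from footnote 1, latent downsampled 16x16
LH, LW = H // DOWN, W // DOWN        # latent grid: 32 x 48

def dummy_prior(*context):
    """Stand-in for a prior network returning (mu, sigma); the values are irrelevant here."""
    return 0.0, 1.0

def dummy_entropy_decode(mu, sigma):
    """Stand-in for arithmetic decoding of a symbol (or a whole tensor) given its distribution."""
    return mu

def decode_with_hyperprior():
    # One prior pass predicts (mu, sigma) for every latent pixel; all pixels can then
    # be entropy-decoded without further prior-model calls.
    calls = 1
    mu, sigma = dummy_prior()
    y_hat = np.full((LH, LW), dummy_entropy_decode(mu, sigma))
    return y_hat, calls

def decode_with_spatial_ar():
    # Each latent pixel's distribution depends on already-decoded neighbors, so prior
    # inference and entropy decoding must be interleaved, pixel by pixel.
    calls = 0
    y_hat = np.zeros((LH, LW))
    for i in range(LH):
        for j in range(LW):
            mu, sigma = dummy_prior(y_hat, i, j)
            calls += 1
            y_hat[i, j] = dummy_entropy_decode(mu, sigma)
    return y_hat, calls

if __name__ == "__main__":
    _, hyper_calls = decode_with_hyperprior()
    _, ar_calls = decode_with_spatial_ar()
    print("hyperprior prior-model calls:", hyper_calls)   # 1
    print("spatial AR prior-model calls:", ar_calls)      # 1536 = 32 * 48
```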
The development of nonlinear transforms, on the other hand, has been largely overlooked. This leads us to the following questions: can we achieve the same performance as that of expensive prior models by designing a more expressive transform together with simple prior models? And if so, how much more complexity in the transform is required?

[Figure 1: scatter plot of BD-rate over VTM-12.1 (%) on Kodak vs. decoding time (ms) for SwinT-ChARM, Conv-ChARM, SwinT-Hyperprior (different sizes), and Conv-Hyperprior (different sizes), compared against Lee et al. (2019), Guo et al. (2021), Cheng et al. (2020), and Wu et al. (2020).]

Figure 1: BD-rate (smaller is better) vs. decoding time. Our Swin-transformer-based image compression models land in a favorable region of the rate-distortion-computation trade-off that has never been achieved before. See Section 4.2 for more results and the evaluation setup.

Interestingly, we show that by leveraging and adapting the recent development of vision transformers, not only can we build neural codecs with simple prior models that outperform ones built on expensive spatial autoregressive priors, but we can do so with smaller transform complexity than their convolutional counterparts, attaining a strictly better rate-distortion-complexity trade-off. As can be seen in Figure 1, our proposed neural image codec SwinT-ChARM can outperform VTM-12.1 at comparable decoding time, which, to the best of our knowledge, is a first in the neural compression literature.

As main contributions, we 1) extend Swin Transformer (Liu et al., 2021) to a decoder setting and build Swin-transformer-based neural image codecs that attain better rate-distortion performance with lower complexity compared with existing solutions, 2) verify their effectiveness in video compression by enhancing scale-space-flow, a popular neural P-frame codec, and 3) conduct extensive analyses and an ablation study to explore differences between convolutions and transformers, and investigate potential sources of coding gain.

## 2 Background & Related Work

**Conv-Hyperprior** The seminal hyperprior architecture (Ballé et al., 2018; Minnen et al., 2018) is a two-level hierarchical variational autoencoder, consisting of a pair of encoder/decoder $g_a, g_s$ and a pair of hyper-encoder/hyper-decoder $h_a, h_s$. Given an input image $x$, a latent $y = g_a(x)$ and a hyper-latent $z = h_a(y)$ are computed. The quantized hyper-latent $\hat{z} = Q(z)$ is modeled and entropy-coded with a learned factorized prior. The latent $y$ is modeled with a factorized Gaussian distribution $p(y \mid \hat{z}) = \mathcal{N}(\mu, \mathrm{diag}(\sigma))$ whose parameters are given by the hyper-decoder, $(\mu, \sigma) = h_s(\hat{z})$. The quantized latent $\hat{y} = Q(y - \mu) + \mu$ is then entropy-coded and passed through the decoder $g_s$ to obtain the reconstructed image $\hat{x} = g_s(\hat{y})$. The transforms $g_a, g_s, h_a, h_s$ are all parameterized as ConvNets (for details, see Appendix A.1).

**Conv-ChARM** (Minnen & Singh, 2020) extends the baseline hyperprior architecture with a channel-wise auto-regressive model (ChARM)², in which the latent $y$ is split along the channel dimension into $S$ groups (denoted $y_1, \dots, y_S$), and the Gaussian prior $p(y_s \mid \hat{z}, \hat{y}_{<s})$ for each group is conditioned on the hyper-latent as well as the previously decoded groups.
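As a reference for the data flow described above, here is a minimal PyTorch-style sketch of the hyperprior forward pass. The transforms below are shallow placeholder ConvNets (not the architectures from Appendix A.1), hard rounding stands in for quantization, and entropy coding is omitted; the sketch only mirrors $y = g_a(x)$, $z = h_a(y)$, $(\mu, \sigma) = h_s(\hat{z})$, $\hat{y} = Q(y - \mu) + \mu$, and $\hat{x} = g_s(\hat{y})$.

```python
import torch
import torch.nn as nn

class HyperpriorSketch(nn.Module):
    """Shallow stand-in for the Conv-Hyperprior data flow (not the paper's network)."""

    def __init__(self, C: int = 192):
        super().__init__()
        # g_a: image -> latent (16x downsampling); g_s: latent -> image.
        self.g_a = nn.Sequential(
            nn.Conv2d(3, C, 5, stride=4, padding=2), nn.ReLU(),
            nn.Conv2d(C, C, 5, stride=4, padding=2))
        self.g_s = nn.Sequential(
            nn.ConvTranspose2d(C, C, 5, stride=4, padding=2, output_padding=3), nn.ReLU(),
            nn.ConvTranspose2d(C, 3, 5, stride=4, padding=2, output_padding=3))
        # h_a: latent -> hyper-latent; h_s: hyper-latent -> (mu, sigma) of p(y | z_hat).
        self.h_a = nn.Conv2d(C, C, 5, stride=4, padding=2)
        self.h_s = nn.ConvTranspose2d(C, 2 * C, 5, stride=4, padding=2, output_padding=3)

    @staticmethod
    def quantize(t: torch.Tensor) -> torch.Tensor:
        # Hard rounding as a stand-in for Q(.); training-time relaxations are omitted.
        return torch.round(t)

    def forward(self, x: torch.Tensor):
        y = self.g_a(x)                               # latent
        z = self.h_a(y)                               # hyper-latent
        z_hat = self.quantize(z)                      # entropy-coded with a factorized prior
        mu, sigma = self.h_s(z_hat).chunk(2, dim=1)   # Gaussian parameters from hyper-decoder
        y_hat = self.quantize(y - mu) + mu            # mean-subtracted quantization
        x_hat = self.g_s(y_hat)                       # reconstruction
        return x_hat, y_hat, z_hat, mu, sigma

# Usage: a 256x256 input maps to a 16x16 latent and a 4x4 hyper-latent.
x = torch.rand(1, 3, 256, 256)
x_hat, y_hat, z_hat, mu, sigma = HyperpriorSketch()(x)
print(x_hat.shape, y_hat.shape, z_hat.shape)   # (1,3,256,256) (1,192,16,16) (1,192,4,4)
```

In a full codec, $\sigma$ would additionally be constrained to be positive (e.g., via an exponential or softplus) and a range coder would replace the rounding, but the tensor shapes and conditioning structure are as above.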
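The channel-wise conditioning can likewise be sketched, under the assumption (following Minnen & Singh, 2020) that the entropy parameters of group $y_s$ are predicted from the hyper-decoder output together with the already-decoded groups $\hat{y}_{<s}$. The small per-group conditioning convolutions and the zero-padding of the context are simplifications for illustration, not the paper's ChARM blocks.

```python
import torch
import torch.nn as nn

class CharmPriorSketch(nn.Module):
    """Illustrative channel-wise autoregressive prior: S channel groups, processed in order."""

    def __init__(self, C: int = 192, S: int = 8):
        super().__init__()
        assert C % S == 0
        self.S, self.gc = S, C // S
        # One small conditioning net per group; input = hyper features (C channels) plus
        # decoded groups zero-padded to C channels; output = (mu_s, sigma_s) for that group.
        self.cond = nn.ModuleList(
            nn.Conv2d(2 * C, 2 * self.gc, 3, padding=1) for _ in range(S))

    def forward(self, y: torch.Tensor, hyper_feat: torch.Tensor) -> torch.Tensor:
        # y: (B, C, H, W) latent; hyper_feat: (B, C, H, W) features from the hyper-decoder.
        B, C, H, W = y.shape
        groups = y.chunk(self.S, dim=1)
        decoded = []
        for s in range(self.S):
            prev = torch.cat(decoded, dim=1) if decoded else y.new_zeros(B, 0, H, W)
            pad = y.new_zeros(B, C - prev.shape[1], H, W)          # placeholder for not-yet-decoded groups
            ctx = torch.cat([hyper_feat, prev, pad], dim=1)        # condition on (z_hat, y_hat_{<s})
            mu_s, sigma_s = self.cond[s](ctx).chunk(2, dim=1)
            decoded.append(torch.round(groups[s] - mu_s) + mu_s)   # quantize group s
        return torch.cat(decoded, dim=1)

# Usage: group-by-group processing of a 16x16 latent with S = 8 channel groups.
y = torch.rand(1, 192, 16, 16)
hyper_feat = torch.rand(1, 192, 16, 16)
print(CharmPriorSketch()(y, hyper_feat).shape)   # (1, 192, 16, 16)
```

During actual decoding, each group would be entropy-decoded with its predicted $(\mu_s, \sigma_s)$ before the next group's parameters are computed, so only $S$ sequential prior evaluations are needed rather than one per latent pixel, which is what makes ChARM far cheaper than spatial autoregression.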