# GLIC: General Format Learned Image Compression

Ming Sheng Zhou, Ming Ming Kong*
School of Computer and Software Engineering, Xi Hua University, Cheng Du, Si Chuan, China, 610039
mingshengzhou@foxmail.com, kongming000@126.com

## Abstract

Learned lossy image compression techniques have surpassed traditional methods in both subjective vision and quantitative evaluation. However, current models are only applicable to three-channel image formats, which limits their practical application given the diversity and complexity of image formats. We propose a high-performance learned image compression model for general image formats. We first introduce a transfer method that unifies any-channel image formats, enhancing the applicability of neural networks. This method's effectiveness is demonstrated through image information entropy and image homomorphism theory. Then, we introduce an adaptive attention residual block into the entropy model to give it better generalization ability. Meanwhile, we propose an evenly grouped cross-channel context module for progressive preview image decoding. Experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance in learned image compression in terms of PSNR and MS-SSIM. This work extends the applicability of learned image compression techniques to more practical production environments.

## Introduction

In recent years, image compression technology has advanced rapidly. Learned methods have surpassed existing hand-crafted algorithms such as JPEG (Pennebaker and Mitchell 1992) and VTM (Wien and Bross 2020) in subjective visual quality and in quantified indicators such as PSNR (Gonzales and Wintz 1987) and MS-SSIM (Wang, Simoncelli, and Bovik 2003). Learned image compression also outperforms these algorithms in terms of rate-distortion (RD) performance at the same decoding quality.
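As a reference for the distortion metric used throughout, PSNR can be computed from the mean squared error between an image and its reconstruction. A minimal sketch (function name and `max_val` default are ours, for 8-bit images):

```python
import math
import numpy as np

def psnr(x: np.ndarray, x_hat: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between an original image x
    and its reconstruction x_hat, for pixel values in [0, max_val]."""
    mse = np.mean((x.astype(np.float64) - x_hat.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images: distortion-free
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Higher PSNR at the same bit rate means better RD performance, which is the axis along which the learned and hand-crafted codecs above are compared.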
On one hand, traditional hand-crafted image compression algorithms typically support three-channel (RGB), one-channel (grayscale), or four-channel (RGBA) images, etc. On the other hand, existing learned image compression methods (Ballé et al. 2018; Cheng et al. 2020; He et al. 2021, 2022; Gao et al. 2022; Xu et al. 2022; Lieberman et al. 2023; Lee, Jeong, and Kim 2022; Ali et al. 2024) only model quantization for three-channel images and cannot be extended to multi-channel images in general formats, such as grayscale images or medical tomography images. Expanding the model's applicability to general image format compression and progressive decoding is necessary for the practical application of learned image compression techniques.

*Corresponding author. Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Improvement in RD performance by the one-channel and AARB schemes compared to the baseline model (He et al. 2022).

Significant advancements have occurred in learned image compression, particularly regarding encoding and decoding performance. For example, (He et al. 2021) introduced a novel context encoding scheme based on a checkerboard, which enabled parallel computation in autoregressive context encoding and significantly enhanced encoding efficiency. (Minnen and Singh 2020) proposed a channel context coding scheme with even grouping, substantially improving the efficiency of the decoding process. (He et al. 2022) presented an unevenly grouped channel context encoding scheme, further improving decoding efficiency and introducing a new preview image decoding scheme. Despite these advancements, extending learned image compression techniques to a general image format with good RD performance remains challenging. Our research is driven by the need to process any image format with this technique and to improve RD performance without compromising en/decode speed.
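The checkerboard parallel-context idea mentioned above splits latent positions into anchors (decoded first, in one parallel step) and non-anchors (decoded next, conditioned on their decoded neighbours). A minimal sketch of that split (the parity convention and function name are ours, not taken from the checkerboard paper):

```python
import numpy as np

def checkerboard_masks(h: int, w: int):
    """Boolean anchor / non-anchor masks over an h-by-w latent grid.
    Anchors sit on one colour of the checkerboard; non-anchors on the
    other, so each non-anchor has 4-connected anchor neighbours."""
    idx = np.add.outer(np.arange(h), np.arange(w))  # idx[i, j] = i + j
    anchor = (idx % 2 == 0)
    return anchor, ~anchor
```

Because the two masks partition the grid, decoding needs only two parallel passes instead of one sequential pass per symbol.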
The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

We find that converting the image (regardless of its number of channels) into a one-channel two-dimensional matrix and combining it with an adaptive attention residual block (AARB) effectively avoids the negative effects caused by channel correlation and improves the homogeneity index of the latent representation, which in turn improves RD performance, as shown in Figure 1. In this paper, we contribute to the field of learned image compression in the following ways:

- We propose a General Format Learned Image Compression (GLIC) model. It can unify any image format into a general format for general image compression. It obtains better RD performance than existing models across multiple datasets.
- We propose an improved Adaptive Attention Residual Block (AARB). It extracts the high-level and overall semantic information of an image separately, and obtains their weights adaptively to make the model more widely applicable.
- We propose an evenly grouped Cross-Channel Context (CCCT) module. It first analyzes the overall information of the image, and then gradually decodes the complete image information to achieve fast and high-quality progressive image decoding.

## Related Works

Traditional hand-crafted image compression algorithms are vital for Internet applications. However, with the growing demand for massive data storage across various industries, these traditional methods struggle to meet future needs. Researchers are exploring ways to improve image compression efficiency using neural network techniques. One of the earliest techniques combines the Variational Auto-Encoder (VAE) model with image entropy coding, proving the feasibility of neural networks for image compression.
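The one-channel conversion described above (any-channel image to a single 2D matrix) can be illustrated with a simple sketch. The paper's actual transfer method is not fully reproduced in this excerpt, so this is only an illustrative stand-in that tiles channels side by side; function names and the tiling layout are our assumptions:

```python
import numpy as np

def to_general_format(img: np.ndarray) -> np.ndarray:
    """Tile the C channels of an (H, W, C) image side by side into a
    one-channel (H, W*C) matrix. One-channel input passes through."""
    if img.ndim == 2:  # already a one-channel 2D matrix
        return img
    h, w, c = img.shape
    return np.concatenate([img[:, :, i] for i in range(c)], axis=1)

def from_general_format(mat: np.ndarray, channels: int) -> np.ndarray:
    """Invert to_general_format by splitting columns back into channels."""
    h, wc = mat.shape
    w = wc // channels
    return np.stack([mat[:, i * w:(i + 1) * w] for i in range(channels)], axis=2)
```

Any such mapping must be lossless and invertible so that a grayscale, RGB, RGBA, or multi-channel tomography image can pass through the same single-channel network.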
The subsequent introduction of the Hyperprior model, the autoregressive context module, and the parallel context module has further improved the rate-distortion performance and coding/decoding speed of learned image compression, making it more suitable for practical applications.

### Basic model

Learned lossy image compression (Ballé et al. 2018; Cheng et al. 2020; He et al. 2021, 2022; Mentzer et al. 2018; Minnen, Ballé, and Toderici 2018; Minnen and Singh 2020; Ballé, Laparra, and Simoncelli 2016; Yang and Mandt 2024; Li et al. 2024; Liu, Sun, and Katto 2023) is an end-to-end optimization technique that combines transform coding (Goyal 2001; Ballé et al. 2020) and entropy coding (Rissanen and Langdon 1981; Martin 1979; Van Leeuwen 1976). Its optimization objective requires a trade-off between bit rate and image distortion:

$$\mathcal{L} = R(\hat{y}) + \lambda \cdot D(x, g_s(\hat{y})) = \mathbb{E}_{x \sim p_x}\!\left[-\log_2 p_{\hat{y}|\hat{z}}(\hat{y}|\hat{z}) - \log_2 p_{\hat{z}}(\hat{z})\right] + \lambda \cdot \mathbb{E}_{x \sim p_x}\!\left[d(x, \hat{x})\right] \tag{1}$$

where $R$ is the rate term, $D$ is the distortion term, and $\lambda$ is the Lagrange multiplier hyperparameter used to control the rate-distortion trade-off. As shown in Figure 2a, $x = \{x_1, x_2, x_3\}$ is the original image, where $x_1$, $x_2$, and $x_3$ are the three channels of an RGB image, and $\hat{x} = \{\hat{x}_1, \hat{x}_2, \hat{x}_3\}$ is the decoded image. $y$ is the latent representation before quantization, and $\hat{y}$ are the discrete symbols that need to be persistently saved after entropy encoding. $\hat{y} = Q(y)$ is the quantization of the latent representation obtained by the image analysis transform $y = g_a(x)$, where $Q$ is the quantization operation. $\hat{x} = g_s(\hat{y})$ indicates that the decoded image is obtained from the quantization result by the image synthesis transform. $U|Q$ denotes the quantization and entropy coding operations. During training, quantization is emulated by adding uniform noise $\mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})$ to produce noisy codes $\tilde{y}$. In the en/decode stage, $U|Q$ denotes the actual rounding quantization responsible for generating $\hat{y}$.

### Hyperprior and Context model

(Ballé et al.
2018) posit that the elements of $\hat{y}$ exhibit a significant spatial scale dependence. To encapsulate this spatial dependence, they introduce an additional set of random variables, denoted as $z$ in Figure 2b. $\hat{y}$ is modeled as a Gaussian distribution with zero mean and a standard deviation $\sigma$, which is predicted by applying the parametric transform $h_s$ to $z$. The compressed hyperprior can be added as side information to the bitstream in persistent storage, which allows the synthesizer to use a conditional entropy model. (Minnen, Ballé, and Toderici 2018), building on the work of (Ballé et al. 2018), jointly used a mean-scale hyperprior and added an autoregressive context module, denoted as $C_m$ in Figure 2c. Although the autoregressive context model achieves better RD performance by correlating the currently decoded symbols with the already decoded symbols, its inability to perform parallel computation greatly affects the en/decode speed: its typical encoding and decoding time for an image reaches an intolerable number of seconds. Therefore, solving the parallelism problem of the context model becomes the key to improving efficiency.

### Parallel context model

(He et al. 2021) propose to separate the symbols ($\hat{y} = \{\hat{y}_1, ..., \hat{y}_{h \times w}\}$) into anchors and non-anchors (Eq. 2) and implement parallel decoding of anchors and non-anchors using a checkerboard space context convolution $\Phi_{sp,i} = g_{sp}(\hat{y}$ [...]

[...] $g_s^{\leq 3}$. Conversely, $g_s^{>3}$ shows a consistent trend of improved performance.

### General format results

The general format learned image compression technique is a downscaling solution designed for image compression across various channel counts, not just three-channel images. As shown in Table 3, GLIC has been successfully applied to multiple image formats, with $\lambda = 0.09$ for the model. For example, this method has been tested on specialized medical imaging datasets, specifically the subset0 dataset from Lung Nodule Analysis 2016 (LUNA16) (Setio et al.
2017), and the validation set from the Musculoskeletal Radiographs (MURA) dataset (Rajpurkar et al. 2018). Beyond the realm of medical imaging, GLIC can also be extended to archive large volumes of historical Radio Frequency (RF) data in electromagnetic spectrum management. An example involves the In-phase and Quadrature-phase (I/Q) data from Bluetooth devices (Uzundurukan, Dalveren, and Kara 2020). In this scenario, we convert the I/Q data into a one-channel matrix (grayscale image) for storage. These outcomes verify the effectiveness of GLIC in handling images with an arbitrary number of channels, demonstrating its potential as a general format image compression solution.

### Qualitative results

Subjective visual evaluation is an important qualitative metric for judging the quality of lossy image compression, so we perform a visual comparison. The focus is on the texture details preserved in decoded images at similar compression bit rates. Figure 11 shows the reconstructed images and details. Compared to other methods, GLIC has better RD performance and visual discrimination, and our method preserves more texture details.

Figure 11: Comparison of reconstructions of c26847af1470e880236db4766b42c09d.png (CLIC Organizing Committee 2024).

Figure 12: Ablation experiment. (a) Channel number. (b) AARB and RAB. (c) Context module.

## Ablation Study

### Number of channels

To confirm the one-channel model's superiority for entropy coding, we compare it with the traditional three-channel model. We keep the structure of the main, hyperprior, and context modules constant to control for ablation variables; the only difference is the number of model channels. Figure 12a gives the evaluation results on Kodak (Eastman Kodak Company 1999). The RD performance of the one-channel model is significantly higher than that of the three-channel model. This strengthens the assertion that the multi-channel model's homomorphism feature is negatively impacted by inter-channel information.
Furthermore, we observe an unusual phenomenon: when the multi-channel model is paired with the CCCT, the model inconsistently fails to encode and decode the same image, meaning it cannot reliably compress images. However, when the multi-channel model is used alongside the channel-wise module, this issue does not occur. This is why the RD performance plummets when the three-channel model is combined with CCCT in the experimental results. We consider that the multi-channel model produces significant channel correlation, which CCCT cannot eliminate. To summarize, channel interactions can negatively affect entropy coding, and the one-channel model solves this problem by eliminating the strong channel correlation.

### AARB and RAB module

To verify the effectiveness of our improved AARB module in combination with the one-channel model, we replace only the AARB with the RAB proposed by (Cheng et al. 2020) and train the model at 8 levels, with the hyperparameter $\lambda$ set to {0.0015, 0.004, 0.008, 0.015, 0.025, 0.04, 0.06, 0.08}. The RD curves are shown in Figure 12b. The AARB module combined with the one-channel model structure achieves better RD performance. Combined with the model structure analysis, it can be seen that the attention branch of AARB has the same ability to extract local deep semantic information as the RAB module, while the non-attention branch better retains the overall semantic information. The adaptive module weight assignment block can effectively discriminate between the importance of overall and local semantic information.

### CCCT and Channel-wise context module

The difference between CCCT and the channel-wise context module (Minnen and Singh 2020) is that CCCT extracts the overall information of the image earlier and then gradually obtains the complete information.
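The even grouping shared by CCCT and the channel-wise context module can be sketched as follows: latent channels are split into equally sized groups and coded group by group, each conditioned on the previously decoded groups, so the first few groups already yield a coarse preview. This is only a sketch of the grouping order; the group count and CCCT's internal conditioning are assumptions, not the paper's exact design:

```python
import numpy as np

def even_groups(channels: int, num_groups: int):
    """Partition latent channel indices into equally sized groups.
    Group i is entropy-coded conditioned on groups 0..i-1, so decoding
    only the first groups already supports a preview reconstruction."""
    assert channels % num_groups == 0, "even grouping requires divisibility"
    size = channels // num_groups
    return [list(range(i * size, (i + 1) * size)) for i in range(num_groups)]
```

Decoding half of the groups (e.g. 4 of 8) is what the ablation below calls the preview image.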
To verify the effectiveness of CCCT, we replace only the context module and compare the RD performance of the preview image obtained by decoding half of the chunks, i.e., $g_s^{(4)}$. Figure 12c gives the conclusions drawn on Kodak (Eastman Kodak Company 1999). The RD performance of the complete decoded image is almost the same for both context modules, but the preview image RD performance of CCCT is better than that of the channel-wise context. This result confirms that CCCT is able to capture the overall image information earlier and is more suitable for progressive image decoding tasks.

## Conclusion

We propose the General Format Learned Image Compression (GLIC) model, which can be widely applied to various image compression tasks. We analyze the intermediate feature maps of the multi-channel and one-channel models, which reveals that separating the inter-channel correlations can enhance the homogeneity of the feature maps without reducing their information entropy. This is more favorable for image entropy coding. Therefore, we propose a scheme to unify any images into a general format and combine it with an improved Adaptive Attention Residual Block to achieve excellent rate-distortion performance. Further, we propose the Cross-Channel Context module, which quickly captures the overall semantic information and realizes high-quality progressive preview image decoding. Our experimental results show that GLIC achieves excellent improvements in both rate-distortion performance and model applicability.

## Acknowledgments

This work was funded by the Intelligent Policing Key Laboratory of Sichuan Province (ZNJW2024KFZD004), the key R&D project jointly implemented by Sichuan and Chongqing in 2020 (cstc2020jscx-cylh X0004), and the Intelligent Policing and National Security Risk Management Laboratory (ZHZZZD2301).

## References

Ali, M. S.; Kim, Y.; Qamar, M.; Lim, S.-C.; Kim, D.; Zhang, C.; Bae, S.-H.; and Kim, H. Y. 2024. Towards efficient image compression without autoregressive models.
Advances in Neural Information Processing Systems, 36.

Ballé, J.; Chou, P. A.; Minnen, D.; Singh, S.; Johnston, N.; Agustsson, E.; Hwang, S. J.; and Toderici, G. 2020. Nonlinear transform coding. IEEE Journal of Selected Topics in Signal Processing, 15(2): 339–353.

Ballé, J.; Laparra, V.; and Simoncelli, E. P. 2016. End-to-end optimized image compression. arXiv preprint arXiv:1611.01704.

Ballé, J.; Minnen, D.; Singh, S.; Hwang, S. J.; and Johnston, N. 2018. Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436.

Bégaint, J.; Racapé, F.; Feltman, S.; and Pushparaja, A. 2020. CompressAI: a PyTorch library and evaluation platform for end-to-end compression research. arXiv preprint arXiv:2011.03029.

Bellard, F. 2015. BPG image format. URL https://bellard.org/bpg, 1(2): 1.

Chen, H.; Gu, J.; and Zhang, Z. 2021. Attention in attention network for image super-resolution. arXiv preprint arXiv:2104.09497.

Cheng, Z.; Sun, H.; Takeuchi, M.; and Katto, J. 2020. Learned image compression with discretized gaussian mixture likelihoods and attention modules. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7939–7948.

CLIC Organizing Committee. 2024. Workshop and Challenge on Learned Image Compression. [Online]. Available: http://compression.cc/tasks/. The 6th Challenge on Learned Image Compression.

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. IEEE.

Eastman Kodak Company. 1999. Kodak Lossless True Color Image Suite. Download from http://r0k.us/graphics/kodak/. Accessed: November 15, 1999.

Gao, C.; Xu, T.; He, D.; Wang, Y.; and Qin, H. 2022. Flexible neural image compression via code editing. Advances in Neural Information Processing Systems, 35: 12184–12196.

Gonzales, R. C.; and Wintz, P. 1987. Digital image processing.
Addison-Wesley Longman Publishing Co., Inc.

Goyal, V. K. 2001. Theoretical foundations of transform coding. IEEE Signal Processing Magazine, 18(5): 9–21.

Haralick, R. M.; Shanmugam, K.; and Dinstein, I. H. 1973. Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, (6): 610–621.

He, D.; Yang, Z.; Peng, W.; Ma, R.; Qin, H.; and Wang, Y. 2022. ELIC: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5718–5727.

He, D.; Zheng, Y.; Sun, B.; Wang, Y.; and Qin, H. 2021. Checkerboard context model for efficient learned image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14771–14780.

Jeon, S.; Choi, K. P.; Park, Y.; and Kim, C.-S. 2023. Context-Based Trit-Plane Coding for Progressive Image Compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14348–14357.

Jiang, W.; Yang, J.; Zhai, Y.; Ning, P.; Gao, F.; and Wang, R. 2023. MLIC: Multi-reference entropy model for learned image compression. In Proceedings of the 31st ACM International Conference on Multimedia, 7618–7627.

Johnston, N.; Vincent, D.; Minnen, D.; Covell, M.; Singh, S.; Chinen, T.; Hwang, S. J.; Shor, J.; and Toderici, G. 2018. Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4385–4393.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Lee, J.; Jeong, S.; and Kim, M. 2022. Selective compression learning of latent representations for variable-rate image compression. Advances in Neural Information Processing Systems, 35: 13146–13157.

Lee, J.-H.; Jeon, S.; Choi, K. P.; Park, Y.; and Kim, C.-S. 2022.
DPICT: Deep progressive image compression using trit-planes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16113–16122.

Li, Y.; Xu, T.; Wang, Y.; Liu, J.; and Zhang, Y.-Q. 2024. Idempotent Learned Image Compression with Right Inverse. Advances in Neural Information Processing Systems, 36.

Lieberman, K.; Diffenderfer, J.; Godfrey, C.; and Kailkhura, B. 2023. Neural Image Compression: Generalization, Robustness, and Spectral Biases. In ICML 2023 Workshop Neural Compression: From Information Theory to Applications.

Liu, J.; Sun, H.; and Katto, J. 2023. Learned image compression with mixed transformer-CNN architectures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14388–14397.

Martin, G. N. N. 1979. Range encoding: an algorithm for removing redundancy from a digitised message. In Proc. Institution of Electronic and Radio Engineers International Conference on Video and Data Recording, volume 2.

Mentzer, F.; Agustsson, E.; Tschannen, M.; Timofte, R.; and Van Gool, L. 2018. Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4394–4402.

Minnen, D.; Ballé, J.; and Toderici, G. D. 2018. Joint autoregressive and hierarchical priors for learned image compression. Advances in Neural Information Processing Systems, 31.

Minnen, D.; and Singh, S. 2020. Channel-wise autoregressive entropy models for learned image compression. In 2020 IEEE International Conference on Image Processing (ICIP), 3339–3343. IEEE.

Nian, Y.; Liu, Y.; and Ye, Z. 2016. Pairwise KLT-based compression for multispectral images. Sensing and Imaging, 17: 1–15.

Norkin, A.; Grange, A.; Concolato, C.; Katsavounidis, I.; Tmar, H.; Mammou, K.; Liu, S.; and Baliga, R. 2022. Alliance for Open Media (AOMedia) progress report. SMPTE Motion Imaging Journal, 131(8): 88–92.

Pennebaker, W. B.; and Mitchell, J. L. 1992.
JPEG: Still image data compression standard. Springer Science & Business Media.

Pintus, M.; Ginesu, G.; Atzori, L.; and Giusto, D. D. 2012. Objective evaluation of WebP image compression efficiency. In Mobile Multimedia Communications: 7th International ICST Conference, MOBIMEDIA 2011, Cagliari, Italy, September 5-7, 2011, Revised Selected Papers 7, 252–265. Springer.

Rajpurkar, P.; Irvin, J.; Bagul, A.; Ding, D.; Duan, T.; Mehta, H.; Yang, B.; Zhu, K.; Laird, D.; Ball, R. L.; Langlotz, C.; Shpanskaya, K.; Lungren, M. P.; and Ng, A. Y. 2018. MURA: Large Dataset for Abnormality Detection in Musculoskeletal Radiographs. arXiv:1712.06957.

Rezasoltani, S.; and Qureshi, F. Z. 2023. Hyperspectral Image Compression Using Implicit Neural Representations. In 2023 20th Conference on Robots and Vision (CRV), 248–255. IEEE.

Rissanen, J.; and Langdon, G. 1981. Universal modeling and coding. IEEE Transactions on Information Theory, 27(1): 12–23.

Setio, A. A. A.; Traverso, A.; De Bel, T.; Berens, M. S.; Van Den Bogaard, C.; Cerello, P.; Chen, H.; Dou, Q.; Fantacci, M. E.; Geurts, B.; et al. 2017. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge. Medical Image Analysis, 42: 1–13.

Taubman, D. S.; and Marcellin, M. W. 2002. JPEG2000: Standard for interactive imaging. Proceedings of the IEEE, 90(8): 1336–1357.

Uzundurukan, E.; Dalveren, Y.; and Kara, A. 2020. A Database for the Radio Frequency Fingerprinting of Bluetooth Devices. Data, 5(2).

Van Leeuwen, J. 1976. On the Construction of Huffman Trees. In ICALP, 382–410.

Wang, Z.; Simoncelli, E. P.; and Bovik, A. C. 2003. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, 1398–1402. IEEE.

Wien, M.; and Bross, B. 2020. Versatile video coding algorithms and specification.
In 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP), 1–3. IEEE.

Xu, T.; Wang, Y.; He, D.; Gao, C.; Gao, H.; Liu, K.; and Qin, H. 2022. Multi-sample training for neural image compression. Advances in Neural Information Processing Systems, 35: 1502–1515.

Yang, R.; and Mandt, S. 2024. Lossy image compression with conditional diffusion models. Advances in Neural Information Processing Systems, 36.