# Learned Image Transmission with Hierarchical Variational Autoencoder

Guangyi Zhang*, Hanlei Li*, Yunlong Cai, Qiyu Hu, Guanding Yu, Runmin Zhang

College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China

{zhangguangyi, hanleili, ylcai, qiyhu, yuguanding, runmin zhang}@zju.edu.cn

*These authors contributed equally.

## Abstract

In this paper, we introduce an innovative hierarchical joint source-channel coding (HJSCC) framework for image transmission, utilizing a hierarchical variational autoencoder (VAE). Our approach leverages a combination of bottom-up and top-down paths at the transmitter to autoregressively generate multiple hierarchical representations of the original image. These representations are then directly mapped to channel symbols for transmission by the JSCC encoder. We extend this framework to scenarios with a feedback link, modeling transmission over a noisy channel as a probabilistic sampling process and deriving a novel generative formulation for JSCC with feedback. Compared with existing approaches, our proposed HJSCC provides enhanced adaptability by dynamically adjusting the transmission bandwidth, encoding these representations into varying numbers of channel symbols. Additionally, we introduce a rate attention module to guide the JSCC encoder in optimizing its encoding strategy based on prior information. Extensive experiments on images of varying resolutions demonstrate that our proposed model outperforms existing baselines in rate-distortion performance and maintains robustness against channel noise.

## Introduction

To meet the transmission requirements of heavy data traffic in future sixth-generation (6G) networks, wireless edge devices need to be equipped with higher transmission efficiency. Most contemporary systems employ a two-step strategy for data transmission: first, the raw data is compressed with a source codec, such as JPEG (Wallace 1992) or BPG (Lainema et al. 2012); then, the encoded bits are protected with redundancy introduced by a carefully designed channel codec, such as LDPC or polar codes (Arikan 2009). However, the optimality of this separation-based design holds only in the limit of infinite block length; in practical applications the block length is finite, so optimality cannot be guaranteed. In this context, joint source-channel coding (JSCC) has emerged as a potential solution, offering higher coding gains than the traditional separation-based coding paradigm.

With the revolutionary progress of deep learning in various fields, such as image compression (Ballé et al. 2018; He et al. 2021; Xu et al. 2022b) and generative models (Kingma and Welling 2022; Razavi, van den Oord, and Vinyals 2019), a novel design paradigm for JSCC, called learned image transmission (LIT), has been conceived by formulating the communication pipeline as an end-to-end deep learning model (Bourtsoulatze et al. 2019; Zhang et al. 2024; Sun et al. 2023). Specifically, these methods leverage powerful neural networks to implement the encoding and decoding processes. In this approach, the whole system is viewed as an autoencoder (AE), which can be jointly learned in a data-driven manner. A notable method proposed by (Bourtsoulatze et al. 2019) employed CNNs to construct the source and channel codecs for wireless image transmission, achieving strong performance by mapping the input image directly into channel symbols.
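To make this pipeline concrete, below is a minimal sketch of such an end-to-end JSCC autoencoder trained over an AWGN channel, with a power-normalized symbol stream. The layer sizes, channel model, and module names are illustrative assumptions, not the architecture proposed in this paper.

```python
import torch
import torch.nn as nn

class DeepJSCC(nn.Module):
    """Minimal end-to-end JSCC autoencoder: image -> channel symbols -> image."""
    def __init__(self, c_latent=16):
        super().__init__()
        # Encoder maps pixels directly to real-valued channel symbols.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, c_latent, 5, stride=2, padding=2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(c_latent, 64, 5, stride=2, padding=2,
                               output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 5, stride=2, padding=2,
                               output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x, snr_db=10.0):
        z = self.encoder(x)
        # Enforce a unit average-power constraint on the transmitted symbols.
        z = z * torch.rsqrt(z.pow(2).mean(dim=(1, 2, 3), keepdim=True) + 1e-9)
        # AWGN channel: noise standard deviation follows from the target SNR.
        sigma = 10 ** (-snr_db / 20)
        z_hat = z + sigma * torch.randn_like(z)
        return self.decoder(z_hat)

model = DeepJSCC()
x = torch.rand(4, 3, 32, 32)                             # images in [0, 1]
loss = nn.functional.mse_loss(model(x, snr_db=10.0), x)  # training for PSNR
```

Because the channel is differentiable (additive noise), the encoder and decoder can be optimized jointly by backpropagating the distortion through the noisy symbols.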
Moreover, Kurka and Gündüz (2020) and Wu et al. (2024a) investigated JSCC with feedback. In this setting, the transmission is divided into multiple phases, with the transmitter receiving the channel output after each phase, which simplifies the encoding process and improves the overall performance. Beyond deterministic AEs, some studies have employed variational autoencoders (VAEs) to design JSCC systems (Choi et al. 2019; Bo et al. 2024; Hu et al. 2023), where the channel symbols are generated through sampling. Some of these VAE-based methods show superior performance compared with deterministic AE-based methods, particularly under severe channel conditions.

Though VAE-based methods have demonstrated remarkable performance, they suffer significant performance degradation on high-resolution images. Furthermore, most existing methods support only fixed-rate coding, in contrast to emerging work on transform-coding-based image compression (Ballé et al. 2018), where the compression rate for each image is determined by the estimated entropy of its feature representation and varies across samples. Consequently, these methods are less flexible and adaptive, potentially incurring performance penalties.

In this work, we aim to overcome the limitations of previous methods while enhancing performance. Specifically, we develop a hierarchical JSCC (HJSCC) framework based on a powerful hierarchical VAE architecture (Child 2020). Our transmitter employs both bottom-up and top-down paths to autoregressively generate multiple hierarchical representations of the original image. These representations are then mapped to channel symbols by multiple JSCC encoder blocks. Building upon this, we further explore the application of HJSCC in a classical scenario where a feedback link exists. By modeling transmission over a noisy channel as a probabilistic sampling process, we derive a novel generative formulation for JSCC with feedback, which achieves significantly better performance than most existing advanced schemes.

While there have been attempts at variable-rate transmission (Dai et al. 2022; Song et al. 2023; Zhang et al. 2023; Yang and Kim 2022) in the realm of JSCC without feedback, the problem of rate-adaptive design for JSCC with feedback remains underexplored. Unlike existing works (Kurka and Gündüz 2020; Wu et al. 2024a; Li et al. 2024; Jiang et al. 2022), we leverage the prior distribution (which characterizes the entropy information) of each representation to generate masks that control the number of symbols for each representation. This approach allows us to dynamically adjust the transmission rate. Additionally, we introduce a rate attention module to guide the JSCC encoder in adjusting its encoding strategy according to the prior information.

In summary, our contributions are as follows:

- **HJSCC Framework:** We develop a hierarchical scheme that supports the transmission of high-resolution images.
- **HJSCC with Feedback:** We extend HJSCC to the case with feedback by viewing the transmission as a sampling process and deriving a generative formulation.
- **Dynamic Rate Control:** By utilizing the entropy information of the representations to dynamically control the transmission rate, this approach fills the gap in rate-adaptive design when a feedback link is present (see the sketch after this list).
- **Rate Attention Module:** We propose a spatial grouping strategy and a rate attention module to improve the overall rate-distortion performance.
- **Experimental Studies:** We provide substantial experiments verifying the effectiveness of the proposed method, demonstrating that it achieves better coding gain than emerging deep learning-based JSCC and separation-based digital transmission schemes.
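The dynamic rate control idea can be illustrated with a small sketch: since the prior's per-element scale determines the entropy of a representation, one can keep only as many channel symbols as its estimated information content warrants. The function name and the entropy-to-symbol mapping below are illustrative assumptions, not the paper's exact design.

```python
import torch

def rate_mask_from_prior(scale, max_symbols, bits_per_symbol=2.0):
    """Build a per-sample mask selecting how many channel symbols to send.

    scale: (B, N) standard deviations of a factorized Gaussian prior over a
           representation; larger scale -> higher entropy -> more symbols.
    """
    # Differential entropy of N(0, scale^2) in bits: 0.5 * log2(2*pi*e*scale^2).
    ent = (0.5 * torch.log2(2 * torch.pi * torch.e * scale.pow(2))).clamp(min=0)
    total_bits = ent.sum(dim=1)                       # estimated bits per sample
    k = (total_bits / bits_per_symbol).ceil().long()  # symbols needed
    k = k.clamp(max=max_symbols)
    # Keep the first k symbol positions for each sample in the batch.
    idx = torch.arange(max_symbols).unsqueeze(0)      # (1, max_symbols)
    return (idx < k.unsqueeze(1)).float()             # (B, max_symbols)

scale = torch.rand(4, 256) * 2                        # hypothetical prior scales
mask = rate_mask_from_prior(scale, max_symbols=128)
print(mask.sum(dim=1))                                # symbols sent per image
```

Because the mask is derived from the prior, which is available at both ends (or conveyed via feedback), the receiver knows how many symbols to expect without extra signaling.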
## Related Works

### Variational Autoencoder

VAEs are deep generative models capable of generating high-dimensional data from a low-dimensional latent space, and are a variant of the autoencoder (Sønderby et al. 2016; Cemgil et al. 2020). By sampling from the learned latent-space distribution and passing these samples through the decoder network, a VAE can generate new data points that resemble the training data. However, the original VAE is known to perform worse than many other generative models, particularly when applied to high-resolution images. (Vahdat and Kautz 2020; Child 2020; Yuhta Takida 2024) addressed this by proposing deep hierarchical VAEs, where the latent variable is divided into several disjoint groups, achieving significantly better performance than standard VAEs. More recently, VAEs have been applied to compression tasks (Duan et al. 2023; Lu et al. 2024; Townsend, Bird, and Barber 2019; Kingma, Abbeel, and Ho 2019).

Figure 1: Probabilistic models of (a) a VAE and (b) the bidirectional encoder and generative decoder of a hierarchical ResNet VAE. The bias is a trainable parameter.

### Learned Image Transmission

Unlike the separation-based design described above, recent studies have delved into the utilization of AEs and their variants, e.g., VAEs, to design wireless image transmission systems, resulting in a number of efficient methods (Bourtsoulatze et al. 2019; Xu et al. 2022a; Zhang et al. 2024; Sun et al. 2023; Yang et al. 2024; Saidutta, Abdi, and Fekri 2019). In particular, (Bourtsoulatze et al. 2019; Wu et al. 2024b) and (Yang et al. 2024) proposed using neural networks to perform source encoding/decoding and channel coding/decoding simultaneously, with the goal of jointly optimizing the entire system to maximize PSNR. The VAE-based methods adopt probabilistic modeling, wherein the encoding process is characterized as a stochastic procedure. In these systems, channel symbols are generated by sampling from a probability distribution conditioned on the input image (Bo et al. 2024; Saidutta, Abdi, and Fekri 2019). These approaches have demonstrated superior performance, especially under severe transmission conditions.

## Proposed Methods

### Background

**VAE and Hierarchical VAE.** As stated in (Cemgil et al. 2020), the VAE is a stochastic variational inference scheme that can be applied to various intelligent tasks, such as recognition, denoising, and generation. As shown in Fig. 1(a), to formulate a vision-related model, we typically start with the following premises. Let $x$ denote an image intensity vector drawn from a dataset $\mathcal{X}$ with distribution $p_x$. The other variable is the latent variable $z$, with prior $p_z$. The main target of a VAE is to learn a generative model (decoder) $p_{x|z}$ for sampling and a posterior density model (encoder) $q_{z|x}$ for variational inference.
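In practice, $q_{z|x}$ is typically realized as a diagonal Gaussian whose mean and variance are predicted by a neural network, sampled via the reparameterization trick so that gradients flow through the sampling step. Below is a minimal sketch; the layer and dimension choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Amortized posterior q(z|x) as a diagonal Gaussian."""
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        self.net = nn.Linear(x_dim, 2 * z_dim)  # predicts mean and log-variance

    def forward(self, x):
        mu, logvar = self.net(x).chunk(2, dim=-1)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I),
        # which keeps sampling differentiable w.r.t. mu and logvar.
        z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
        return z, mu, logvar
```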
The objective for learning a VAE model can be formulated as minimizing the (variational) upper bound on the negative log-likelihood of a batch of data points, given by

$$
\mathcal{L}(\theta, \phi) = \mathbb{E}_{x \sim p_x,\, z \sim q_{z|x}}\left[ D_{\mathrm{KL}}\left(q_{z|x} \,\|\, p_z\right) - \log p_{x|z} \right], \tag{1}
$$

where $D_{\mathrm{KL}}\left(q_{z|x} \,\|\, p_z\right) = \mathbb{E}_{z \sim q_{z|x}}\left[\log\left(q_{z|x} / p_z\right)\right]$, and $\theta$ and $\phi$ represent the parameters of the encoder $q_{z|x}$ and decoder $p_{x|z}$, respectively.

Hierarchical VAEs are a family of VAE models that partition the latent variables into several disjoint groups.

Figure 2: Diagram of a deep learning-based JSCC system.

The probabilistic diagram of a classical hierarchical VAE, the ResNet VAE, is shown in Fig. 1(b), consisting of a bidirectional encoder and a generative decoder. Specifically, the latent variables can be denoted by $\mathbf{z} = \{z_1, z_2, \ldots, z_L\}$, where $L$ represents the number of groups. The prior for $\mathbf{z}$ is modeled as $p_{z_{1:L}} = \prod_{l} p_{z_l \mid z_{<l}}$, so that each group is conditioned on all preceding groups.
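The factorized prior $\prod_{l} p_{z_l \mid z_{<l}}$ can be realized as a top-down sampling pass over the latent groups, in the spirit of Fig. 1(b), where the bias feeding the top of the network is a trainable parameter. The sketch below illustrates this; the way the context summarizes $z_{<l}$ and the layer widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TopDownPrior(nn.Module):
    """Autoregressive prior p(z_{1:L}) = prod_l p(z_l | z_{<l}) for a
    hierarchical VAE, realized as a top-down pass over latent groups."""
    def __init__(self, z_dim=16, num_groups=4):
        super().__init__()
        self.num_groups = num_groups
        # Each conditional predicts (mu_l, logvar_l) from a running context.
        self.cond = nn.ModuleList(
            nn.Linear(z_dim, 2 * z_dim) for _ in range(num_groups))
        self.ctx0 = nn.Parameter(torch.zeros(z_dim))  # trainable bias input

    def sample(self, batch_size):
        ctx = self.ctx0.expand(batch_size, -1)
        groups = []
        for l in range(self.num_groups):
            mu, logvar = self.cond[l](ctx).chunk(2, dim=-1)
            z_l = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
            groups.append(z_l)
            # Fold z_l into the context so later groups condition on z_{<l}.
            ctx = ctx + z_l
        return groups
```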