Learning Neural Vocoder from Range-Null Space Decomposition

Andong Li¹,², Tong Lei³,⁴, Zhihang Sun³, Rilin Chen³, Erwei Yin⁵,⁶, Xiaodong Li¹,², Chengshi Zheng¹,²

¹Institute of Acoustics, Chinese Academy of Sciences; ²University of Chinese Academy of Sciences; ³Tencent AI Lab; ⁴Nanjing University; ⁵Defense Innovation Institute, Academy of Military Sciences (AMS); ⁶Tianjin Artificial Intelligence Innovation Center (TAIIC)

cszheng@mail.ioa.ac.cn

Abstract. Despite the rapid development of neural vocoders in recent years, they usually suffer from intrinsic challenges such as opaque modeling and the parameter-performance trade-off. In this study, we propose an innovative time-frequency (T-F) domain neural vocoder to resolve these challenges. Specifically, we bridge the classical range-null space decomposition (RND) theory and the vocoder task: the reconstruction of the target spectrogram is decomposed into the superposition of a range-space component and a null-space component, where the former is obtained by a linear domain shift from the original mel-scale domain to the target linear-scale domain, and the latter is instantiated by a learnable network for further spectral detail generation. Accordingly, we propose a novel dual-path framework in which the spectrum is hierarchically encoded/decoded, and cross-band and narrow-band modules are elaborately devised for efficient sub-band and sequential modeling. Comprehensive experiments are conducted on the LJSpeech and LibriTTS benchmarks. Quantitative and qualitative results show that, while enjoying lightweight network parameters, the proposed approach yields state-of-the-art performance among existing advanced methods. Our code and the pretrained model weights are available at https://github.com/Andong-Li-speech/RNDVoC.
1 Introduction

A sound vocoder aims to reconstruct audible time-domain waveforms using electronic and computational techniques, and is widely employed in text-to-speech (TTS) [Wang et al., 2017; Huang et al., 2022a; Li et al., 2025], text-to-audio (TTA) [Liu et al., 2023], and speech enhancement [Zhou et al., 2024; Li et al., 2022]. (Chengshi Zheng is the corresponding author.)

Figure 1: A comparison of PESQ scores versus model parameters on the LibriTTS benchmark, covering RNDVoC, RNDVoC-Lite, and RNDVoC-UltraLite (Ours), Vocos (ICLR 2024), BigVGAN-base and BigVGAN (ICLR 2023), UnivNet (Interspeech 2021), HiFiGAN (NeurIPS 2020), iSTFTNet (ICASSP 2022), APNet2 (MMSC 2024), APNet (TASLP 2023), and WaveGlow (ICASSP 2019). A larger bubble denotes higher computational complexity. Note that both BigVGAN and our methods are trained for 1M steps here.

Compared with traditional digital signal processing (DSP)-based vocoders like STRAIGHT [Kawahara, 2006] and WORLD [Morise et al., 2016], neural vocoders enjoy a clear advantage in generation naturalness and quality, and have therefore garnered wide attention in recent years. In the primitive stage, neural vocoders typically generated waveforms sample by sample; representative works include WaveNet [Van Den Oord et al., 2016] and WaveRNN [Kalchbrenner et al., 2018]. Despite the improvement, they often suffer from considerably slow inference due to their autoregressive processing. To alleviate this deficiency, other schemes have subsequently been adopted, including knowledge distillation-based [Oord et al., 2018], normalizing flow-based [Ping et al., 2020], and glottis-based [Juvela et al., 2019] approaches. Recently, generative adversarial network (GAN)-based neural vocoders have gained increasing attention due to their non-autoregressive structure and high generation quality.
Pioneering works include MelGAN [Kumar et al., 2019] and HiFiGAN [Kong et al., 2020], where the mel-spectrogram is gradually upsampled and recovered to the waveform via alternating residual blocks and upsampling layers. Besides these time-domain methods, time-frequency (T-F) domain neural vocoders have been proposed in more recent years, which enjoy faster inference speed and promising generation quality [Siuzdak, 2024].

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25)

Despite the success of existing neural vocoders, some inherent challenges remain and can impede further progress. First, most end-to-end neural vocoders still suffer from the typical parameter-performance trade-off. For example, although it achieves impressive performance, BigVGAN has as many as 112M parameters [Lee et al., 2023], which is awkward and inconvenient for practical applications. Second, while T-F domain neural vocoders are advantageous in inference efficiency, their performance often lags behind mainstream time-domain ones. On the first point, we argue that existing methods usually model the vocoding process as a black box, neglecting the linear degradation prior of the mel-spectrogram¹; excessive parameters are therefore often required for target reconstruction. On the second point, the short-time Fourier transform (STFT) explicitly decouples the frequency information among different sub-bands, and acoustic features usually lie in specific frequency regions; however, existing T-F domain neural vocoders adopt only full-band modeling, neglecting the hierarchical characteristic of the T-F spectrum. To remedy this, we propose a novel T-F domain neural vocoder, called RNDVoC, to tackle the above challenges.
Specifically, we revisit the formulation of the mel-spectrogram and establish its connection with range-null decomposition (RND) theory. By doing so, the reconstruction of the target spectrum is decomposed into range-space modeling (RSM) and null-space modeling (NSM), where the former projects the original mel-spectrogram into its counterpart defined in the target linear-scale domain, and the latter is responsible for spectral detail generation. In this way, we provide a more transparent and elegant perspective for framework design. Besides, we devise an efficient dual-path structure, where the spectrum is hierarchically encoded and decoded, and cross-band and narrow-band modules are alternately adopted for sub-band and sequential modeling. Experimental results show that with only 2.8% of the parameters and 8.17% of the computational complexity, our method yields comparable and even better performance than the 112M-parameter BigVGAN in both objective and subjective scores, as well as a nearly 10x speed-up on a CPU. Meanwhile, even when the network parameters are reduced to as few as 0.08M, the performance remains comparable to existing baselines. Figure 1 showcases the PESQ scores of many representative methods on the LibriTTS benchmark.

Our contributions can be summarized as four-fold:
- We provide a novel range-null space decomposition perspective to tackle the neural vocoder task. To the best of our knowledge, this is the first time the RND theory has been leveraged for audio generation.
- We devise a novel dual-path framework to encode/decode the spectrum hierarchically, and cross-band and narrow-band modules are employed for efficient sub-band and sequential modeling.
- We conduct extensive experiments to reveal the superiority of our method over existing baselines.
- We provide an ultra-light version with only around 80K trainable parameters. To the best of our knowledge, this is the smallest end-to-end neural vocoder to date.

¹Strictly speaking, the mel-spectrogram is linearly compressed in spectral magnitude, and the phase information is dropped.

Figure 2: Illustration of the proposed RNDVoC (the reconstructed spectrogram as the superposition of range-space and null-space components, with a phase condition).

2 Related Works

2.1 DSP-based Methods

In conventional DSP-based vocoders, speech is usually generated via statistical parameter estimation. In STRAIGHT [Kawahara, 2006], the excitation and resonant parameters are separately estimated for real-time speech manipulation and synthesis. In WORLD [Morise et al., 2016], the F0, spectral envelope, and aperiodicity parameters are first determined with analysis algorithms, and a synthesis method is then devised for time-domain waveform generation. Despite their simplicity, the synthesized speech quality is often unsatisfactory and may include buzzing artifacts.

2.2 Neural Vocoder Methods

Autoregressive methods: In WaveNet [Van Den Oord et al., 2016], consecutive WaveNet modules with increasing dilation rates are adopted for sample-level generation. In WaveRNN [Kalchbrenner et al., 2018], samples are autoregressively generated with RNN layers. In LPCNet [Valin and Skoglund, 2019], linear prediction coefficients (LPC) are estimated to predict the next sample, and a lightweight RNN is utilized for residual calculation. Despite the improvements over traditional DSP-based vocoders, their autoregressive nature leads to rather slow inference.

Flow-based methods: Normalizing flow is regarded as a classical generation paradigm. Typical flow-based neural vocoders include WaveGlow [Prenger et al., 2019], FlowWaveNet [Kim et al., 2019], and RealNVP [Dinh et al., 2016], where a bijective mapping is established between a normalized probability distribution and the target data distribution through stacked invertible modules.
GAN-based methods: In MelGAN [Kumar et al., 2019], the mel-spectrogram is gradually upsampled through consecutive upsampling layers and residual blocks. In HiFiGAN, to facilitate periodic pattern generation, multi-period and multi-scale discriminators are employed for improved speech generation. BigVGAN [Lee et al., 2023] incorporates a periodic activation function and anti-aliased representations into the generator to boost the generation of periodic components, and the network size is further scaled up to 112M parameters for universal audio vocoding. Beyond these time-domain methods, several works explore the feasibility of the T-F domain. Vocos [Siuzdak, 2024] stacks multiple ConvNeXt v2 blocks [Woo et al., 2023] for magnitude and phase estimation. In [Ai and Ling, 2023], the magnitude and phase components are separately modeled with ResNet blocks.

Figure 3: Network structure of the proposed RNDVoC. (a) Overall network structure, comprising the range-space module and the null-space module with stacked dual-path blocks. (b) Detailed structure of the hierarchical spectral encoding module (HSEM). (c) Detailed structure of the hierarchical magnitude decoding module (HMDM) and hierarchical phase decoding module (HPDM). (d) Detailed structure of the dual-path block (DPB), with cross-band and narrow-band modules built from grouped, depthwise, and pointwise Conv1d layers. Different modules are indicated with different colors.
Diffusion-based methods: Owing to the powerful generation capability of diffusion methods, some works also adopt diffusion for neural vocoding. Representative works include DiffWave [Kong et al., 2021], WaveGrad [Chen et al., 2021], and PriorGrad [Lee et al., 2022], where the waveforms gradually degrade to Gaussian noise in the forward path and the target signals are iteratively generated in the reverse process. Despite good performance, inference can be slow due to the numerous iteration steps, and fast sampling strategies [Huang et al., 2022b; Lu et al., 2022] are required to reduce the inference cost.

3 Methodology

In this section, we first introduce the range-null decomposition (RND) theory and formulate the speech vocoder task, and then demonstrate their interconnection. Finally, the proposed learning framework is elucidated.

3.1 Range-Null Space Decomposition

Consider a classical signal compression model:

$$y = Ax + n, \qquad (1)$$

where $x \in \mathbb{R}^{D}$ and $\{y, n\} \subset \mathbb{R}^{d}$ denote the target, observed, and noise signals, respectively, and $A \in \mathbb{R}^{d \times D}$ denotes the compression matrix, with $d \ll D$. In the noise-free scenario, Eq. (1) simplifies to $y = Ax$. If the pseudo-inverse of $A$ is defined as $A^{\dagger} \in \mathbb{R}^{D \times d}$, which satisfies $A A^{\dagger} A = A$, then the signal $x$ can be decomposed into two orthogonal subspaces, one residing in the range-space of $A$ and the other in the null-space:

$$x = A^{\dagger} A x + \left(I - A^{\dagger} A\right) x, \qquad (2)$$

where $A^{\dagger} A x$ is the range-space component and $\left(I - A^{\dagger} A\right) x$ is the remaining null-space component. For a practical solution $\hat{x}$ of $x$, two consistency conditions should be satisfied: (1) Degradation consistency: $A \hat{x} \equiv y$, i.e., the information embedded in the compressed signal space remains unaltered after reconstruction. (2) Data consistency: $\hat{x} \sim p(x)$, i.e., the estimate $\hat{x}$ should follow the same data distribution as $x$.
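The decomposition and the degradation-consistency condition can be checked numerically. Below is a minimal NumPy sketch (with illustrative dimensions, not the paper's exact configuration) that builds a random compression matrix, forms the decomposition of Eq. (2), and verifies that the null-space component vanishes under the degradation:

```python
import numpy as np

# Toy check of the range-null space decomposition (Eq. 2):
# x = A^+ A x + (I - A^+ A) x, with A^+ the Moore-Penrose pseudo-inverse.
# Dimensions are illustrative stand-ins (d = 80 compressed, D = 513 target).
rng = np.random.default_rng(0)
d, D = 80, 513
A = rng.standard_normal((d, D))          # compression matrix, d << D
x = rng.standard_normal(D)               # target signal
A_pinv = np.linalg.pinv(A)               # A^+ in R^{D x d}

x_range = A_pinv @ A @ x                 # range-space component
x_null = x - x_range                     # null-space component (I - A^+ A) x

# The two components sum back to x ...
assert np.allclose(x_range + x_null, x)
# ... and the null-space part vanishes under the degradation A, so the
# degradation consistency A x_hat = y holds for any null-space estimate.
assert np.allclose(A @ x_null, 0.0, atol=1e-8)
# A A^+ A = A (defining property of the pseudo-inverse)
assert np.allclose(A @ A_pinv @ A, A)
```

This also makes the division of labor concrete: the range-space component is fully determined by the observation, and only the null-space component is left for a network to estimate.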
Based on the above constraints, the solution can be expressed as:

$$\hat{x} = A^{\dagger} y + \left(I - A^{\dagger} A\right) \bar{x}, \qquad (3)$$

where $A^{\dagger} y$ and $\left(I - A^{\dagger} A\right) \bar{x}$ correspond to the solutions in the range-space and null-space, respectively. Due to the orthogonality, $\hat{x}$ naturally satisfies the first condition: $A \hat{x} = A A^{\dagger} A x + A\left(I - A^{\dagger} A\right) \bar{x} = A x + 0 = y$. Therefore, we only need to consider the estimation of $\bar{x}$ to meet the second condition, which can often be implemented with learnable networks.

3.2 Revisiting the Neural Vocoder Task

In this paper, we mainly consider the scenario where the mel-spectrogram serves as the acoustic feature, owing to its simplicity and prevalence in neural vocoders. The degradation process of a log-scale mel-spectrogram is given by:

$$X_{\mathrm{mel}} = \log\left(A\lvert S\rvert\right), \qquad (4)$$

where $X_{\mathrm{mel}} \in \mathbb{R}^{F_m \times T}$ and $\lvert S\rvert \in \mathbb{R}^{F \times T}$ denote the mel and target spectrograms, respectively, $\{F_m, F\}$ are the numbers of frequency bins on the mel and linear scales, and $T$ is the number of frames. $A \in \mathbb{R}^{F_m \times F}$ is the mel filterbank, instantiated as a linear compression matrix. Evidently, Eq. (4) involves two degradation operations: (1) discarding the phase information, and (2) compressing the magnitude with a linear operation. Intuitively, the inverse process necessarily involves two steps: (1) estimating the phase spectrum, and (2) recovering the magnitude spectrum from the compressed observation. In the earlier literature, a neural network typically serves as a black box to map the mel feature to the target waveform or T-F spectrum, which can be expressed as:

$$\hat{s} = \mathcal{F}_{T}\left(X_{\mathrm{mel}}, \Theta_1\right), \qquad \left\{\lvert \hat{S}\rvert, \hat{\Phi}\right\} = \mathcal{F}_{TF}\left(X_{\mathrm{mel}}, \Theta_2\right), \qquad (5)$$

where $\mathcal{F}_{T}(\cdot, \Theta_1)$ and $\mathcal{F}_{TF}(\cdot, \Theta_2)$ denote the mapping functions of time-domain and T-F domain neural vocoders, respectively, $\hat{s}$ is the estimated time-domain waveform, and $\left\{\lvert \hat{S}\rvert, \hat{\Phi}\right\}$ are the estimated spectral magnitude and phase. Note that if the log operation is absorbed into the left-hand side of Eq.
(4), i.e., $X'_{\mathrm{mel}} = \exp\left(X_{\mathrm{mel}}\right) = A\lvert S\rvert$, it exhibits the same form as the noise-free case of Eq. (1). In other words, recovering the original magnitude spectrum can be formulated as a classical compressive sensing (CS) problem [Baraniuk et al., 2010], in which we attempt to reconstruct the target signal from a linearly compressed representation. Following the explicit decomposition in Eq. (3), the abstract generation process of the target spectrum in Eq. (5) can be rewritten as:

$$\lvert \hat{S}\rvert_{\mathrm{range}} = \mathcal{F}_{\mathrm{range}}\left(X'_{\mathrm{mel}}\right) = A^{\dagger} X'_{\mathrm{mel}}, \qquad (6)$$

$$\left\{\lvert \hat{S}\rvert_{\mathrm{null}}, \hat{\Phi}\right\} = \mathcal{F}_{\mathrm{null}}\left(\lvert \hat{S}\rvert_{\mathrm{range}}\right), \qquad (7)$$

$$\lvert \hat{S}\rvert = \lvert \hat{S}\rvert_{\mathrm{range}} + \left(I - A^{\dagger} A\right)\lvert \hat{S}\rvert_{\mathrm{null}} = A^{\dagger} A \lvert S\rvert + \left(I - A^{\dagger} A\right)\lvert \hat{S}\rvert_{\mathrm{null}}, \qquad (8)$$

$$\hat{S} = \lvert \hat{S}\rvert e^{j \hat{\Phi}}, \qquad (9)$$

where $\mathcal{F}_{\mathrm{range}}(\cdot)$ and $\mathcal{F}_{\mathrm{null}}(\cdot)$ are the operations in the range- and null-spaces, respectively, and $A^{\dagger} \in \mathbb{R}^{F \times F_m}$ is the pseudo-inverse of $A$. As shown in Figure 2, in $\mathcal{F}_{\mathrm{range}}$, the original acoustic feature in the mel domain is transformed into the target domain via the pseudo-inverse operation. In $\mathcal{F}_{\mathrm{null}}$, the remaining magnitude details are estimated to supplement the overall spectral reconstruction.

Figure 4: Illustration of the proposed omnidirectional phase loss (fixed 3x3 kernels convolved with the local phase spectrum).

Compared with earlier methods, the reconstruction in Eq. (8) enjoys several advantages. First, we fully utilize the linear degradation of the mel-spectrum as a prior to establish two orthogonal subspaces, which enhances the interpretability of the overall framework. Besides, due to the degradation consistency, the original acoustic feature embedded in the mel-spectrogram is well preserved, mitigating acoustic distortion in the reconstructed signals. Note that, since only the spectral magnitude is involved in the degradation formulation, we let $\mathcal{F}_{\mathrm{null}}$ also estimate the phase component, as shown in Eq. (7). We also note that a similar pseudo-inverse strategy was recently adopted in [Lv et al., 2024].
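As a concrete illustration of the range-space projection in Eq. (6), the following NumPy sketch uses a random nonnegative stand-in for the mel filterbank (not an actual mel matrix) and verifies the degradation consistency of the range-space estimate:

```python
import numpy as np

# Sketch of the range-space module (Eq. 6): undo the log, then project the
# mel-scale magnitude back to the linear-scale axis with the pseudo-inverse
# of the filterbank. A is a random nonnegative stand-in, not a mel matrix.
rng = np.random.default_rng(1)
F_mel, F_lin, T = 80, 513, 100
A = np.abs(rng.standard_normal((F_mel, F_lin)))   # stand-in "mel" filterbank
S_mag = np.abs(rng.standard_normal((F_lin, T)))   # "true" magnitude spectrogram

X_mel = np.log(A @ S_mag)                          # degradation of Eq. (4)
A_pinv = np.linalg.pinv(A)
S_range = A_pinv @ np.exp(X_mel)                   # range-space estimate, Eq. (6)

# Degradation consistency: re-compressing the range-space estimate
# reproduces the observed (exponentiated) mel feature up to float error.
assert np.allclose(A @ S_range, np.exp(X_mel))
print(S_range.shape)  # (513, 100)
```

The range-space estimate alone is a blurred, low-rank version of the target magnitude; the missing fine structure lies entirely in the null-space, which is what the learnable network supplies.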
The difference is that we endow the pseudo-inverse operation with an intuitive physical explanation, whereas it is adopted only as an empirical trick in [Lv et al., 2024].

3.3 Architecture Design of RNDVoC

As illustrated above, we explicitly decouple two orthogonal sub-spaces and devise network modules for their estimation. Figure 3(a) presents the detailed framework of RNDVoC. The input mel-spectrogram is first processed by the range-space module (RSM), where the pseudo-inverse operation shifts the mel feature into the linear-scale domain. The null-space module (NSM) is then applied; it mainly involves three parts: the hierarchical spectral encoding module (HSEM), the dual-path module (DPM), and two decoder branches, i.e., the hierarchical magnitude/phase decoding modules (HMDM/HPDM), which are introduced in turn below.

Hierarchical Spectral Encoder and Decoder

The detailed structure of the HSEM is shown in Figure 3(b). The input $\lvert \hat{S}\rvert_{\mathrm{range}} \in \mathbb{R}^{F \times T}$ is first split into $I$ regions, and the $i$-th spectral region is denoted $\lvert \hat{S}\rvert_{\mathrm{range},i} \in \mathbb{R}^{F_i \times T}$. As the major acoustic features lie in the low/mid frequency regions (e.g., the fundamental frequency F0), we follow a fine-to-coarse principle for region division, i.e., we gradually increase $F_i$ with the region index. For each spectral region, a separate strided Conv2d is utilized for spectral compression, followed by a layer normalization (LN) layer [Lei Ba et al., 2016]. The process is formulated as:

$$F^{\mathrm{in}}_{i} = \mathrm{LN}\left(\mathrm{Conv2d}\left(\lvert \hat{S}\rvert_{\mathrm{range},i}\right)\right) \in \mathbb{R}^{N_i \times C \times T}, \qquad (10)$$

where $F^{\mathrm{in}}_{i}$ is the compressed spectral feature of the $i$-th region, and $\{N_i, C\}$ denote the sub-band and channel sizes, respectively. Here $C$ is set to 256.

Table 1: Objective comparisons among different baselines on the LJSpeech benchmark. T and T-F refer to time-domain and T-F domain-based methods, respectively. "a×" denotes the speed-up ratio over real time. The inference speed on CPU is evaluated on an Intel(R) Core(TM) i7-14700F; on GPU, an NVIDIA GeForce RTX 4060 Ti. The best and second-best results are highlighted in bold and underlined, respectively.

| Models | Domain | #Param. (M) | #MACs (Giga/5s) | CPU RTF (speed-up) | GPU RTF (speed-up) | M-STFT | PESQ | MCD | Periodicity RMSE | V/UV F1 | Pitch RMSE | VISQOL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HiFiGAN-V1 | T | 13.94 | 152.90 | 0.1669 (5.99×) | 0.0069 (145×) | 1.1699 | 3.574 | 3.6711 | 0.1344 | 0.9474 | 33.6874 | 4.7706 |
| iSTFTNet-V1 | T | 13.26 | 107.76 | 0.0949 (10.54×) | 0.0038 (266×) | 1.1883 | 3.535 | 3.6252 | 0.1356 | 0.9466 | 35.1055 | 4.7557 |
| Avocodo | T | 13.94 | 152.95 | 0.1611 (6.21×) | 0.0063 (158×) | 1.1653 | 3.604 | 3.6570 | 0.1386 | 0.9462 | 32.9887 | 4.7714 |
| BigVGAN-base | T | 13.94 | 152.90 | 0.3856 (2.59×) | 0.0368 (27×) | 0.9784 | 3.603 | 2.3314 | 0.1198 | 0.9562 | 30.2774 | 4.8217 |
| BigVGAN | T | 112.18 | 417.20 | 0.6674 (1.50×) | 0.0507 (20×) | 0.9001 | 4.107 | 1.8769 | 0.0838 | 0.9716 | 20.6922 | 4.8699 |
| APNet | T-F | 72.19 | 31.11 | 0.03 (33.33×) | 0.0022 (458×) | 1.2659 | 3.390 | 3.2847 | 0.1508 | 0.9454 | 23.0571 | 4.6947 |
| APNet2 | T-F | 31.38 | 13.53 | 0.0187 (53.48×) | 0.0016 (643×) | 0.9815 | 3.492 | 2.8288 | 0.1126 | 0.9592 | 25.3629 | 4.7515 |
| FreeV | T-F | 18.19 | 7.84 | 0.0139 (71.94×) | 0.0011 (951×) | 1.0076 | 3.593 | 2.7502 | 0.1118 | 0.9603 | 25.9922 | 4.7427 |
| Vocos | T-F | 13.46 | 5.80 | 0.0066 (151.52×) | 0.0007 (1400×) | 1.0123 | 3.522 | 2.6699 | 0.1213 | 0.9559 | 29.1304 | 4.7741 |
| RNDVoC (Ours) | T-F | 3.14 | 34.10 | 0.0669 (14.96×) | 0.0043 (233×) | 0.9103 | 3.987 | 2.0471 | 0.0854 | 0.9714 | 21.3183 | 4.8373 |
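A shape-level sketch of this fine-to-coarse region splitting may help. The region boundaries, the strides, and the pooling stand-in for the learnable strided Conv2d below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

# Shape-level sketch of the hierarchical spectral encoder (HSEM): split the
# input into I = 3 frequency regions from fine to coarse, downsample each
# region along frequency with its own stride (a stand-in for the per-region
# strided Conv2d of Eq. 10), then concatenate along the sub-band axis.
# Region sizes/strides are illustrative; the last bin is dropped so each
# region size divides its stride.
F, T, C = 513, 100, 8
S_range = np.abs(np.random.default_rng(2).standard_normal((F, T)))

regions = [(0, 64, 2), (64, 192, 4), (192, 512, 8)]   # (start, end, stride)
features = []
for start, end, stride in regions:
    sub = S_range[start:end]                           # (F_i, T)
    F_i = sub.shape[0]
    # strided average pooling as a stand-in for a learnable strided Conv2d
    pooled = sub.reshape(F_i // stride, stride, T).mean(axis=1)   # (N_i, T)
    # broadcast to C channels as a stand-in for the channel projection
    features.append(np.repeat(pooled[None], C, axis=0))           # (C, N_i, T)

encoded = np.concatenate(features, axis=1)             # (C, N, T)
print(encoded.shape)  # (8, 104, 100): N = 64/2 + 128/4 + 320/8 = 104
```

The point of the from-fine-to-coarse strides is visible in the shapes: low frequencies keep more sub-bands per bin than high frequencies, so the sub-band axis N is much smaller than F.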
Similarly, the decoding process uses two branches for magnitude and phase estimation, respectively. Taking the magnitude branch as an example, the detailed structure is shown in Figure 3(c). The input feature $O \in \mathbb{R}^{N \times C \times T}$ is first split into $I$ regions, i.e., $\{O_1, \ldots, O_I\}$. For each region, we first adopt a separate point-wise Conv2d layer, followed by LN and a GELU activation. A transposed convolution (TrConv2d) is then utilized to recover the spectral target:

$$K_i = \mathrm{TrConv2d}\left(\mathrm{GELU}\left(\mathrm{LN}\left(\mathrm{Conv2d}_{1\times1}\left(O_i\right)\right)\right)\right). \qquad (11)$$

For the magnitude branch, $\exp(\cdot)$ is applied to guarantee non-negativity. For the phase branch, $K_i$ is split into real and imaginary components, and $\mathrm{Atan2}(\cdot)$ is adopted for phase calculation. Note that in [Yu et al., 2023], a similar band-split strategy is adopted, where the spectrum is split into $N$ non-uniform sub-bands, each separately encoded and decoded. The difference is that here the number of regions $I$ is usually far smaller than $N$,² so fewer loop operations are required, leading to faster inference. Besides, as the network weights within one spectral region are shared, fewer trainable parameters are needed.

²In our practical settings, $I = 3$ and $N = 24$.

Dual-Path Block

To facilitate spectral information modeling, $B = 6$ dual-path blocks (DPBs) are stacked, each shown in Figure 3(d). A DPB consists of a cross-band module and a narrow-band module, corresponding to sub-band and temporal modeling, respectively. The input feature $F^{(b)} \in \mathbb{R}^{N \times C \times T}$ is first reshaped and sent to the cross-band module (CBM). Motivated by [Quan and Li, 2024], we adopt a lightweight design for cross-band modeling, which involves three steps. First, an LN layer, followed by a grouped convolution with kernel size 3 along the sub-band axis and a PReLU activation [He et al., 2015], models the correlation between neighboring sub-bands:

$$\tilde{F}^{(b)} = F^{(b)} + \mathrm{PReLU}\left(\mathrm{GConv1d}\left(\mathrm{LN}\left(F^{(b)}\right)\right)\right). \qquad (12)$$
After that, a Conv1d layer followed by a SiLU activation [Ramachandran et al., 2017] squeezes the channel size to $C'$, four times smaller than $C$. A band mixer, instantiated as a linear matrix operation, then performs global sub-band shuffling, and another Conv1d+SiLU recovers the channel size:

$$\check{F}^{(b)} = \tilde{F}^{(b)} + \mathrm{SiLU}\left(\mathrm{Conv1d}\left(\mathrm{BandMixer}\left(\mathrm{SiLU}\left(\mathrm{Conv1d}\left(\mathrm{LN}\left(\tilde{F}^{(b)}\right)\right)\right)\right)\right)\right). \qquad (13)$$

Finally, to fully utilize the correlation between adjacent sub-bands, another grouped convolution with kernel size 3 is adopted with a residual connection:

$$F^{(b+1)} = \check{F}^{(b)} + \mathrm{PReLU}\left(\mathrm{GConv1d}\left(\mathrm{LN}\left(\check{F}^{(b)}\right)\right)\right). \qquad (14)$$

For the narrow-band module (NBM), given the success of ConvNeXt v2 blocks in the neural vocoder task [Siuzdak, 2024], we stack $P = 2$ ConvNeXt v2 blocks to gradually capture long-term relations among adjacent frames. Different from the full-band modeling in previous works, all sub-bands share the network parameters here. Besides, to decrease the computational cost, the number of channels in both the hidden and input layers remains unchanged, i.e., $C$.

3.4 Loss Function

Following the settings in [Ai and Ling, 2023; Du et al., 2023], we adopt both a reconstruction loss and an adversarial loss for vocoder training. The reconstruction loss consists of a log-amplitude loss $L_a$, phase loss $L_p$, real-imaginary loss $L_{ri}$, mel loss $L_{mel}$, and consistency loss $L_c$:

$$L_{rec} = \lambda_a L_a + \lambda_p L_p + \lambda_{ri} L_{ri} + \lambda_{mel} L_{mel} + \lambda_c L_c, \qquad (15)$$

where $\{\lambda_a, \lambda_p, \lambda_{ri}, \lambda_{mel}, \lambda_c\}$ are the corresponding hyperparameters.

Figure 5: Spectral visualization of different vocoder methods (HiFiGAN, iSTFTNet, Avocodo, BigVGAN-base, BigVGAN, APNet2, Vocos, and RNDVoC, each with a zoomed-in region). The audio clip is a singing voice from the MUSDB18 test set.
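The three cross-band steps of Eqs. (12)-(14) can be sketched in NumPy as follows. This is only a structural illustration: the kernel-3 grouped convolution is replaced by a fixed three-tap average, and LayerNorm and the channel squeezing to C' are omitted for brevity:

```python
import numpy as np

def prelu(x, a=0.25):
    return np.where(x >= 0, x, a * x)

def silu(x):
    return x / (1.0 + np.exp(-x))

# Minimal sketch of the cross-band module: local sub-band smoothing (stand-in
# for the grouped k=3 convolution of Eqs. 12/14), a global band mixer realized
# as an N x N matrix over the sub-band axis (Eq. 13), then local smoothing again.
rng = np.random.default_rng(3)
C, N, T = 8, 24, 50
feat = rng.standard_normal((C, N, T))

def local_band_conv(x):
    # depthwise kernel-3 "convolution" along the sub-band axis (zero-padded)
    pad = np.pad(x, ((0, 0), (1, 1), (0, 0)))
    return (pad[:, :-2] + pad[:, 1:-1] + pad[:, 2:]) / 3.0

mixer = rng.standard_normal((N, N)) / np.sqrt(N)   # global band mixer

h = feat + prelu(local_band_conv(feat))                   # Eq. (12), simplified
h = h + silu(np.einsum('mn,cnt->cmt', mixer, silu(h)))    # Eq. (13), simplified
out = h + prelu(local_band_conv(h))                       # Eq. (14), simplified
assert out.shape == (C, N, T)
```

The design choice worth noting is the locality split: the grouped convolutions only touch neighboring sub-bands cheaply, while the single dense N x N mixer is the only place where all sub-bands interact, keeping the cross-band path lightweight.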
In [Ai and Ling, 2023; Du et al., 2023], an anti-wrapping loss is devised, and the instantaneous phase (IP), group delay (GD), and instantaneous frequency (IF) are extracted via a specially designed sparse matrix.³ However, we find this formulation time-consuming, as most elements of the sparse matrix are zero and have no impact on the product. Besides, only two directions are considered in the phase differential operation, which limits the captured phase relations. To this end, we propose a novel omnidirectional phase loss, as shown in Figure 4. Specifically, we design nine 3x3 kernels with fixed parameters, $K = \mathrm{Cat}\left(K_1, \ldots, K_9\right) \in \mathbb{R}^{9 \times 3 \times 3}$, to traverse the differential relations with the eight adjacent T-F bins, while the fifth kernel returns the IP itself. With a single convolution, the phase differentials can thus be efficiently computed as:

$$\hat{\Phi}' = \hat{\Phi} * K, \qquad \Phi' = \Phi * K, \qquad (16)$$

where $\{\hat{\Phi}', \Phi'\} \subset \mathbb{R}^{9 \times F \times T}$ and $*$ denotes the convolution operation. We then calculate the phase loss with an anti-wrapping loss similar to [Du et al., 2023].

For the adversarial loss, the multi-period discriminator (MPD) [Kong et al., 2020] and the multi-resolution spectrogram discriminator (MRSD) [Jang et al., 2021] are utilized. The hinge GAN formulation is adopted, and the adversarial loss for the discriminator is:

$$L_D = \sum_{m} \left[\max\left(0, 1 - D_m(s)\right) + \max\left(0, 1 + D_m(\hat{s})\right)\right], \qquad (17)$$

where $D_m$ is the $m$-th discriminator. For the generator, the adversarial loss is:

$$L_g = \sum_{m} \max\left(0, 1 - D_m(\hat{s})\right). \qquad (18)$$
Besides, the feature matching loss is also utilized:

$$L_{fm} = \sum_{l,m} \left| f^{m}_{l}(\hat{s}) - f^{m}_{l}(s) \right|, \qquad (19)$$

where $f^{m}_{l}(\cdot)$ denotes the $l$-th layer feature of the $m$-th sub-discriminator. The final generator loss is:

$$L_G = L_{rec} + \lambda_g L_g + \lambda_{fm} L_{fm}, \qquad (20)$$

where $\{\lambda_g, \lambda_{fm}\}$ are the corresponding hyperparameters. Detailed settings can be found in the supplementary material.

³https://github.com/YangAi520/APCodec/blob/main/models.py

Table 2: Objective comparisons among baselines on the LibriTTS benchmark. "-" denotes results that are not reported; results for some baselines are calculated using the open-sourced model checkpoints.

| Models | PESQ | Periodicity RMSE | V/UV F1 | Pitch RMSE | VISQOL |
|---|---|---|---|---|---|
| WaveGlow-256 | 3.138 | 0.1485 | 0.9378 | - | - |
| HiFiGAN-V1 | 3.056 | 0.1671 | 0.9212 | 52.5285 | 4.7209 |
| iSTFTNet-V1 | 2.880 | 0.1672 | 0.9177 | 53.0724 | 4.6548 |
| UnivNet-c32 | 3.277 | 0.1305 | 0.9347 | 41.5110 | 4.7530 |
| Avocodo | 3.217 | 0.1611 | 0.9134 | 51.5998 | 4.7620 |
| BigVGAN-base (1M steps) | 3.519 | 0.1287 | 0.9459 | - | - |
| BigVGAN (1M steps) | 4.027 | 0.1018 | 0.9598 | - | - |
| BigVGAN-base (5M steps) | 3.841 | 0.1073 | 0.9540 | 32.5413 | 4.9067 |
| BigVGAN (5M steps) | 4.269 | 0.0790 | 0.9670 | 24.2814 | 4.9632 |
| APNet | 2.897 | 0.1586 | 0.9265 | 39.6629 | 4.6659 |
| APNet2 | 2.834 | 0.1529 | 0.9227 | 46.3732 | 4.5817 |
| Vocos | 3.615 | 0.1146 | 0.9484 | 35.5844 | 4.8785 |
| RNDVoC (Ours) | 4.226 | 0.0742 | 0.9698 | 23.9658 | 4.9154 |

4 Experiments

4.1 Datasets

Two benchmarks are employed in this study, namely LJSpeech [Keith and Linda, 2017] and LibriTTS [Zen et al., 2019]. The LJSpeech dataset includes 13,100 clean speech clips from a single female speaker, sampled at 22.05 kHz. Following the division in the open-sourced VITS repository,⁴ {12500, 100, 500} clips are used for training, validation, and testing, respectively. The LibriTTS dataset covers diverse recording environments, with a sampling rate of 24 kHz. Following the division in [Lee et al., 2023], the train-clean-100, train-clean-360, and train-other-500 subsets are used for model training.
The dev-clean and dev-other subsets are used for objective comparisons, and test-clean and test-other for subjective evaluations.

⁴https://github.com/jaywalnut310/vits/tree/main/filelists

Table 3: MUSHRA scores among different methods on the LibriTTS benchmark. The confidence level is 95%, and a t-test compares RNDVoC with BigVGAN (**: p < 0.05, *: p < 0.1).

| Models | MUSHRA |
|---|---|
| GT | 89.45 ± 0.45 |
| HiFiGAN-V1 | 69.99 ± 1.15 |
| Avocodo | 68.16 ± 1.26 |
| BigVGAN-base | 74.27 ± 1.12 |
| BigVGAN | 79.33 ± 0.92 |
| APNet2 | 54.95 ± 1.11 |
| Vocos | 73.18 ± 1.15 |
| RNDVoC | **80.74 ± 0.99 |

Table 4: Ablation studies conducted on the LJSpeech dataset.

| Id | Setting | PESQ | MCD | Periodicity RMSE | V/UV F1 | Pitch RMSE |
|---|---|---|---|---|---|---|
| 1 | Baseline | 3.987 | 2.0471 | 0.0854 | 0.9714 | 21.3183 |
| 2 | Remove omnidirectional phase loss | 3.892 | 2.2137 | 0.0889 | 0.9697 | 21.9124 |
| 3 | Remove RND mode | 3.655 | 2.4976 | 0.1114 | 0.9611 | 26.0120 |
| 4 | Set matrices A†, A as learnable | 3.645 | 2.5989 | 0.1164 | 0.9591 | 25.6292 |

Figure 6: Spectral visualization of the range-space and null-space components depending on whether A†, A are fixed. (a)-(c) The matrix parameters are fixed. (d)-(f) The matrix parameters are learnable.

4.2 Configurations

For the LJSpeech dataset, the number of mel bands Fm is set to 80, and the upper-bound frequency fmax is 8 kHz. For the LibriTTS dataset, 100 mel bands are used with fmax = 12 kHz. For both benchmarks, a 1024-point FFT is used, with a 1024-sample Hann window and a hop size of 256. A batch size of 16, a segment size of 16384, and an initial learning rate of 2e-4 are used for training. The AdamW optimizer [Loshchilov and Hutter, 2017] is employed, with β1 = 0.8 and β2 = 0.99. The generator and discriminator are each updated for 1 million steps.
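Under this configuration, the spectral dimensions can be sanity-checked with a hand-rolled, center-padded STFT (a generic implementation, not necessarily the paper's exact front end): a 16384-sample training segment yields 513 frequency bins and 65 frames.

```python
import numpy as np

# STFT bookkeeping for the LJSpeech setup: 1024-point FFT, 1024-sample Hann
# window, hop 256, applied to one 16384-sample training segment.
sr, n_fft, hop = 22050, 1024, 256
x = np.random.default_rng(4).standard_normal(16384)   # stand-in segment
win = np.hanning(n_fft)

# center-padded framing, as in common STFT implementations
xp = np.pad(x, n_fft // 2)
n_frames = 1 + (len(xp) - n_fft) // hop
frames = np.stack([xp[i * hop : i * hop + n_fft] * win for i in range(n_frames)])
spec = np.fft.rfft(frames, n=n_fft, axis=1).T          # (F, T)
print(spec.shape)  # (513, 65)
```

So F = 1024/2 + 1 = 513 linear-frequency bins (matching the F used in Section 3.2) and T = 16384/256 + 1 = 65 frames per training segment.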
4.3 Results and Analysis

In this study, various representative time-domain and T-F domain baselines are chosen for comparison, including HiFiGAN [Kong et al., 2020],⁵ iSTFTNet [Kaneko et al., 2022],⁶ Avocodo [Bak et al., 2023],⁷ WaveGlow [Prenger et al., 2019],⁸ UnivNet [Jang et al., 2021],⁹ BigVGAN [Lee et al., 2023],¹⁰ APNet [Ai and Ling, 2023], APNet2 [Du et al., 2023],¹¹ FreeV [Lv et al., 2024],¹² and Vocos [Siuzdak, 2024].¹³

⁵https://github.com/jik876/hifi-gan
⁶https://github.com/rishikksh20/iSTFTNet-pytorch
⁷https://github.com/ncsoft/avocodo

Table 5: Objective comparisons among lightweight neural vocoders.

| Models | #Param. (M) | Dataset | #MACs (Giga/5s) | PESQ | VISQOL |
|---|---|---|---|---|---|
| HiFiGAN-V2 | 0.92 | LJSpeech | 9.6 | 2.848 | 4.5419 |
| | | LibriTTS | 10.46 | 2.308 | 4.3695 |
| RNDVoC-Lite | 0.71 | LJSpeech | 9.54 | 3.769 | 4.7817 |
| | | LibriTTS | 10.39 | 3.834 | 4.8736 |
| RNDVoC-UltraLite | 0.08 | LJSpeech | 1.66 | 3.264 | 4.7006 |
| | | LibriTTS | 1.81 | 3.499 | 4.8044 |

Five metrics are involved in the objective evaluations: (1) Multi-resolution STFT (M-STFT) [Yamamoto et al., 2020] evaluates the spectral distance across multiple resolutions. (2) The wide-band version of the Perceptual Evaluation of Speech Quality (PESQ) [Rec, 2005] assesses objective speech quality. (3) Mel-cepstral distortion (MCD) [Kubichek, 1993] measures the difference between mel-spectrograms through dynamic time warping (DTW). (4) Periodicity RMSE, V/UV F1 score, and pitch RMSE [Morrison et al., 2022] capture what are regarded as the major artifacts of non-autoregressive neural vocoders. (5) The Virtual Speech Quality Objective Listener (VISQOL) [Hines et al., 2015] predicts the Mean Opinion Score-Listening Quality Objective (MOS-LQO) by evaluating spectro-temporal similarity. Beyond the objective metrics, we also conduct MUSHRA testing on the BeaqleJS platform [Kraft and Zölzer, 2014] for subjective evaluation. Thirty-five participants majoring in audio signal processing took part in the testing.
In the test, each participant rates the speech processed by the different algorithms on a scale from 0 to 100 in terms of overall similarity to the reference clips.

⁸https://github.com/NVIDIA/waveglow
⁹https://github.com/maum-ai/univnet
¹⁰https://github.com/NVIDIA/BigVGAN
¹¹https://github.com/redmist328/APNet2
¹²https://github.com/BakerBunker/FreeV
¹³https://github.com/gemelo-ai/vocos

Figure 7: Spectral visualization generated by three lightweight neural vocoders, namely HiFiGAN-V2, RNDVoC-Lite, and RNDVoC-UltraLite. The audio clip is a speech voice from the LibriTTS test set.

Comparisons with SOTA Methods

Tables 1 and 2 present the objective comparisons on the LJSpeech and LibriTTS datasets, respectively. Several observations can be made. First, T-F domain methods exhibit overall faster inference than time-domain methods. This is mainly because they use the STFT and its inverse (iSTFT), so no upsampling operation is required. Second, T-F domain methods have notably lower computational complexity overall, e.g., 5.8 GMACs for Vocos vs. 152.9 GMACs for HiFiGAN, which has made them increasingly attractive recently. Third, compared with BigVGAN, the speech quality of existing T-F domain neural vocoders is still notably inferior. Thanks to its fine-grained modeling along the time and sub-band axes, the proposed RNDVoC enjoys both a lightweight network structure and promising performance. Specifically, with less than 3% of the trainable parameters and 10% of the computational cost, our approach yields performance comparable to BigVGAN on the LJSpeech dataset, and better performance on the LibriTTS benchmark. Notably, even compared with BigVGAN trained for 5M steps, our method remains competitive, which fully validates the effectiveness of the proposed approach.
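For reference, a hedged sketch of a multi-resolution STFT distance in the spirit of the M-STFT metric used in Tables 1 and 2 is given below; the official metric's exact resolutions, windowing, and weighting [Yamamoto et al., 2020] may differ.

```python
import numpy as np
from scipy.signal import stft

def mstft_distance(ref, est, resolutions=((512, 128), (1024, 256), (2048, 512))):
    # Spectral-convergence plus log-magnitude L1 terms, averaged over several
    # illustrative (n_fft, hop) settings.
    total = 0.0
    for n_fft, hop in resolutions:
        _, _, R = stft(ref, nperseg=n_fft, noverlap=n_fft - hop)
        _, _, E = stft(est, nperseg=n_fft, noverlap=n_fft - hop)
        R, E = np.abs(R), np.abs(E)
        sc = np.linalg.norm(R - E) / (np.linalg.norm(R) + 1e-8)
        log_mag = np.mean(np.abs(np.log(R + 1e-8) - np.log(E + 1e-8)))
        total += sc + log_mag
    return total / len(resolutions)

rng = np.random.default_rng(5)
ref = rng.standard_normal(22050)            # 1 s of noise at 22.05 kHz
assert mstft_distance(ref, ref) == 0.0      # identical signals score zero
assert mstft_distance(ref, 0.5 * ref) > 0.0 # any spectral mismatch increases it
```

Evaluating at several FFT sizes is the key idea: a single resolution can hide errors that are localized in either time or frequency, while the average over resolutions penalizes both.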
Table 3 gives the MUSHRA results on the test set of the LibriTTS dataset. One can observe that our RNDVoC is superior to BigVGAN with a statistically significant difference (p < 0.05), further validating the advantage of our method in subjective quality. We notice that the subjective score of APNet2 is notably inferior to the other methods. According to feedback from the listeners, audible husky buzzing artifacts can arise in the utterances generated by APNet2, which biases its score downward. Figure 5 shows the spectral visualizations of the different models, where the clip is a vocal voice from the out-of-distribution MUSDB18 [Rafii et al., 2017] test set. Evidently, compared with the other baselines, our approach better recovers the harmonic details.

Ablation Studies
Table 4 presents the ablation studies on the LJSpeech dataset. First, we replace the proposed omnidirectional phase loss with that in [Du et al., 2023], as shown from id1 to id2. One can observe clear performance degradation in the objective metrics, indicating the effectiveness of the omnidirectional phase loss. Based on id2, we further remove the RND mode in target reconstruction, i.e., the explicit superimposition between |S|_range and (I − A†A)|S|_null is changed into direct target mapping. From id2 to id3, notable performance degradation can be observed, indicating the significance of the proposed RND modeling. Then, based on id2, we set the matrices A and A† as learnable, breaking the orthogonality between the two sub-spaces. From id2 to id4, we again observe performance degradation, indicating the significance of the orthogonality between the two sub-spaces. In Figure 6, we visualize the estimated components from the range-space and null-space depending on whether A and A† are fixed. One can observe that when the parameters are fixed, the null-space component, i.e., (I − A†A)|S|_null, captures the sparse and detailed harmonic structure, as shown in Figure 6(c).
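The role of the fixed projection pair in this ablation can be checked numerically. The following toy numpy sketch (the dimensions and the random stand-in for the mel filterbank A are assumptions for illustration, not the paper's actual filterbank) shows why the superimposition A†x + (I − A†A)|S|_null never violates mel-consistency when A and A† stay fixed:

```python
import numpy as np

# Toy dimensions (assumptions): M mel bins, F linear-frequency bins.
M, F = 10, 40
rng = np.random.default_rng(0)
A = rng.standard_normal((M, F))   # stand-in for the mel filterbank (full row rank)
A_pinv = np.linalg.pinv(A)        # A†, the F x M pseudo-inverse

s = rng.standard_normal(F)        # a toy linear-scale magnitude spectrum
x = A @ s                         # its mel-scale observation

range_part = A_pinv @ x                 # range-space term, fixed linear shift
null_proj = np.eye(F) - A_pinv @ A      # (I - A†A): projector onto null(A)
detail = rng.standard_normal(F)         # stand-in for the network's |S|_null
null_part = null_proj @ detail

# The null-space term is invisible to A, so adding it cannot break
# consistency with the mel observation, and the two parts are orthogonal.
recon = range_part + null_part
assert np.allclose(A @ recon, x)
assert np.allclose(A @ null_part, np.zeros(M), atol=1e-9)
assert abs(float(range_part @ null_part)) < 1e-9
```

Once A and A† are made learnable, A† is generally no longer the pseudo-inverse of A, (I − A†A) stops being a projector onto null(A), and these identities fail, which matches the non-sparse null-space estimates observed in Figure 6.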
In contrast, in the learnable scenario, the estimate of the null-space component is no longer sparse, indicating that the orthogonality property may no longer hold.

Toward Lightweight Design
To meet the requirements of edge devices in practical scenarios, we investigate the potential of our method for lightweight design. Concretely, the number of DPBs B is reduced to 4. When the channel size C is squeezed to 128 and 32, we obtain the lite and ultra-lite versions, abbreviated as RNDVoC-Lite and RNDVoC-UltraLite, respectively. Table 5 presents the performance of the different lightweight models. Evidently, thanks to the efficacy of RND in information preservation, where the learnable network only needs to concentrate on the null-space part, our method still notably outperforms HiFi-GAN-V2 in PESQ and VISQOL even with only 0.08M parameters, which validates the effectiveness of our method. In Figure 7, we present the spectral representations reconstructed by the three lightweight vocoders. Notably, while HiFi-GAN-V2 exhibits significant harmonic blurring (see the blue-boxed zoomed region), both RNDVoC-Lite and RNDVoC-UltraLite retain clear harmonic details. This contrast further highlights the superiority of our approach in lightweight design scenarios.

5 Conclusion

In this paper, we propose an innovative T-F domain-based neural vocoder. We bridge the connection with the classical range-null decomposition theory, where the range-space converts the original acoustic feature in the mel-scale domain into the target linear-scale domain, and the null-space reconstructs the remaining spectral details. Based on that, a novel dual-path network structure is devised, where the cross-band and narrow-band modules efficiently model the relations along the sub-band and time axes. Extensive experiments are conducted on the LJSpeech and LibriTTS benchmarks to validate the efficacy of the proposed method.

References

[Ai and Ling, 2023] Yang Ai and Zhen-Hua Ling.
APNet: An all-frame-level neural vocoder incorporating direct prediction of amplitude and phase spectra. IEEE/ACM Trans. Audio, Speech, Lang. Process., 31:2145–2157, 2023.
[Bak et al., 2023] Taejun Bak, Junmo Lee, Hanbin Bae, Jinhyeok Yang, Jae-Sung Bae, and Young-Sun Joo. Avocodo: Generative adversarial network for artifact-free vocoder. In Proc. AAAI, volume 37, pages 12562–12570, 2023.
[Baraniuk et al., 2010] Richard G Baraniuk, Volkan Cevher, Marco F Duarte, and Chinmay Hegde. Model-based compressive sensing. IEEE Trans. Inf. Theory, 56(4):1982–2001, 2010.
[Chen et al., 2021] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating Gradients for Waveform Generation. In Proc. ICLR, 2021.
[Dinh et al., 2016] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
[Du et al., 2023] Hui-Peng Du, Ye-Xin Lu, Yang Ai, and Zhen-Hua Ling. APNet2: High-Quality and High-Efficiency Neural Vocoder with Direct Prediction of Amplitude and Phase Spectra. In Proc. NCMMSC, pages 66–80. Springer, 2023.
[He et al., 2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proc. ICCV, pages 1026–1034, 2015.
[Hines et al., 2015] Andrew Hines, Jan Skoglund, Anil C Kokaram, and Naomi Harte. ViSQOL: an objective speech quality model. EURASIP J. Audio Speech Music Process., 2015:1–18, 2015.
[Huang et al., 2022a] Rongjie Huang, Max WY Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, and Zhou Zhao. FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis. In Proc. IJCAI, pages 4157–4163, 2022.
[Huang et al., 2022b] Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, Chenye Cui, and Yi Ren.
ProDiff: Progressive fast diffusion model for high-quality text-to-speech. In Proc. ACM MM, pages 2595–2605, 2022.
[Jang et al., 2021] Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation. In Proc. Interspeech, pages 2207–2211, 2021.
[Juvela et al., 2019] Lauri Juvela, Bajibabu Bollepalli, Vassilis Tsiaras, and Paavo Alku. GlotNet: a raw waveform model for the glottal excitation in statistical parametric speech synthesis. IEEE/ACM Trans. Audio, Speech, Lang. Process., 27(6):1019–1030, 2019.
[Kalchbrenner et al., 2018] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. In Proc. ICML, pages 2410–2419. PMLR, 2018.
[Kaneko et al., 2022] Takuhiro Kaneko, Kou Tanaka, Hirokazu Kameoka, and Shogo Seki. iSTFTNet: Fast and lightweight mel-spectrogram vocoder incorporating inverse short-time Fourier transform. In Proc. ICASSP, pages 6207–6211. IEEE, 2022.
[Kawahara, 2006] Hideki Kawahara. STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds. Acoustical Science and Technology, 27(6):349–353, 2006.
[Keith and Linda, 2017] I. Keith and J. Linda. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/, 2017. Accessed: 2025-01-12.
[Kim et al., 2019] Sungwon Kim, Sang-gil Lee, Jongyoon Song, Jaehyeon Kim, and Sungroh Yoon. FloWaveNet: A Generative Flow for Raw Audio. In Proc. ICML, pages 3370–3378. PMLR, 2019.
[Kong et al., 2020] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. In Proc. NeurIPS, pages 17022–17033, 2020.
[Kong et al., 2021] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A Versatile Diffusion Model for Audio Synthesis.
In Proc. ICLR, 2021.
[Kraft and Zölzer, 2014] Sebastian Kraft and Udo Zölzer. BeaqleJS: HTML5 and JavaScript based framework for the subjective evaluation of audio quality. In Linux Audio Conference, Karlsruhe, DE, 2014.
[Kubichek, 1993] Robert Kubichek. Mel-cepstral distance measure for objective speech quality assessment. In Proceedings of IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, volume 1, pages 125–128. IEEE, 1993.
[Kumar et al., 2019] Kundan Kumar, Rithesh Kumar, Thibault De Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre De Brebisson, Yoshua Bengio, and Aaron C Courville. MelGAN: Generative adversarial networks for conditional waveform synthesis. Proc. NeurIPS, 32, 2019.
[Lee et al., 2022] Sang-gil Lee, Heeseung Kim, Chaehun Shin, Xu Tan, Chang Liu, Qi Meng, Tao Qin, Wei Chen, Sungroh Yoon, and Tie-Yan Liu. PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior. In Proc. ICLR, 2022.
[Lee et al., 2023] Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. BigVGAN: A Universal Neural Vocoder with Large-Scale Training. In Proc. ICLR, 2023.
[Lei Ba et al., 2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[Li et al., 2022] Andong Li, Shan You, Guochen Yu, Chengshi Zheng, and Xiaodong Li. Taylor, Can You Hear Me Now? A Taylor-Unfolding Framework for Monaural Speech Enhancement. In Proc. IJCAI, pages 4193–4200, 2022.
[Li et al., 2025] Andong Li, Zhihang Sun, Fengyuan Hao, Xiaodong Li, and Chengshi Zheng. Neural vocoders as speech enhancers. arXiv preprint arXiv:2501.13465, 2025.
[Liu et al., 2023] Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. In Proc.
ICML, pages 21450–21474. PMLR, 2023.
[Loshchilov and Hutter, 2017] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[Lu et al., 2022] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Proc. NeurIPS, 35:5775–5787, 2022.
[Lv et al., 2024] Yuanjun Lv, Hai Li, Ying Yan, Junhui Liu, Danming Xie, and Lei Xie. FreeV: Free Lunch For Vocoders Through Pseudo-Inversed Mel Filter. In Proc. Interspeech, pages 3869–3873, 2024.
[Morise et al., 2016] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. Syst., 99(7):1877–1884, 2016.
[Morrison et al., 2022] Max Morrison, Rithesh Kumar, Kundan Kumar, Prem Seetharaman, Aaron Courville, and Yoshua Bengio. Chunked Autoregressive GAN for Conditional Waveform Synthesis. In Proc. ICLR, 2022.
[Oord et al., 2018] Aaron Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proc. ICML, pages 3918–3926. PMLR, 2018.
[Ping et al., 2020] Wei Ping, Kainan Peng, Kexin Zhao, and Zhao Song. WaveFlow: A compact flow-based model for raw audio. In Proc. ICML, pages 7706–7716. PMLR, 2020.
[Prenger et al., 2019] Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In Proc. ICASSP, pages 3617–3621. IEEE, 2019.
[Quan and Li, 2024] Changsheng Quan and Xiaofei Li. SpatialNet: Extensively learning spatial information for multichannel joint speech separation, denoising and dereverberation. IEEE/ACM Trans. Audio, Speech, Lang. Process., 32:1310–1323, 2024.
[Rafii et al., 2017] Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. The MUSDB18 corpus for music separation. 2017.
[Ramachandran et al., 2017] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
[Rec, 2005] ITU-T Rec. P.862.2: Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs. International Telecommunication Union, CH-Geneva, 41:48–60, 2005.
[Siuzdak, 2024] Hubert Siuzdak. Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis. In Proc. ICLR, 2024.
[Valin and Skoglund, 2019] Jean-Marc Valin and Jan Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. In Proc. ICASSP, pages 5891–5895. IEEE, 2019.
[Van Den Oord et al., 2016] Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu, et al. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 12, 2016.
[Wang et al., 2017] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards End-to-End Speech Synthesis. In Proc. Interspeech, pages 4006–4010, 2017.
[Woo et al., 2023] Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In Proc. CVPR, pages 16133–16142, 2023.
[Yamamoto et al., 2020] Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In Proc. ICASSP, pages 6199–6203. IEEE, 2020.
[Yu et al., 2023] Jianwei Yu, Yi Luo, Hangting Chen, Rongzhi Gu, and Chao Weng.
High Fidelity Speech Enhancement with Band-split RNN. In Proc. Interspeech, pages 2483–2487, 2023.
[Zen et al., 2019] Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech. In Proc. Interspeech, pages 1526–1530, 2019.
[Zhou et al., 2024] Rui Zhou, Xian Li, Ying Fang, and Xiaofei Li. Mel-FullSubNet: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR. arXiv preprint arXiv:2402.13511, 2024.