# Visual Hallucination Elevates Speech Recognition

Fang Zhang1,2, Yongxin Zhu1,2, Xiangxiang Wang3, Huang Chen3, Xing Sun3, Linli Xu1,2
1School of Computer Science and Technology, University of Science and Technology of China
2State Key Laboratory of Cognitive Intelligence
3Tencent YouTu Lab
{fangzhang,zyx2016}@mail.ustc.edu.cn, {xenoswang,huaangchen,winfredsun}@tencent.com, linlixu@ustc.edu.cn

## Abstract

Due to the detrimental impact of noise on the conventional audio speech recognition (ASR) task, audio-visual speech recognition (AVSR) has been proposed, which incorporates both audio and visual video signals. Although existing methods have demonstrated that the aligned visual input of lip movements can enhance the robustness of AVSR systems against noise, paired videos are not always available during inference, leading to the problem of the missing visual modality, which restricts their practicality in real-world scenarios. To tackle this problem, we propose a Discrete Feature based Visual Generative Model (DFVGM) which exploits semantic correspondences between the audio and visual modalities during training and generates visual hallucinations in lieu of real videos during inference. The primary challenge is to generate the visual hallucination given noisy audio while preserving semantic correspondences with the clean speech. To tackle this challenge, we start by training the audio encoder in the Audio-Only (AO) setting, which produces continuous semantic features closely associated with the linguistic information. Simultaneously, the visual encoder is trained in the Visual-Only (VO) setting, producing visual features that are phonetically related. Next, we employ K-means to discretize the continuous audio and visual feature spaces. The discretization step allows DFVGM to capture high-level semantic structures that are more resilient to noise and to generate visual hallucinations of high quality. To evaluate the effectiveness and robustness of our approach, we conduct extensive experiments on two publicly available datasets. The results demonstrate that our method achieves a remarkable 53% relative reduction (30.5% → 12.9%) in Word Error Rate (WER) on average compared to the current state-of-the-art Audio-Only (AO) baselines, while maintaining comparable results (< 5% difference) under the Audio-Visual (AV) setting even without video as input.

## Introduction

Recognizing speech is essential for natural human-computer interaction, facilitating accessibility for individuals with disabilities, and advancing various applications such as virtual assistants, transcription services, and voice-controlled technologies. In recent years, end-to-end Automatic Speech Recognition (ASR) based on deep learning (Graves, Mohamed, and Hinton 2013; Hinton et al. 2012) has become the standard approach. Meanwhile, the quality and intelligibility of speech recognition are highly vulnerable to noise and may degrade dramatically with corrupted speech (Vincent et al. 2017; Kinoshita et al. 2020). Therefore, enhancing noise robustness is crucial for ASR systems. Motivated by the fact that lip movements in a video are not affected by noisy environments, Audio-Visual Speech Recognition (AVSR) has been proposed to transcribe text from both audio and visual streams.
Various studies have confirmed the significant superiority of Audio-Visual (AV) models over their Audio-Only (AO) counterparts in diverse noisy scenarios (Son Chung et al. 2017; Petridis et al. 2018a,b), as well as in handling overlapping speech (Rose et al. 2021; Yu et al. 2020). However, the paired visual input is not always accessible at inference time, which is quite common in practice. For instance, the speaker may step away or be eating, the audio and lips may go out of sync, or the camera and recording devices may be turned off. These situations significantly limit the applicability of AVSR methods in real-life scenarios, as most of them are unable to handle the absence of the visual modality.

The conventional approach to the missing modality problem is modality translation, which reconstructs the absent modality by leveraging information from the available modalities. In the AVSR task, this becomes more challenging due to the corruption of the audio modality. Hegde et al. propose to generate accurate lip movements given noisy audio: a pretrained speech-to-lip model, Wav2Lip (Prajwal et al. 2020), serves as a teacher network that generates precise lip movements from clean speech, and a student network is subsequently trained to imitate the teacher's lip movements given noisy speech. However, a major limitation of this method is its disregard of the high-level semantic relationships between the audio and visual modalities, resulting in pseudo videos with low information density. Therefore, an additional visual encoder is required in (Hegde et al. 2021) to extract semantic features with a higher correlation to the speech content.

In this paper, to effectively address the aforementioned challenges regarding visual modality dropout, we directly model the semantic relationships between the audio and visual modalities in discrete feature spaces rather than in the continuous real-valued feature space. Firstly, we pretrain the audio and visual encoders through Audio Speech Recognition (ASR) and Lip Reading Recognition tasks respectively. These tasks enable the encoders to generate continuous semantic features that are strongly associated with phonetic and linguistic information (Shi et al. 2021). Subsequently, we apply K-means (Lloyd 1982) clustering to discretize the feature spaces into audio and visual codebooks. Discrete encoding has the following advantages over continuous embedding. Firstly, the audio codebook is inferred from clean audio tracks, which are noise-invariant. Intuitively, noisy audio is not far from its clean counterpart in the feature space; therefore, by finding the nearest neighbor in the codebook, the noisy audio can be partly converted to its clean counterpart in the discrete feature space, which reduces the noise to some extent. Furthermore, the discrete feature spaces facilitate capturing high-level semantic structures and identifying semantic correlations. After discretizing both video and audio features into token sequences using the codebooks, the Discrete Feature based Visual Generative Model (DFVGM), which employs an encoder-decoder architecture, is trained to generate visual sequences from clean or noisy audio sequences in an auto-regressive manner.
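To make the discretization step concrete, the following is a minimal sketch (not the authors' released code; the cluster count, helper names, and pre-extracted features are assumptions) that fits a K-means codebook on frame-level encoder features pooled over the dataset and maps a feature sequence to discrete tokens by nearest-neighbor lookup:

```python
# Minimal sketch of the codebook construction and token lookup described above;
# the cluster count and pre-extracted features are illustrative assumptions.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

K_A = 200  # hypothetical number of audio clusters

def build_codebook(feature_sequences, n_clusters=K_A):
    """Fit K-means on frame-level features pooled over the whole dataset."""
    all_frames = np.concatenate(feature_sequences, axis=0)   # (sum of t_m, d)
    kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=0)
    kmeans.fit(all_frames)
    return kmeans.cluster_centers_                           # codebook, shape (K_A, d)

def tokenize(features, codebook):
    """Map each frame of one recording to the index of its nearest codebook entry."""
    # (T, 1, d) - (1, K, d) -> squared distances of shape (T, K)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)                              # token sequence x = [x_1, ..., x_T]
```

The visual codebook and visual token sequences would be obtained in the same way from the lip-reading encoder's features.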
Furthermore, to strengthen the short-range dependencies between audio and visual tokens, we introduce a distance penalty on the cross-attention logits so that the visual tokens focus more on audio tokens at closer time steps. Finally, we train the fusion module and decoder with an additional consistency loss based on the KL divergence to reduce the mismatch between the real visual tokens and the pseudo visual tokens generated by our DFVGM. During inference, our model requires only audio signals and instead employs visual hallucinations for multimodal fusion, as illustrated in Figure 1.

Figure 1: We propose DFVGM to generate visual features based on audio features. During inference, the fusion module takes audio features and pseudo visual features as input. In comparison, prior AVSR approaches require multimodal inputs.

Our key contributions are summarized as follows:

- We investigate the scenario where the visual modality is completely missing during inference in audio-visual speech recognition.
- We introduce a novel Discrete Feature based Visual Generative Model, which captures semantic correspondences between the visual and audio modalities in discrete spaces during training and generates visual hallucinations in lieu of real visual inputs during inference.
- Extensive experiments on two public datasets demonstrate that our visual hallucinations can be leveraged to improve the robustness and performance of AO models. Our approach outperforms state-of-the-art AO baselines by a large margin, achieving an average reduction of 53% in WER across different SNR levels. By utilizing ground-truth visual information as input, our approach achieves an absolute WER reduction of 1.2% over other state-of-the-art AV baselines.

## Related Work

### Audio-Visual Speech Recognition

Most existing AVSR systems share a similar architecture composed of an audio encoder, a visual encoder, a fusion module, and a decoder (Pan et al. 2022), as shown in Figure 2(a). Previous works have made improvements specifically to these four components.

Typically, both the audio encoder and the visual encoder consist of two components: a front-end and a back-end. For the front-end, Pan et al. observe that utilizing pre-trained models such as Wav2Vec (Baevski et al. 2020) and MoCo (Chen et al. 2020) to initialize the parameters of the front-ends enhances the performance of AVSR. The purpose of the back-end is to model temporal relationships, where sequence models such as RNNs and LSTMs have been widely employed in previous works (Makino et al. 2019). Besides, earlier works (Ma, Petridis, and Pantic 2021; Burchi and Timofte 2023) demonstrate that the Conformer architecture (Gulati et al. 2020) can better capture temporal information both locally and globally while progressively down-sampling the temporal sequence to reduce the computational overhead.

For the fusion module, the most common strategy is concatenating the two context vectors over the channel dimension (Afouras et al. 2018). However, it has been pointed out that the straightforward concatenation of features fails to provide insight into the level of reliability of a particular data stream (Potamianos et al. 2003). To address that, multi-modality attention has been proposed to shift attention towards the most reliable input modality (Zhou et al. 2019). Similarly, AV-Rel Score (Hong et al. 2023) computes reliability scores for each time step, indicating how much the current audio features and visual features contribute to recognizing speech. Inspired by speech enhancement (Benesty, Makino, and Chen 2006), V-CAFE (Hong et al. 2022) introduces a noise reduction mask when encoding audio features, aiming to reduce noise in the audio representations.
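As a rough illustration of this attention-based fusion idea, the sketch below computes a per-time-step reliability weight for each modality and mixes the two streams accordingly. It is a generic sketch, not the exact AV-Rel Score or V-CAFE formulation; the layer names and feature shapes are assumptions.

```python
# Generic sketch of attention-based modality fusion (not the exact AV-Rel Score
# or V-CAFE formulation); layer names and feature shapes are assumptions.
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # One scalar reliability score per modality and per time step.
        self.score_a = nn.Linear(dim, 1)
        self.score_v = nn.Linear(dim, 1)

    def forward(self, feat_a, feat_v):
        # feat_a, feat_v: (B, T, dim), assumed to be time-aligned.
        scores = torch.cat([self.score_a(feat_a), self.score_v(feat_v)], dim=-1)
        weights = torch.softmax(scores, dim=-1)               # (B, T, 2)
        fused = weights[..., 0:1] * feat_a + weights[..., 1:2] * feat_v
        return fused                                          # (B, T, dim)
```

A plain concatenation baseline would instead stack feat_a and feat_v along the channel dimension without any per-time-step weighting.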
For the decoder, the Connectionist Temporal Classification (CTC) loss (Graves et al. 2006) and the Sequence-to-Sequence loss (Sutskever, Vinyals, and Le 2014) based on cross-entropy are widely applied in end-to-end speech recognition systems. A hybrid CTC/attention architecture (Petridis et al. 2018b) has recently been proposed to address the limitations of both CTC and attention models: it seeks to enforce monotonic alignment while also eliminating the conditional independence assumption. Besides, an additional language model trained separately on a large corpus of text data has been shown to be effective in augmenting the decoding process by incorporating linguistic context (Ma, Petridis, and Pantic 2021).

### Missing Modality

The missing modality problem has received significant attention in the multimodal learning community, spanning diverse applications including emotion recognition (Zhao, Li, and Jin 2021), medical image segmentation (Azad et al. 2022), and audio-visual expression recognition (Parthasarathy and Sundaram 2020). Specifically, in the domain of AVSR, an initial endeavor (Chang et al. 2022) involves a cascaded model wherein, for every video frame, the model follows the AV path if the frame is available and resorts to the AO path otherwise. On top of this, modality dropout (Shi et al. 2021; Shi, Hsu, and Mohamed 2022) is proposed to address the input discrepancy by masking the full features of one modality before fusing the audio and visual inputs. However, simply ignoring the visual modality and focusing more on the audio inputs degrades performance in noisy settings.

## Method

In this section, we introduce the training pipeline. Initially, to obtain continuous audio and visual feature spaces containing semantic information, we follow (Pan et al. 2022) to pretrain the audio encoder and visual encoder separately in the audio-only (AO) and visual-only (VO) settings, where the audio front-end is initialized using Wav2Vec (Baevski et al. 2020) and the visual front-end is initialized by MoCo V2 (Chen et al. 2020), which is first trained on the LRW dataset (Son Chung et al. 2017). We then discretize the continuous audio and visual feature spaces by applying K-means clustering, resulting in audio and visual tokens. After that, DFVGM is trained to learn the mapping between visual and audio token sequences. Finally, we further train the fusion module and decoder with the discrete audio and visual tokens as inputs. The complete training pipeline is illustrated in Figure 2.

Figure 2: Training pipeline of our model. Yellow blocks represent newly initialized parameters, while blue blocks represent parameters that are inherited from the last stage.

### Codebook

Suppose we are given a dataset with paired clean audio and visual recordings: $A = \{a_m\}_{m=1}^{N}$, $V = \{v_m\}_{m=1}^{N}$, where $N$ is the number of pairs in the dataset. The audio recording $a_m$ is processed by an audio encoder, producing the sequence $\{f_m^t\}_{t=1}^{t_m}$, where $t_m$ is the length of the $m$-th audio feature sequence. We collect the sequences of all audio clips in the dataset, $F_A = \{f_1^1, f_1^2, \ldots, f_1^{t_1}, \ldots, f_m^1, \ldots, f_m^{t_m}, \ldots, f_N^1, \ldots, f_N^{t_N}\}$, to which K-means is applied to quantize the audio feature space and produce $K_A$ clusters which constitute the audio codebook. We denote the audio codebook as $E_A = \{e_a^k \in \mathbb{R}^d\}_{k=1}^{K_A}$, where $d$ is the dimension of each cluster center.
For the features of an audio recording, by finding the nearest neighbors in the audio codebook, we obtain an audio discrete token sequence $x = [x_1, \ldots, x_T]$, where $x_i \in \{1, \ldots, K_A\}$ is the index of its nearest audio cluster in $E_A$. Similarly, we can obtain the visual codebook $E_V = \{e_v^k\}_{k=1}^{K_V}$, where $K_A$ is not equal to $K_V$ in general, and a visual discrete token sequence $y = [y_1, \ldots, y_T]$, where $y_i \in \{1, \ldots, K_V\}$.

Discrete encoding has the following advantages compared to continuous encoding. Firstly, the audio and visual codebooks derived from clean inputs exhibit higher-level semantic structures, which facilitates exploring the semantic relationships between the two modalities. Additionally, in the continuous feature space, the features of noisy audio are not significantly distant from their clean counterparts; intuitively, by identifying the nearest neighbor in the codebook, the noisy audio is mapped to the same cluster as its clean counterpart. Furthermore, discrete encoding enables sequence-to-sequence generation with a cross-entropy loss, similar to natural language processing (NLP), which avoids the problem of continuous representations collapsing to the mean value.

### DFVGM

The proposed DFVGM is a transformer-based model consisting of several encoder and decoder layers, designed to generate visual hallucinations that have high semantic correspondence with the audio modality. We model the conditional probability distribution in an auto-regressive way as follows:

$$p(y \mid x) = \prod_{i=1}^{T} p(y_i \mid y_{<i}, x) \qquad (1)$$

The objective function of DFVGM is the cross-entropy loss:

$$\mathcal{L}_{CE} = -\log p(y \mid x) \qquad (2)$$

**Noise Augmented Training.** Previous works (Xu et al. 2020; Ma, Petridis, and Pantic 2021) add noise to the clean audio, sampled at a fixed or random signal-to-noise ratio (SNR), to boost robustness to noise. We extend this noise-augmented training to DFVGM with the assumption that both noisy and clean audio tokens should correspond to the same visual tokens. By adding diverse noise to an audio stream during training and then discretizing it, we obtain the noisy audio token sequence. Our DFVGM models the conditional probability based on the noisy or clean audio sequences, thereby narrowing the domain gap between noisy and clean audio.

**Distance Penalty.** The task of transforming audio tokens into visual tokens differs from machine translation in two aspects. Firstly, the audio tokens and visual tokens should be of equal length, considering that they are synchronized in time and need to be concatenated in the fusion module. Secondly, due to the down-sampling of the audio and visual encoders, $x_i$ and $y_i$ represent audio and visual features over a period of time and are strongly correlated, which requires the model to have stronger short-range modeling capabilities. As a result, the vanilla transformer architecture is likely to perform poorly on this task. We take inspiration from recent work on speech-to-speech translation (Di Gangi, Negri, and Turchi 2019) and apply a distance penalty to each head of the cross-attention layers in the decoder:
$$\mathrm{CrossAttn} = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_{model}}} - \pi(D)\right)V \qquad (3)$$

where $d_{model}$ is the attention dimension, $K, V \in \mathbb{R}^{T \times d_{model}}$ are the key and value inputs from the encoder output, and $Q \in \mathbb{R}^{T \times d_{model}}$ is the query input from the preceding self-attention layer. $D \in \mathbb{R}^{T \times T}$ is the position distance matrix representing the time difference between the audio token $x_i$ and the visual token $y_j$, i.e., $D_{ij} = |i - j|$. $\pi(\cdot)$ is a hard-coded function that penalizes the attention logits as follows:

$$\pi(D)_{ij} = \begin{cases} 0 & \text{if } D_{ij} < R \\ \ln(D_{ij}) & \text{otherwise} \end{cases} \qquad (4)$$

where $R$ is a hyper-parameter controlling the range of the local dependency. In this way, the visual token $y_j$ pays more attention to the audio tokens $x_{j-R+1}, \ldots, x_j, \ldots, x_{j+R-1}$. For the audio tokens outside of this window, the logarithmic penalty biases the model against long-range dependencies, yet it increases only slowly with distance, which still allows modeling global dependencies.

During inference, to generate the pseudo visual sequence with the same length as the audio sequence, we set the number of inference steps to $T$, which results in $\hat{y} = [\hat{y}_1, \ldots, \hat{y}_T]$ with

$$\hat{y}_i = \arg\max_{k \in \{1, \ldots, K_V\}} \mathrm{DFVGM}(\hat{y}_i = k \mid \hat{y}_{<i}, x)$$
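For concreteness, here is a minimal sketch of how the distance-penalized cross-attention of Eqs. (3)-(4) could be implemented; the exact code is not published with the paper, and the window radius value, batching, and single-head formulation are assumptions.

```python
# Sketch of the distance-penalized cross-attention of Eqs. (3)-(4);
# the window radius and single-head formulation are illustrative assumptions.
import torch

def distance_penalty(T, R):
    """pi(D): zero inside a window of radius R, logarithmic outside it."""
    idx = torch.arange(T)
    D = (idx[:, None] - idx[None, :]).abs().float()           # D_ij = |i - j|
    return torch.where(D < R, torch.zeros_like(D), torch.log(D.clamp(min=1.0)))

def penalized_cross_attention(Q, K, V, R=4):
    """Q: (B, T, d) visual-side queries; K, V: (B, T, d) audio-side keys/values."""
    d_model = Q.size(-1)
    logits = Q @ K.transpose(-1, -2) / d_model ** 0.5          # (B, T, T)
    logits = logits - distance_penalty(Q.size(1), R).to(logits)
    attn = torch.softmax(logits, dim=-1)                       # attend over audio tokens
    return attn @ V
```

Since the penalty matrix depends only on the sequence length and the window radius, it can be precomputed once and reused across batches and attention heads.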