AudioGenX: Explainability on Text-to-Audio Generative Models

Hyunju Kang1*, Geonhee Han1*, Yoonjae Jeong2, Hogun Park1
1Department of Artificial Intelligence, Sungkyunkwan University, Suwon, Republic of Korea
2Audio AI Lab, NCSOFT, Seongnam, Republic of Korea
{neutor, gunhee8178}@skku.edu, hybris75@gmail.com, hogunpark@skku.edu

Abstract

Text-to-audio generation (TAG) models have achieved significant advances in generating audio conditioned on text descriptions. However, a critical challenge lies in the lack of transparency regarding how each textual input impacts the generated audio. To address this issue, we introduce AudioGenX, an Explainable AI (XAI) method that provides explanations for text-to-audio generation models by highlighting the importance of input tokens. AudioGenX optimizes an Explainer by leveraging factual and counterfactual objective functions to provide faithful explanations at the audio-token level. This method offers a detailed and comprehensive understanding of the relationship between text inputs and audio outputs, enhancing both the explainability and trustworthiness of TAG models. Extensive experiments demonstrate the effectiveness of AudioGenX in producing faithful explanations, benchmarked against existing methods using novel evaluation metrics specifically designed for audio generation tasks.

Introduction

Text-to-audio generation (TAG) models (Kreuk et al. 2023; Ziv et al. 2024; Yang et al. 2023; Liu et al. 2023; Schneider et al. 2023) have emerged as a pivotal technology in generative AI, enabling textual content to be transformed into an auditory experience. Although models such as AudioGen (Kreuk et al. 2023) excel at generating high-quality audio from textual prompts, a critical challenge remains: the lack of transparency in how each textual input affects the generated audio.
Consequently, users may struggle to trust the model, making it essential to provide explanations for the TAG task. Explainability provides several key advantages. First, it enhances awareness of how input tokens affect the model's outputs, enabling users to verify that the model emphasizes the correct aspects of the text. Second, it provides actionable insights to support decisions about which elements to modify, and to what extent, in the audio editing process. Third, analyzing generated explanations can aid in debugging and identifying potential biases. Accordingly, this study argues that the ability to quantify the importance of textual inputs in TAG models is crucial to unambiguously assessing and communicating their value.

*Equal contribution. Corresponding author. Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

[Figure 1: Explanations for the prompt "A vehicle is in motion, and the horn sounds": a comprehensive explanation provided by AudioGenX for the entire audio in (a), and granular explanations for the interval from 1 to 1.5 seconds in (b) and from 2.5 to 3 seconds in (c).]

While approaches specifically tailored for explaining TAG models are limited, recent research has explored methodologies for calculating the importance of input tokens in large-scale Transformer-based models. Cross-attention layers in multi-modal architectures, such as those in TAG models, are widely regarded as critical for integrating textual and auditory information, while also enhancing explainability by revealing how information from one modality influences another. A notable method (Abnar and Zuidema 2020) utilizes attention weights and aggregates them across all layers to approximate the importance of each input token.
However, attention scores alone are not considered reliable for causal insights, as they do not directly indicate how perturbing specific inputs influences the output. Recently, AtMan (Deiseroth et al. 2023) introduced a perturbation method that suppresses the attention score of one token at a time to observe the impact of each input on the output prediction. This single-token perturbation approach, however, may overlook interactions between multiple tokens. Consequently, it provides less reliable explanations in scenarios where the model heavily relies on the contextual relationships between multiple tokens, leading to an oversimplification of the model's behavior.

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

To address the challenge of faithful explanations, causal inference theory, encompassing factual and counterfactual reasoning, is often utilized (Pearl 2009). These two approaches aim to identify impactful input information in different ways. Factual reasoning focuses on identifying critical input information that reproduces the original prediction, whereas counterfactual reasoning (Tan et al. 2022; Ali et al. 2023; Kenny et al. 2021) seeks to determine crucial input information that, if absent, would change the prediction. Given their differing assumptions, these reasoning approaches can be employed together as complementary frameworks to generate more faithful explanations. However, prior research has yet to investigate the feasibility of applying factual and counterfactual reasoning within TAG models. To provide faithful explanations for TAG models, we introduce AudioGenX, a perturbation-based explainability method leveraging factual and counterfactual reasoning. Our approach utilizes the latent representation vectors in TAG models to observe the effects of factual and counterfactual perturbations.
These perturbations are applied in the cross-attention layer using a soft mask, enabling the simultaneous perturbation of multiple tokens' attention scores. More importantly, the mask itself serves as an explanation, with its values quantifying the importance of the textual input. We optimize the mask through gradient descent guided by our proposed factual and counterfactual objective functions. To mitigate the high computational cost of calculating gradients for the entire sequential audio, we enhance efficiency by decomposing the explanation target into individual audio tokens. This approach enables us to customize the explanation range of the generated audio interactively, providing comprehensive explanations for the entire audio or more granular explanations for specific segments of interest, depending on user demand. For instance, in Figure 1, (a) provides a comprehensive explanation for the entire audio, indicating a strong relation to vehicle motion. By focusing on a specific interval in (b) and (c), AudioGenX captures the different contexts of each audio segment and delivers contextually accurate explanations accordingly. Extensive experiments demonstrate the faithfulness of our explanations and benchmark their performance against recent techniques using proposed evaluation metrics for audio generation tasks.

Contributions. We summarize our contributions as follows: 1) We propose a faithful explanation method for text-to-audio generation models, grounded in factual and counterfactual reasoning, to quantify the importance of text tokens to the generated audio. 2) We offer a framework that provides both holistic and granular audio explanations based on user requests, enabling tailored insights. 3) We introduce new evaluation metrics for text-to-audio explanations and demonstrate the effectiveness of AudioGenX through extensive experiments compared to existing methods.
4) We present case studies demonstrating how AudioGenX provides valuable insights to support the understanding of model behavior and editing tasks.

Related Work

Text-to-Audio Generation Models. Recent text-to-audio generation models can be categorized into two architectures: Transformer-based (Kreuk et al. 2023; Ziv et al. 2024) and Diffusion-based (Yang et al. 2023; Liu et al. 2023; Schneider et al. 2023). Transformer models, such as AudioGen (Kreuk et al. 2023), employ autoregressive Transformers to predict discrete audio tokens, while MAGNeT (Ziv et al. 2024) enhances efficiency through masked generative modeling in a non-autoregressive scheme. Diffusion-based approaches such as Diffsound (Yang et al. 2023) generate discrete mel-spectrogram tokens, whereas models like AudioLDM (Liu et al. 2023) and Moûsai (Schneider et al. 2023) directly predict continuous mel-spectrograms or waveforms. Despite architectural differences, these models commonly use cross-attention mechanisms, making AudioGenX a model-agnostic explainer for TAG models that use cross-attention in audio generation.

Explainable AI. Explainability involves methods that help to understand the importance of each input token with respect to output predictions. These methods generally fall into two categories: gradient-based methods (Selvaraju et al. 2017; Sundararajan, Taly, and Yan 2017; Nagahisarchoghaei et al. 2023) and perturbation-based methods (Ribeiro, Singh, and Guestrin 2016; Lundberg and Lee 2017). Gradient-based explanation methods trace gradients from the target layers to the predictive value, using the calculated gradients as a measure of importance. While effective, these methods require substantial memory to store the values of each targeted layer. In contrast, perturbation-based methods, such as SHAP (Lundberg and Lee 2017), are more memory-efficient, calculating feature importance by comparing predictions with and without specific features.
Similarly, our method adopts a perturbation-based approach to generate explanations effectively.

Explainability on Audio Processing Models. Existing explainability approaches for audio processing models (Akman and Schuller 2024) have extended generic explanation methods. For instance, one study (Becker et al. 2018) employs Layer-wise Relevance Propagation (LRP) to explain models trained on raw waveforms and spectrograms for spoken-digit and speaker-gender classification. Another study applied DFT-LRP (Frommholz et al. 2023) to audio event detection, assessing the significance of time-frequency components and guiding input representation choices. Similarly, audioLIME (Haunschmid, Manilow, and Widmer 2020) extends LIME (Ribeiro, Singh, and Guestrin 2016) to explain music-tagging models by perturbing audio components derived from source separation. However, since the above methods focus on explaining audio continuously and sequentially, they are not directly applicable to the unique challenges posed by TAG models, which require techniques that address the complex interactions between text inputs and generated audio outputs.

Explainability on Transformers. With the widespread use of Transformers, the demand for explainability has grown. Rollout (Abnar and Zuidema 2020) aggregates attention weights across all layers to track information flow but struggles to integrate cross-attention weights in multi-modal models with differing domain dimensionalities. Another recent work (Chefer, Gur, and Wolf 2021) leverages Layer-wise Relevance Propagation (LRP) (Samek, Wiegand, and Müller 2017) to calculate class-specific relevance scores based on gradients of attention weights in self- and cross-attention layers. Nevertheless, AtMan (Deiseroth et al. 2023) raises the issue of excessive memory usage and introduces a scalable explanation method that employs single-token perturbation to observe the change of loss in the response.
While intuitive and memory-efficient for large-scale models, this method is limited in its ability to account for the interrelationships of input tokens.

Preliminaries

AudioGen (Kreuk et al. 2023), a representative TAG model, consists of three key components: a text encoder (Raffel et al. 2020), an autoregressive Transformer decoder (Vaswani et al. 2017), and an audio decoder (Défossez et al. 2023). The Transformer decoder serves as the core model responsible for generating the audio sequence, while the text encoder processes the input text and the audio decoder post-processes the generated audio token sequence into audio. Given a text prompt, it is converted into a tokenized representation vector, denoted as U = [u_1, ..., u_L], U ∈ R^{L×d_u}, where L denotes the number of textual tokens and d_u represents the dimension of the textual token representation vectors. The generated audio can be expressed in a discrete form, as EnCodec (Défossez et al. 2023) converts the audio into either discrete tokens or continuous token representations. The tokenized audio sequence is denoted as Z = [z_1, ..., z_T], Z ∈ N^{T×d_v}, where T denotes the length of the audio sequence and d_v indicates the number of codebooks. In detail, a codebook is a structured set of discrete audio tokens used in multi-stream audio generation to produce high-quality audio. For more comprehensive information on multi-stream audio generation, we refer to the original AudioGen paper (Kreuk et al. 2023).

For the generation of an audio sequence, the Transformer decoder (Vaswani et al. 2017), denoted as h, generates z_t, the t-th audio token in the sequence, following the formulation h(U, z_{t-1}) = z_t. For brevity, we omit the detailed notation of other components and the top-p or top-k sampling process in the Transformer. Instead, we focus on the attention layers, including cross-attention, which are crucial components of the model, denoted as f.
The computation within these layers is expressed in a simplified form as f(U, z_{t-1}) = e_t, where e_t represents the latent representation vector corresponding to the t-th audio token. In the absence of ground truth and class labels, the latent embedding vector e_t in the audio token space provides information on how perturbation impacts subsequent generations. In particular, the cross-attention layer is essential for fusing textual information with auditory information within the layers f. We denote the cross-attention layers as:

g(Q, K, V) = σ(QK^T / √d_k) V,  (1)

where σ indicates the softmax function, and Q, K, V, and d_k refer to the query, key, values, and the number of vector dimensions in the k-th layer, respectively. In detail, Q is derived from previously generated audio tokens, representing the query information, while K and V correspond to the textual tokens.

The Proposed AudioGenX

AudioGenX addresses the challenge of explaining TAG models, where the goal is to quantify the importance of the textual input corresponding to the generated audio. To achieve this within a sequence-to-sequence framework, we decompose the explanation target, represented as sequential audio, into individual non-sequential audio tokens. Since the output is sequential data, calculating gradients across the entire sequence, from the first to the last token, is computationally expensive and time-consuming. To overcome these issues, we redefine the explanation target as individual audio tokens rather than the entire sequence. This modification enables parallel computation when generating an explanation for each token, significantly speeding up the process. Finally, AudioGenX integrates these individual token-level explanations to provide a comprehensive understanding of the entire audio sequence. An overview of AudioGenX is illustrated in Figure 2.

Definition of Masks as Explanations

We quantify the importance of the t-th audio token z_t within the audio sequence using a mask as the explanation.
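As a concrete reference, the scaled dot-product cross-attention of Eq. (1) can be sketched in plain NumPy. This is a single-head toy version with random values; the actual model applies learned projections and multi-head attention inside a Transformer.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    """g(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  -- Eq. (1).

    Q: (T, d_k) audio-token queries; K: (L, d_k), V: (L, d_v) text keys/values.
    """
    d_k = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (T, L) attention over text tokens
    return attn @ V                         # (T, d_v) fused representation

# toy shapes: 4 audio-token queries attending over 6 text tokens
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out = cross_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Each row of the attention matrix sums to 1, so every audio-token query mixes the text-token values with non-negative weights.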
The soft mask is denoted as M_{U,z_t} ∈ R^{L×1}, where each element m_{u_i,z_t} ∈ M_{U,z_t} represents the importance of the i-th textual token with respect to the t-th audio token z_t. Each value lies in the range [0, 1], where a value close to 1 indicates that the corresponding textual token is highly important for generating the target audio token, while a value closer to 0 indicates lower importance. To produce a soft mask representing the importance of each text token, AudioGenX optimizes an Explainer to predict the mask M_{U,z_t} as the explanation. The Explainer consists of Multi-Layer Perceptrons (MLPs) with a sigmoid and a Gumbel-Softmax (Jang et al. 2017) function to constrain values within the range [0, 1] without additional scaling and to push values close to either 0 or 1, thereby highlighting relatively distinguished contributions. Using the soft mask, we apply perturbation to modify the inner computational steps of the cross-attention layers, altering the attention scores of the given textual input. Consequently, we measure the perturbation effect on the prediction at the layer f(U, z_{t-1}) = e_t, observing how the latent representation vector e_t for the audio token z_t changes under these perturbations. In the following sections, we detail how we optimize the Explainer to predict the mask as an explanation based on both factual and counterfactual reasoning.

Formulating Factual Explanations

Factual reasoning (Tan et al. 2022; Ali et al. 2023; Kenny et al. 2021) aims to find sufficient input that can approximately reproduce the original prediction. To quantify the sufficiency of textual tokens, we employ a perturbation-based method using the soft mask, interleaving the computation to measure the impact of changes. Specifically, we mask out attention scores in the cross-attention layers where textual information is fed into the TAG model.
We formulate the perturbation in factual reasoning as:

g(Q, K, V, M_{U,z_t}) = (σ(QK^T / √d_k) ⊙ M_{U,z_t}) V,  (2)

[Figure 2: (a) The process by which AudioGen generates audio. (b) AudioGenX's procedure for generating and applying explanations, with the Explainer in the green box. (c) The method for calculating and applying the loss in AudioGenX.]

where σ denotes the softmax activation function and the mask M_{U,z_t} controls the amount of information corresponding to each text token. When the mask value m_{u_i,z_t} approaches 0, the attention score is suppressed, meaning the information corresponding to the textual token is not fully propagated to the subsequent layer. Conversely, as the mask value approaches 1, the original value is fully preserved. To distinguish this process from the original layer f(U, z_{t-1}), we denote the layer applying perturbation with the factual mask as f(U, z_{t-1}, M_{U,z_t}). When the mask sufficiently serves as a factual explanation, the perturbed output remains approximately the same as the original prediction. To evaluate the impact of perturbation, we measure the resulting changes in the latent representation vector within the audio token space. Since the latent vector encodes rich and implicit information, we expect that two vectors close to each other indicate a similar auditory meaning, which is likely to result in similar audio generation.
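A minimal sketch of the masked cross-attention of Eq. (2), assuming a single head and treating the mask as a length-L vector broadcast over the attention rows. The real implementation operates inside AudioGen's Transformer layers; shapes and values here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_cross_attention(Q, K, V, m):
    """Factual perturbation of Eq. (2): scale each text token's attention
    column by its mask value m_i in [0, 1] before mixing the values.

    m: (L,) soft mask over the L text tokens.
    Passing 1 - m instead yields the counterfactual perturbation of Eq. (4).
    """
    d_k = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (T, L) attention over text tokens
    return (attn * m[None, :]) @ V          # suppress masked text tokens

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
full = masked_cross_attention(Q, K, V, np.ones(6))    # all-ones mask: ordinary attention
zeroed = masked_cross_attention(Q, K, V, np.zeros(6)) # all-zeros mask: text fully blocked
print(np.allclose(zeroed, 0.0))  # True: no text information propagates
```

With an all-ones mask the perturbed layer reduces exactly to the unperturbed attention, matching the claim that mask values near 1 preserve the original computation.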
By leveraging this vector similarity, we can effectively measure the influence of perturbation and formulate the objective function for the Explainer as:

L_F = −cos(f(U, z_{t-1}), f(U, z_{t-1}, M_{U,z_t})),  (3)

where cos refers to cosine similarity, which measures how similar the factual result f(U, z_{t-1}, M_{U,z_t}) is to the original prediction f(U, z_{t-1}) in the audio token space. Since the objective function is the negative cosine similarity, minimizing the loss corresponds to maximizing the similarity. Hence, following this objective, the Explainer generates the factual explanation mask, ensuring that the two representations are as close as possible in the audio token space.

Formulating Counterfactual Explanations

Counterfactual reasoning (Tan et al. 2022; Ali et al. 2023; Kenny et al. 2021) aims to identify necessary inputs that, when perturbed or removed, significantly alter the original prediction. This perturbation operates in the opposite direction of factual explanations, removing the important input to observe the counterfactual result. We formulate the perturbation method in counterfactual reasoning as:

g(Q, K, V, 1 − M_{U,z_t}) = (σ(QK^T / √d_k) ⊙ (1 − M_{U,z_t})) V,  (4)

where 1 is a vector of ones with the same shape as M_{U,z_t}, so that 1 − M_{U,z_t} inverts the importance of the corresponding textual tokens. Consequently, the more important a textual token is, the more its attention score is suppressed, in proportion to its importance. This perturbation operates under a counterfactual "What-If" scenario (Tan et al. 2022; Ali et al. 2023; Kenny et al. 2021): what happens if an important textual token does not exist? After applying the perturbation in Equation (4), the counterfactual result is observed as f(U, z_{t-1}, 1 − M_{U,z_t}). If the counterfactual result significantly differs from the original prediction, it indicates that the counterfactual mask is necessary to explain the original prediction.
Conversely, if the change is trivial, the counterfactual mask is unnecessary for explaining the causal relationship with the prediction. Generally, counterfactual explanations in supervised settings aim to find the important inputs that change the prediction with minimal perturbation. However, no class labels or guidance are available in our task of audio generation. Instead, we measure the change of meaning in the latent space, leveraging the cosine similarity function after counterfactual perturbation. Thus, the counterfactual explanation objective function is formulated as:

L_CF = cos(f(U, z_{t-1}), f(U, z_{t-1}, 1 − M_{U,z_t})),  (5)

where cos measures how dissimilar the counterfactual result f(U, z_{t-1}, 1 − M_{U,z_t}) is to the original prediction in the latent space. Minimizing this objective drives the cosine similarity down. Consequently, the Explainer generates the counterfactual explanation mask so that the two representations are as far apart as possible in the audio token space after counterfactual perturbation.

Objective Function for AudioGenX

Along with the factual and counterfactual explanation objectives, we add regularization terms to generate the explanation mask in a simple and efficient manner.

Algorithm 1: AudioGenX
Input: textual token representation vector U, previously generated audio token vector z_{t-1}, Transformer model f, audio generation length T, number of epochs K, learning rate λ, regularization coefficients α and β
for t = 1 to T do
    Initialize the Explainer with random parameters θ
    for epoch = 1 to K do
        M_{U,z_t} = Explainer(U, z_{t-1})
        L = L_F + L_CF + α·L1(M_{U,z_t}) + β·L2(M_{U,z_t})
        θ := θ − λ∇_θ L
    end for
    M_{U,z_t} = Explainer(U, z_{t-1})
end for
Return M_{U,z} = (1/T) Σ_{t=1}^{T} M_{U,z_t}

Therefore, we incorporate this additional regularization in our final objective function for the Explainer, which is formulated as:

L = L_F + L_CF + α·L1(M_{U,z_t}) + β·L2(M_{U,z_t}).  (6)
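The combined objective of Eqs. (3), (5), and (6) can be illustrated with plain NumPy on precomputed latent vectors. The function name `audiogenx_loss`, the stand-alone cosine helper, and the toy latent vectors are illustrative assumptions, not the paper's code; in practice the latents come from the perturbed cross-attention passes and the loss is backpropagated through the Explainer.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two latent vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def audiogenx_loss(e_orig, e_fact, e_cf, mask, alpha=1e-3, beta=1e-1):
    """Combined objective sketch of Eqs. (3), (5), (6).

    e_orig : latent vector f(U, z_{t-1}) from the unperturbed pass
    e_fact : f(U, z_{t-1}, M) under the factual mask
    e_cf   : f(U, z_{t-1}, 1 - M) under the counterfactual mask
    """
    L_F = -cos(e_orig, e_fact)   # Eq. (3): keep the factual output close
    L_CF = cos(e_orig, e_cf)     # Eq. (5): push the counterfactual output away
    reg = alpha * np.abs(mask).sum() + beta * (mask ** 2).sum()  # L1/L2 sparsity
    return L_F + L_CF + reg      # Eq. (6)

# a mask that preserves the factual latent and destroys the counterfactual
# one yields a low loss; the reverse situation yields a high loss
e = np.array([1.0, 0.0, 0.0])
good = audiogenx_loss(e, e, np.array([-1.0, 0.0, 0.0]), np.zeros(4))
bad = audiogenx_loss(e, np.array([-1.0, 0.0, 0.0]), e, np.zeros(4))
print(good < bad)  # True
```

Minimizing this quantity simultaneously rewards factual similarity, counterfactual dissimilarity, and a sparse mask, which is exactly the trade-off the regularized objective encodes.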
Here, L1 and L2 represent the L1-norm and L2-norm, respectively, used as regularization terms to minimize the mask size. This prevents a trivial solution in which the Explainer generates an explanation mask assigning equal importance to all values. At the same time, adhering to Occam's Razor principle, we favor simpler and more effective explanations (Tan et al. 2022; Blumer et al. 1987). Hence, according to the objective function in Equation (6), AudioGenX optimizes the Explainer to generate faithful explanation masks at the audio-token level.

Providing Audio-Level Explanations

In this section, we aggregate audio token-level explanations to provide a comprehensive understanding of the entire audio sequence. The aggregation is performed by averaging the mask values across all audio tokens as follows:

M_{U,z} = (1/T) Σ_{t=1}^{T} M_{U,z_t},  (7)

where t refers to the step and T represents the total length of the generated audio. Additionally, it is possible to focus on a specific interval of interest within the audio, defined between a starting step s and an ending step n. This is denoted as M_{U,z} = (1/(|n − s| + 1)) Σ_{t=s}^{n} M_{U,z_t}, which provides granular explanations based on the user's request. This flexible approach enables users to discover patterns within specific intervals, as AudioGenX can effectively capture and explain auditory content in targeted regions of the audio sequence.

Experimental Setup

Dataset. We use AudioCaps (Kim et al. 2019) as the source of textual prompts. For each prompt, we generate a 5-second audio clip using AudioGen, pairing each prompt with its corresponding generated audio. For hyperparameter tuning, we select 100 validation captions, while the test dataset consists of 1,000 randomly selected captions.

Evaluation Metrics. We evaluate explanations based on two metrics: Fidelity and KL divergence, both derived from the classification probabilities of a pre-trained audio classifier. Specifically, we utilize PaSST (Cai et al.
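The averaging in Eq. (7) and its interval-restricted variant amount to a mean over token-level masks. A small sketch, where `aggregate_masks` is an illustrative helper name and steps are 0-indexed for simplicity:

```python
import numpy as np

def aggregate_masks(masks, s=None, n=None):
    """Average token-level masks into an audio-level explanation (Eq. 7).

    masks: (T, L) array -- one L-dimensional text-token mask per generated
    audio token. If steps s..n (inclusive) are given, only that interval is
    averaged, yielding a granular interval explanation.
    """
    if s is None:
        return masks.mean(axis=0)      # whole-audio explanation
    return masks[s:n + 1].mean(axis=0) # interval explanation over |n - s| + 1 steps

# 5 audio tokens over 3 text tokens: the first 2 steps attend to token 0,
# the remaining 3 steps to token 1
masks = np.array([[1.0, 0.0, 0.0]] * 2 + [[0.0, 1.0, 0.0]] * 3)
print(aggregate_masks(masks))        # whole audio: token 1 dominates slightly
print(aggregate_masks(masks, 0, 1))  # early interval: token 0 dominates
```

The toy example mirrors Figure 1: the whole-audio mask blends both concepts, while the interval mask isolates the concept active in that segment.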
2022), a classifier trained on the AudioSet dataset, which is also used in the evaluation of AudioGen. Its classification probabilities are likely to provide meaningful insights into the relationship between textual prompts and generated audio. Fidelity (Yuan et al. 2021; Ali et al. 2023), a core evaluation metric in the field of XAI, measures the change in top-1 label prediction probabilities of the generated audio after applying the factual and counterfactual explanation masks, denoted as Fid_F and Fid_CF, respectively. In addition, KL divergence (Kilgour et al. 2018), originally used to evaluate audio generative models (Kreuk et al. 2023; Yang et al. 2023; Huang et al. 2023), measures the difference in label distributions between generated and reference audio. For explanation evaluation, we introduce new metrics KL_F and KL_CF, which measure the conceptual change in the generated audio after applying explanation masks in factual and counterfactual reasoning, respectively. In factual evaluation, the generated audio should closely match the original audio, making lower values of Fid_F and KL_F desirable. In contrast, in counterfactual evaluation, higher values of Fid_CF and KL_CF indicate a more effective explanation. Additionally, we include the average mask size as part of our evaluation.

Baselines. We compare our method with five baselines. Random-Mask is a mask with randomly assigned values ranging between 0 and 1. Grad-CAM (Selvaraju et al. 2017) is evaluated in two variations: Grad-CAM-a and Grad-CAM-e. Specifically, Grad-CAM-a computes the gradients of the latent representation vector e_t of the t-th audio token with respect to the generated audio token z_t, while Grad-CAM-e computes the gradients of the last cross-attention map with respect to z_t. We also include AtMan (Deiseroth et al. 2023) and the method proposed by Chefer et al. (Chefer, Gur, and Wolf 2021) as baselines.

Experimental Setting.
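To make the evaluation direction concrete, here is a sketch of how a Fidelity-style score could be computed from classifier probabilities. The `fidelity` helper follows the common top-1 probability-change definition (Yuan et al. 2021); the paper's exact formulation may differ, and the probability vectors are invented toy values.

```python
import numpy as np

def fidelity(p_orig, p_pert):
    """Change in the classifier's top-1 label probability after applying an
    explanation mask (illustrative top-1 probability-change definition)."""
    top1 = int(np.argmax(p_orig))
    return float(p_orig[top1] - p_pert[top1])

# class probabilities for the original audio and two masked regenerations
p_orig = np.array([0.7, 0.2, 0.1])
fid_f = fidelity(p_orig, np.array([0.65, 0.25, 0.10]))   # factual mask: concept kept
fid_cf = fidelity(p_orig, np.array([0.10, 0.60, 0.30]))  # counterfactual mask: concept removed
print(fid_f < fid_cf)  # True: low Fid_F and high Fid_CF mark a good explanation
```

This matches the stated criteria: the factual regeneration should leave the top-1 probability nearly unchanged, while the counterfactual one should shift it substantially.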
The Explainer model includes a linear layer that reduces the text token embeddings from 1536 to 512 dimensions, followed by a PReLU activation function. The 512-dimensional text token embeddings are then mapped to a single value through another linear layer and a sigmoid function, producing a value in the [0, 1] range. A Gumbel-Softmax function is subsequently applied to push values closer to 0 or 1, representing the importance of each text token. The Explainer is trained for 50 epochs with a learning rate of 10^-3 using the Adam optimizer. Hyperparameters are set as α = 1 × 10^-3 and β = 1 × 10^-1 as coefficients for the explanation objective function. Hyperparameter sensitivity analysis and detailed experimental settings are both provided in the Appendix. Our code is available at https://github.com/hjkng/audiogenX.

Method               | Fid_F         | Fid_CF        | KL_F          | KL_CF         | Size
N_audio = 5          | 0.128 ± 0.004 | -             | 1.318 ± 0.030 | -             | -
Random-Mask          | 0.196 ± 0.004 | 0.195 ± 0.006 | 1.884 ± 0.044 | 1.932 ± 0.046 | 0.500
Grad-CAM-e           | 0.204 ± 0.006 | 0.235 ± 0.008 | 1.858 ± 0.034 | 2.457 ± 0.041 | 0.422
Grad-CAM-a           | 0.240 ± 0.006 | 0.192 ± 0.010 | 2.285 ± 0.077 | 1.951 ± 0.075 | 0.406
AtMan                | 0.195 ± 0.008 | 0.222 ± 0.008 | 2.010 ± 0.049 | 2.198 ± 0.048 | 0.497
Chefer et al.        | 0.198 ± 0.003 | 0.229 ± 0.004 | 1.899 ± 0.025 | 2.348 ± 0.040 | 0.441
AudioGenX w/ Eq. (3) | 0.145 ± 0.004 | 0.360 ± 0.005 | 1.542 ± 0.024 | 3.658 ± 0.061 | 0.360
AudioGenX w/ Eq. (5) | 0.143 ± 0.004 | 0.385 ± 0.005 | 1.514 ± 0.043 | 3.977 ± 0.044 | 0.385
AudioGenX w/ Eq. (7) | 0.137 ± 0.005 | 0.402 ± 0.005 | 1.418 ± 0.043 | 4.183 ± 0.073 | 0.455
AudioGenX            | 0.132 ± 0.004 | 0.405 ± 0.004 | 1.416 ± 0.029 | 4.259 ± 0.039 | 0.455

Table 1: Evaluation of explanations generated by each method using factual and counterfactual reasoning. Five audio samples are generated and evaluated with different seeds based on the obtained explanations. The best results are highlighted in bold.

[Figure 3: Visualization of token-importance scores produced by AudioGenX and other methods.]
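The Explainer forward pass described above can be sketched as follows. The weights are random stand-ins, and the binary Gumbel-sigmoid used to sharpen the sigmoid outputs is one common relaxation consistent with the paper's Gumbel-Softmax description, not the paper's exact code.

```python
import numpy as np

rng = np.random.default_rng(0)

def prelu(x, a=0.25):
    return np.where(x > 0, x, a * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def explainer_forward(U, W1, b1, W2, b2, tau=0.5):
    """Explainer sketch: 1536 -> 512 linear, PReLU, 512 -> 1 linear, sigmoid,
    then a Gumbel-sigmoid relaxation to push values toward 0 or 1.

    U: (L, 1536) text-token embeddings; weights are illustrative stand-ins.
    """
    h = prelu(U @ W1 + b1)          # (L, 512) hidden representation
    logits = (h @ W2 + b2).ravel()  # (L,) one logit per text token
    p = sigmoid(logits)             # soft importances in (0, 1)
    # Gumbel-sigmoid: add Gumbel noise to the logit and sharpen with temperature tau
    g = rng.gumbel(size=p.shape) - rng.gumbel(size=p.shape)
    mask = sigmoid((np.log(p + 1e-9) - np.log(1 - p + 1e-9) + g) / tau)
    return mask                     # (L,) one importance value per token

L = 6
U = rng.normal(size=(L, 1536))
W1 = rng.normal(scale=0.02, size=(1536, 512)); b1 = np.zeros(512)
W2 = rng.normal(scale=0.02, size=(512, 1));    b2 = np.zeros(1)
mask = explainer_forward(U, W1, b1, W2, b2)
print(mask.shape, bool(((mask >= 0) & (mask <= 1)).all()))
```

Lowering `tau` drives the sampled mask values closer to hard 0/1 decisions while keeping the whole pipeline differentiable, which is why the relaxation is used during training.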
Experimental Results

RQ 1: Does AudioGenX Generate Faithful Explanations?

We evaluate the explanations generated by AudioGenX based on factual and counterfactual reasoning, as presented in Table 1. AudioGenX achieves the best performance across the metrics Fid_F, Fid_CF, KL_F, and KL_CF, while maintaining a small mask size (Size), demonstrating that our explanations are both simple and effective. The baseline denoted as N_audio = 5 generates audio conditioned on the same textual input five times to observe the inherent variance, serving as the lower bound for Fid_F and KL_F. AudioGenX's factual audio nearly reaches this lower bound, indicating high performance. Furthermore, significant changes in Fid_CF and KL_CF under counterfactual perturbations confirm that the explanations are both sufficient and necessary. AudioGenX with both factual and counterfactual losses in Eq. (6) outperforms the variants AudioGenX w/ Eq. (3) and AudioGenX w/ Eq. (5), which apply only the factual or counterfactual loss with a regularization term. This indicates that the two losses complement each other, enhancing overall performance. Furthermore, we evaluate AudioGenX w/ Eq. (7) using an averaged explanation mask, showing the robustness of explainability in describing the entire audio. In contrast, other baselines fail to generate meaningful counterfactual audio, lacking the optimization properties needed to enforce counterfactual explanations.

The strong performance highlights the effectiveness of leveraging latent embedding vectors to generate explanations. While most baselines are designed to explain supervised learning models, they rely on vectors that represent the probability distribution of the final audio token. This approach, however, does not align well with the inference process of audio generation models.
In extreme cases, such as top-k sampling (k = 250), the 250-th most likely audio token could be sampled, leading to significant discrepancies between the gradients or probability-related information and the token actually predicted by the model. In contrast, our approach avoids dependency on the sampling process, allowing the model to produce more faithful explanations.

RQ 2: How Well Do the Explanations from AudioGenX Reflect the Generated Audio?

We visualize the explanations generated by AudioGenX and other baselines, as shown in Figure 3. AudioGenX demonstrates a clear advantage in focusing on key audio elements. Unlike other baselines, which often assign relatively high importance scores to less important tokens like "A" and "with", AudioGenX consistently assigns higher importance scores to crucial tokens such as "ticktocks" and "music". For instance, AudioGenX assigns a notably high importance score of 0.96 to "music", emphasizing its ability to focus on significant input tokens. In contrast, other models like Grad-CAM-e and AtMan distribute importance more broadly, including less relevant tokens. These results show that AudioGenX consistently provides faithful explanations, aligning the generated audio with the essential components of the input text.

Furthermore, when generating audio from a prompt containing multiple concepts, some words may be less prominently reflected. In such cases, AudioGenX provides adequate explanations for each specific audio clip, indicating whether each word from the prompt has been incorporated into the generated audio. As illustrated in Figure 4, the difference between the two audio clips is that bird sounds are present in Figure 4-(a) but absent in Figure 4-(b). AudioGenX effectively describes the audio by assigning high importance scores of 0.98 and 0.99 to the token "Water", which is the primary sound in both clips.
Figure 4: Explanations generated by Audio Gen X for two audios created from a single prompt, "Water falls and splashes, birds singing". (a) includes bird sounds ("singing": 0.11, "splash": 0.76), while (b) does not ("singing": 0.07, "splash": 0.38).

Figure 5: Explanations generated from negated prompts: (a) single negation, "please generate a sound of rain without thunder" ("please": 0.14, "generate": 0.07, "without": 0.07, "thunder": 0.98); (b) double negation, "please generate a sound of rain without no thunder" ("please": 0.04, "generate": 0.03, "without": 0.22, "thunder": 0.97).

Audio Gen X assigns a score of 0.54 to "birds" for the first audio, while it assigns a score of 0.14 for the second, accurately reflecting the different audio characteristics in each case. These results show that Audio Gen X provides explanations that are well suited to the corresponding audio. Such explanations also serve as valuable insights for editing generated audio to better align with user intention.

RQ 3: How Can Explanations Help Understand Audio Gen Behavior?

We explore the output patterns of Audio Gen using the explanations generated by Audio Gen X. First, we investigate whether Audio Gen can effectively handle sentences containing negations and double negations, as shown in Figure 5. The figure presents explanations of the audios generated in response to prompts containing "without thunder" and "without no thunder". In both cases, the generated audio includes the sound of thunder along with the rain. Using Audio Gen X, we observe that "without" and "without no" have lower importance than "thunder" in the explanations. We hypothesize that this occurs because the training dataset lacks sufficient examples of negation and double negation; an examination of the AudioCaps dataset confirms that such cases are scarce. Additionally, by aggregating tokens from the explanations, we identify the top and bottom 50 tokens, listed in Table 3 in the Appendix.
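The token-level aggregation used to surface the top and bottom tokens could be sketched as follows. The dictionary format, function name, and example scores are illustrative assumptions, not the paper's data or code.

```python
from collections import defaultdict

def rank_tokens(explanations, top_n=5):
    """Average each token's importance across many explanations and
    return the top-N and bottom-N tokens. `explanations` is a list of
    {token: importance} dicts, one per generated audio."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for exp in explanations:
        for tok, score in exp.items():
            sums[tok] += score
            counts[tok] += 1
    avg = {tok: sums[tok] / counts[tok] for tok in sums}
    ranked = sorted(avg, key=avg.get, reverse=True)
    return ranked[:top_n], ranked[-top_n:]

# Toy explanations echoing the observed pattern: content nouns such as
# "thunder" score high; descriptors like "distant" and sequential
# expressions like "before" score low.
explanations = [
    {"thunder": 0.98, "rain": 0.71, "distant": 0.08, "before": 0.05},
    {"thunder": 0.95, "dog": 0.80, "distant": 0.12},
]
top, bottom = rank_tokens(explanations, top_n=2)
# top    -> ["thunder", "dog"]
# bottom -> ["distant", "before"]
```

Running such an aggregation over a full evaluation set is what yields the top/bottom-50 token lists reported in the Appendix.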
Tokens with high importance are predominantly nouns, such as "thunder", while those with low importance include sound descriptors such as "distant" and sequential expressions such as "before". Such analyses could be used to debug TAG models or to identify potential inherent biases in their behavior.

RQ 4: Does Audio Gen X Generate Explanations Efficiently?

We evaluate the efficiency of explanation methods based on the average time and total GPU memory usage per explanation, as shown in Table 2.

Table 2: Efficiency analysis of Audio Gen X and other baseline methods. The best results are highlighted in bold.

Method                   Memory (MB)   Time (s)
Grad-CAM-e                  8641.306     49.038
Grad-CAM-a                 41655.848     62.276
At Man                      5081.957      7.295
Chefer et al.              41684.969     52.166
Audio Gen X w/ Eq. (3)     11980.894     36.639
Audio Gen X w/ Eq. (5)     11981.114     37.373
Audio Gen X w/ Eq. (7)     12001.931     63.198
Audio Gen X                12001.931     63.198

In terms of GPU memory efficiency, the methods rank as follows: At Man, Grad-CAM-e, Audio Gen X, Grad-CAM-a, and Chefer et al. In terms of time efficiency, the order is At Man, Grad-CAM-e, Chefer et al., Grad-CAM-a, and Audio Gen X. Although At Man is the most efficient, its explanation quality remains subpar due to its simplistic approach. Grad-CAM-e is more memory-efficient than Grad-CAM-a and Chefer et al. because it tracks a shallower layer. While Audio Gen X requires additional computation time to train explanation masks, it remains memory-efficient by reducing GPU storage, and it operates with O(Lk) complexity, ensuring linear scalability for large-scale tasks.

Conclusion

Audio Gen X quantifies the importance of textual tokens with respect to the generated audio by leveraging both factual and counterfactual reasoning frameworks. This approach enables the generation of faithful explanations, providing actionable insights for users editing audio and assisting developers in debugging. Consequently, Audio Gen X enhances the transparency and trustworthiness of TAG models.
For comprehensive details on experimental settings, additional results, and hyperparameter sensitivity analyses, please refer to the Appendix at https://learndatalab.github.io/audiogenx.html.

Acknowledgements

This work was supported by NCSOFT, the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant, and the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (RS-2019-II190421, IITP-2025-RS-2020-II201821, RS-2024-00438686, RS-2024-00436936, RS-2024-00360227, RS-2023-0022544, NRF-2021M3H4A1A02056037, RS-2024-00448809). This research was also partially supported by the Culture, Sports, and Tourism R&D Program through the Korea Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism in 2024 (RS-2024-00333068, RS-2024-00348469 (25%)).

References

Abnar, S.; and Zuidema, W. 2020. Quantifying attention flow in transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4190–4197. Association for Computational Linguistics.
Akman, A.; and Schuller, B. W. 2024. Audio Explainable Artificial Intelligence: A Review. Intelligent Computing, 2: 0074.
Ali, S.; Abuhmed, T.; El-Sappagh, S.; Muhammad, K.; Alonso-Moral, J. M.; Confalonieri, R.; Guidotti, R.; Del Ser, J.; Díaz-Rodríguez, N.; and Herrera, F. 2023. Explainable Artificial Intelligence (XAI): What we know and what is left to attain Trustworthy Artificial Intelligence. Information Fusion, 99: 101805.
Becker, S.; Ackermann, M.; Lapuschkin, S.; Müller, K.-R.; and Samek, W. 2018. Interpreting and explaining deep neural networks for classification of audio signals. arXiv preprint arXiv:1807.03418.
Blumer, A.; Ehrenfeucht, A.; Haussler, D.; and Warmuth, M. K. 1987. Occam's razor. Information Processing Letters, 24(6): 377–380.
Cai, J.; Fan, J.; Guo, W.; Wang, S.; Zhang, Y.; and Zhang, Z. 2022. Efficient deep embedded subspace clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1–10.
Chefer, H.; Gur, S.; and Wolf, L. 2021. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 397–406.
Défossez, A.; Copet, J.; Synnaeve, G.; and Adi, Y. 2023. High Fidelity Neural Audio Compression. Transactions on Machine Learning Research.
Deiseroth, B.; Deb, M.; Weinbach, S.; Brack, M.; Schramowski, P.; and Kersting, K. 2023. At Man: Understanding transformer predictions through memory efficient attention manipulation. Advances in Neural Information Processing Systems, 36.
Frommholz, A.; Seipel, F.; Lapuschkin, S.; Samek, W.; and Vielhaben, J. 2023. XAI-based Comparison of Input Representations for Audio Event Classification. In Proceedings of the International Conference on Content-Based Multimedia Indexing, 126–132.
Haunschmid, V.; Manilow, E.; and Widmer, G. 2020. audioLIME: Listenable Explanations Using Source Separation. In Proceedings of the International Workshop on Machine Learning and Music. arXiv:2008.00582.
Huang, R.; Huang, J.; Yang, D.; Ren, Y.; Liu, L.; Li, M.; Ye, Z.; Liu, J.; Yin, X.; and Zhao, Z. 2023. Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. In Proceedings of the International Conference on Machine Learning, 13916–13932.
Jang, E.; Gu, S.; and Poole, B. 2017. Categorical reparameterization with Gumbel-Softmax. In Proceedings of the International Conference on Learning Representations.
Kenny, E. M.; Delaney, E. D.; Greene, D.; and Keane, M. T. 2021. Post-hoc explanation options for XAI in deep learning: The Insight Centre for Data Analytics perspective. In Proceedings of the ICPR International Workshops and Challenges, 20–34.
Kilgour, K.; Zuluaga, M.; Roblek, D.; and Sharifi, M. 2018. Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms. arXiv preprint arXiv:1812.08466.
Kim, C. D.; Kim, B.; Lee, H.; and Kim, G. 2019. AudioCaps: Generating captions for audios in the wild. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 119–132.
Kreuk, F.; Synnaeve, G.; Polyak, A.; Singer, U.; Défossez, A.; Copet, J.; Parikh, D.; Taigman, Y.; and Adi, Y. 2023. AudioGen: Textually guided audio generation. In Proceedings of the International Conference on Learning Representations.
Liu, H.; Chen, Z.; Yuan, Y.; Mei, X.; Liu, X.; Mandic, D.; Wang, W.; and Plumbley, M. D. 2023. AudioLDM: Text-to-audio generation with latent diffusion models. In Proceedings of the International Conference on Machine Learning, 21450–21474.
Lundberg, S. M.; and Lee, S.-I. 2017. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.
Nagahisarchoghaei, M.; Nur, N.; Cummins, L.; Nur, N.; Karimi, M. M.; Nandanwar, S.; Bhattacharyya, S.; and Rahimi, S. 2023. An empirical survey on explainable AI technologies: Recent trends, use-cases, and categories from technical and application perspectives. Electronics, 12(5): 1092.
Pearl, J. 2009. Causal inference in statistics: An overview. Statistics Surveys, 3: 96–146.
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140): 1–67.
Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144.
Samek, W.; Wiegand, T.; and Müller, K.-R. 2017. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. arXiv preprint arXiv:1708.08296.
Schneider, F.; Kamal, O.; Jin, Z.; and Schölkopf, B. 2023. Moûsai: Text-to-music generation with long-context latent diffusion. arXiv preprint arXiv:2301.11757.
Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, 618–626.
Sundararajan, M.; Taly, A.; and Yan, Q. 2017. Axiomatic attribution for deep networks. In Proceedings of the International Conference on Machine Learning, 3319–3328.
Tan, J.; Geng, S.; Fu, Z.; Ge, Y.; Xu, S.; Li, Y.; and Zhang, Y. 2022. Learning and evaluating graph neural network explanations based on counterfactual and factual reasoning. In Proceedings of the ACM Web Conference, 1018–1027.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
Yang, D.; Yu, J.; Wang, H.; Wang, W.; Weng, C.; Zou, Y.; and Yu, D. 2023. Diffsound: Discrete diffusion model for text-to-sound generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31: 1720–1733.
Yuan, H.; Yu, H.; Wang, J.; Li, K.; and Ji, S. 2021. On explainability of graph neural networks via subgraph explorations. In Proceedings of the International Conference on Machine Learning, 12241–12252.
Ziv, A.; Gat, I.; Lan, G. L.; Remez, T.; Kreuk, F.; Défossez, A.; Copet, J.; Synnaeve, G.; and Adi, Y. 2024. Masked Audio Generation using a Single Non-Autoregressive Transformer. In Proceedings of the International Conference on Learning Representations.