Published as a conference paper at ICLR 2023

DEFENDING AGAINST ADVERSARIAL AUDIO VIA DIFFUSION MODEL

Shutong Wu1,2  Jiongxiao Wang1  Wei Ping3  Weili Nie3  Chaowei Xiao1
1Arizona State University  2Shanghai Jiao Tong University  3NVIDIA
(Work done during the internship at ASU.)

ABSTRACT

Deep learning models have been widely used in commercial acoustic systems in recent years. However, adversarial audio examples can cause abnormal behaviors for those acoustic systems while being hard for humans to perceive. Various methods, such as transformation-based defenses and adversarial training, have been proposed to protect acoustic systems from adversarial attacks, but they are less effective against adaptive attacks. Furthermore, directly applying methods from the image domain can lead to suboptimal results because of the unique properties of audio data. In this paper, we propose an adversarial purification-based defense pipeline, AudioPure, for acoustic systems via off-the-shelf diffusion models. Taking advantage of the strong generation ability of diffusion models, AudioPure first adds a small amount of noise to the adversarial audio and then runs the reverse sampling step to purify the noisy audio and recover clean audio. AudioPure is a plug-and-play method that can be directly applied to any pretrained classifier without any fine-tuning or re-training. We conduct extensive experiments on the speech command recognition task to evaluate the robustness of AudioPure. Our method is effective against diverse adversarial attacks (e.g., bounded by the L2- or L∞-norm). It outperforms the existing methods under both strong adaptive white-box and black-box attacks bounded by the L2- or L∞-norm (up to +20% in robust accuracy). Besides, we also evaluate the certified robustness for perturbations bounded by the L2-norm via randomized smoothing. Our pipeline achieves a higher certified accuracy than baselines. Code is available at https://github.com/cychomatica/AudioPure.

1 INTRODUCTION

Deep neural networks (DNNs) have demonstrated great successes in different tasks in the audio domain, such as speech command recognition, keyword spotting, speaker identification, and automatic speech recognition. Acoustic systems built on DNNs (Amodei et al., 2016; Shen et al., 2019) are applied in safety-critical applications ranging from making phone calls to controlling household security systems. Although DNN-based models have exhibited significant performance improvements, extensive studies have shown that they are vulnerable to adversarial examples (Szegedy et al., 2014; Carlini & Wagner, 2018; Qin et al., 2019; Du et al., 2020; Abdullah et al., 2021; Chen et al., 2021a), where attackers add imperceptible and carefully crafted perturbations to the original audio to mislead the system into incorrect predictions. Thus, it becomes crucial to design robust DNN-based acoustic systems against adversarial examples. To address this, existing works (e.g., Rajaratnam & Alshemali, 2018; Yang et al., 2019) have tried to leverage the temporal dependency property of audio to defend against adversarial examples. They apply time-domain and frequency-domain transformations to the adversarial examples to improve robustness. Although they can alleviate this problem to some extent, they are still vulnerable to strong adaptive attacks where the attacker obtains full knowledge of the whole acoustic system (Tramer et al., 2020).
Another way to enhance the robustness against adversarial examples is adversarial training (Goodfellow et al., 2015; Madry et al., 2018), in which adversarial perturbations are added during the training stage. Although it has been acknowledged as the most effective defense, the training process requires expensive computational resources, and the model remains vulnerable to other types of adversarial examples that are not similar to those used in the training process (Tramer & Boneh, 2019). Adversarial purification (Yoon et al., 2021; Shi et al., 2021; Nie et al., 2022) is another family of defense methods that utilizes generative models to purify the adversarial perturbations of the input examples before they are fed into neural networks. The key to such methods is to design an effective generative model for purification. Recently, diffusion models have been shown to be the state-of-the-art models for image (Song & Ermon, 2019; Ho et al., 2020; Nichol & Dhariwal, 2021; Dhariwal & Nichol, 2021) and audio synthesis (Kong et al., 2021; Chen et al., 2021b), which motivates the community to use them for purification. In particular, in the image domain, DiffPure (Nie et al., 2022) applies diffusion models as purifiers and obtains good performance in terms of both clean and robust accuracy on various image classification tasks. Since such methods do not require training the model with pre-defined adversarial examples, they can generalize to diverse threats. Given the significant progress of diffusion models in the image domain, it motivates us to ask: is it possible to obtain similar success in the audio domain?

Unlike images, audio signals have some unique properties. There are different choices of audio representations, including raw waveforms and various types of time-frequency representations (e.g., Mel spectrogram, MFCC). When designing an acoustic system, particular audio representations may be selected as the target features, and defenses that work well on some features may perform poorly on others. In addition, one may think of treating the 2-D time-frequency representation (i.e., the spectrogram) as an image, where the frequency axis is set as height and the time axis as width, and then directly applying the successful DiffPure (Nie et al., 2022) from the image domain to the spectrogram. Despite its simplicity, there are two major issues: i) the acoustic system can take audio with variable time duration as input, while the underlying diffusion model within DiffPure can only handle inputs with fixed width and height; ii) even if we apply it in a fixed-length segment-wise manner for the time being, it still achieves suboptimal results, as we will demonstrate in this work. These unique issues pose a new challenge of designing and evaluating defense systems in the audio domain.

In this work, we aim to defend against diverse unseen adversarial examples without adversarial training. We propose a plug-and-play purification pipeline named AudioPure based on a pre-trained diffusion model by leveraging the unique properties of audio. Specifically, our model consists of two main components: (1) a waveform-based diffusion model and (2) a classifier. It takes the audio waveform as input and leverages the diffusion model to purify the adversarial audio perturbations.
Given an adversarial waveform as input, AudioPure first adds a small amount of noise via the diffusion process to override the adversarial perturbations, and then uses the truncated reverse process to recover a clean sample. The recovered sample is then fed into the classifier. We conduct extensive experiments to evaluate the robustness of our method on the task of speech command recognition. We carefully design adaptive attacks so that the attacker can accurately compute the full gradients, in order to properly evaluate the effectiveness of our method. In addition, we comprehensively evaluate the robustness of our method against different black-box attacks and the Expectation Over Transformation (EOT) attack. Our method shows better performance under both white-box and black-box attacks against diverse adversarial examples. Moreover, we evaluate the certified robustness of AudioPure via randomized smoothing, which offers a provable guarantee of model robustness against L2-bounded perturbations, and show that our method achieves better certified robustness than baselines. Specifically, our method obtains a significant improvement (up to +20% in robust accuracy) compared to adversarial training, and over 5% higher certified robust accuracy than baselines. To the best of our knowledge, we are the first to use diffusion models to enhance the security of acoustic systems and to investigate how the working domain of a defense affects adversarial robustness.

2 RELATED WORK

Adversarial attacks and defenses. Szegedy et al. (2014) introduce adversarial examples, which look similar to normal examples but fool neural networks into giving incorrect predictions. Usually, adversarial examples are constrained by an Lp norm to ensure imperceptibility. Recently, stronger attack methods have been emerging (Madry et al., 2018; Carlini & Wagner, 2017; Andriushchenko et al., 2020; Croce & Hein, 2020; Xiao et al., 2018a;b; 2019; 2022b;a; Cao et al., 2019b;a; 2022a).

Figure 1: The architecture of the whole acoustic system protected by AudioPure (black line in the figure) and the adaptive attack (orange line in the figure). AudioPure first adds noise to the adversarial audio and then runs the reverse process to recover purified audio. Next, the purified audio is transformed into the spectrogram, and the spectrogram is fed into the classifier to get predictions. The attacker updates the adversarial audio based on the gradients backpropagated through the SDE. Without AudioPure, the adversarial audio is transformed into the spectrogram and fed into the classifier directly.

In the audio domain, Carlini & Wagner (2018) introduce audio adversarial examples, and Qin et al. (2019) manage to make them more imperceptible. Black-box attacks (Du et al., 2020; Chen et al., 2021a) have also been developed, aiming to mislead end-to-end acoustic systems. In order to protect neural networks from adversarial attacks, different defense methods have been proposed. The most widely used one is adversarial training (Madry et al., 2018), which deliberately uses adversarial examples as the training data of neural networks. The main problems of adversarial training are the accuracy drop on benign examples and the expensive computational cost.
Many improved versions of adversarial training aim to alleviate these problems (Wong et al., 2020; Shafahi et al., 2019; Zhang et al., 2019b;a; Sun et al., 2021; Cao et al., 2022b; Zhang et al., 2019c). Another line of work is adversarial purification (Yoon et al., 2021; Shi et al., 2021; Nie et al., 2022), which uses generative models to remove adversarial perturbations before classification. Both types of defenses are mainly developed for computer vision tasks and cannot be directly applied to the audio domain. In this paper, we explicitly design a defense pipeline according to the characteristics of audio data.

Speech processing. Many speech processing applications are vulnerable to adversarial attacks, including speech command recognition (Warden, 2018), keyword spotting (Chen et al., 2014; Li et al., 2019), speaker identification (Reynolds et al., 2000; Ravanelli & Bengio, 2018; Snyder et al., 2018), and speech recognition (Amodei et al., 2016; Shen et al., 2019; Ravanelli et al., 2019). In particular, speech command recognition is closely related to keyword spotting and can be viewed as speech recognition with a limited vocabulary. In this work, we choose speech command recognition as the testbed for the proposed AudioPure pipeline; the pipeline is also applicable to keyword spotting and speech recognition. A speech command recognition system consists of a feature extractor and a classifier. The feature extractor processes the raw audio waveforms and outputs acoustic features, e.g., Mel spectrograms or Mel-frequency cepstral coefficients (MFCC). These features are fed into the classifier, and the classifier gives predictions. Given the 2-D spectrogram features, convolutional neural networks designed for images are readily applicable (Simonyan & Zisserman, 2015; He et al., 2016; Zagoruyko & Komodakis, 2016; Xie et al., 2017; Huang et al., 2017).

3.1 BACKGROUND OF DIFFUSION MODELS

A diffusion model normally consists of a forward diffusion process and a reverse sampling process. The forward diffusion process gradually adds Gaussian noise to the input data until the distribution of the noisy data converges to a standard Gaussian distribution. The reverse sampling process takes standard Gaussian noise as input and gradually denoises the noisy data to recover clean data. At present, diffusion models can be divided into two types: discrete-time diffusion models based on sequential sampling, such as SMLD (Song & Ermon, 2019), DDPM (Ho et al., 2020), and DDIM (Song et al., 2021a), and continuous-time diffusion models based on SDEs (Song et al., 2021c). Song et al. (2021c) also build the connection between these two types of diffusion models.

Denoising Diffusion Probabilistic Models (DDPM) (Ho et al., 2020) is one of the most widely used diffusion models. Many subsequently proposed diffusion models, including DiffWave for audio (Kong et al., 2021), are based on the DDPM formulation. In DDPM, both the diffusion and reverse processes are defined by Markov chains. For input data $x_0 \in \mathbb{R}^d$, we denote $x_0 \sim q(x_0)$ as the original data distribution, and $x_1, \ldots, x_N$ are intermediate latent variables from the distributions $q(x_1|x_0), \ldots, q(x_N|x_{N-1})$, where $N$ is the total number of steps.
Generally, with a pre-defined or learned variance schedule $\beta_1, \ldots, \beta_N$ (usually linearly increasing small constants), the forward transition probability $q(x_n|x_{n-1})$ can be formulated as:

$$q(x_n|x_{n-1}) = \mathcal{N}\big(x_n; \sqrt{1-\beta_n}\, x_{n-1}, \beta_n I\big). \quad (1)$$

Based on the variance schedule $\{\beta_n\}$, a set of constants is defined as:

$$\alpha_n = 1 - \beta_n, \qquad \bar\alpha_n = \prod_{i=1}^{n} \alpha_i, \qquad \tilde\beta_n = \begin{cases} \frac{1-\bar\alpha_{n-1}}{1-\bar\alpha_n}\,\beta_n, & n > 1, \\ \beta_1, & n = 1, \end{cases} \quad (2)$$

and using the reparameterization trick, we have:

$$q(x_n|x_0) = \mathcal{N}\big(x_n; \sqrt{\bar\alpha_n}\, x_0, (1-\bar\alpha_n) I\big). \quad (3)$$

As $n$ grows large, $q(x_n|x_0)$ converges to a standard Gaussian distribution. Meanwhile, for the reverse process, we have:

$$x_{n-1} \sim p_\theta(x_{n-1}|x_n) = \mathcal{N}\big(x_{n-1}; \mu_\theta(x_n, n), \sigma^2_\theta(x_n, n) I\big), \quad (4)$$

where the mean term $\mu_\theta(x_n, n)$ and the variance term $\sigma^2_\theta(x_n, n)$ are instantiated by parameters $\theta$. Ho et al. (2020); Kong et al. (2021) use a neural network $\epsilon_\theta$ to define $\mu_\theta$, and $\sigma_\theta$ is fixed to a constant:

$$\mu_\theta(x_n, n) = \frac{1}{\sqrt{\alpha_n}}\left(x_n - \frac{\beta_n}{\sqrt{1-\bar\alpha_n}}\,\epsilon_\theta(x_n, n)\right), \qquad \sigma_\theta(x_n, n) = \sqrt{\tilde\beta_n}. \quad (5)$$

We denote $x_n(x_0, \epsilon) = \sqrt{\bar\alpha_n}\, x_0 + \sqrt{1-\bar\alpha_n}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and the optimization objective is:

$$\theta^* = \arg\min_\theta \sum_{n=1}^{N} \lambda_n\, \mathbb{E}_{x_0, \epsilon}\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_n}\, x_0 + \sqrt{1-\bar\alpha_n}\,\epsilon,\ n\right) \right\|^2, \quad (6)$$

where $\lambda_n$ is a weighting coefficient (Ho et al., 2020). According to Song et al. (2021c), as $N \to \infty$, DDPM becomes VP-SDE, a continuous-time formulation of diffusion models. Particularly, the forward SDE is formulated as:

$$dx = -\tfrac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)}\, dw, \quad (7)$$

where $t \in [0, 1]$, $dt$ is an infinitesimal positive time step, $w$ is a standard Wiener process, and $\beta(t)$ is the continuous-time noise schedule. Similarly, the reverse SDE can be defined as:

$$dx = -\tfrac{1}{2}\beta(t)\left[x + 2\nabla_x \log p_t(x)\right] dt + \sqrt{\beta(t)}\, d\bar{w}, \quad (8)$$

where $dt$ is an infinitesimal negative time step, and $\bar{w}$ is a reverse-time standard Wiener process.

3.2 AUDIOPURE: A PLUG-AND-PLAY DEFENSE FOR ACOUSTIC SYSTEMS

To standardize the formulation of the defense, as suggested by Nie et al. (2022), we use the continuous-time formulation defined by Eq. 7 and Eq. 8. Note that since the existing pretrained DiffWave models (Kong et al., 2021) are based on DDPM, we use their equivalent VP-SDE. If we use the Euler-Maruyama method to solve the VP-SDE with step size $\Delta t = \frac{1}{N}$, the sampling of the reverse-time SDE is equivalent to the reverse sampling of DDPM (detailed proofs can be found in Song et al. (2021c)). Under this prerequisite, we have $t = \frac{n}{N}$ where $n \in \{1, \ldots, N\}$. We define $\beta(\frac{n}{N}) := \beta_n$, $\bar\alpha(\frac{n}{N}) := \bar\alpha_n$, $\tilde\beta(\frac{n}{N}) := \tilde\beta_n$, and $x(\frac{n}{N}) := x_n$.

Given an adversarial example $x_{adv}$ as the input at $t = 0$, i.e., $x(0) = x_{adv}$, we first run the forward SDE from $t = 0$ to $t^* = \frac{n^*}{N}$ by solving Eq. 7 (equivalent to running $n^*$ DDPM steps), which yields:

$$x(t^*) = \sqrt{\bar\alpha(t^*)}\, x_{adv} + \sqrt{1-\bar\alpha(t^*)}\, z, \quad z \sim \mathcal{N}(0, I). \quad (9)$$

Next, we run the truncated reverse SDE from $t = t^*$ to $t = 0$ by solving Eq. 8. Similar to Nie et al. (2022), we define an SDE solver sdeint that uses the Euler-Maruyama method and sequentially takes six inputs: initial value, drift coefficient, diffusion coefficient, Wiener process, initial time, and end time. The reverse output $\hat{x}(0)$ at $t = 0$ can be formulated as:

$$\hat{x}(0) = \text{sdeint}\big(x(t^*), f_{rev}, g_{rev}, \bar{w}, t^*, 0\big), \quad (10)$$

where the drift and diffusion coefficients are:

$$f_{rev}(x, t) := -\tfrac{1}{2}\beta(t)\left[x + 2 s_\theta(x, t)\right], \qquad g_{rev}(t) := \sqrt{\tilde\beta(t)}. \quad (11)$$

Note that we use a diffusion coefficient different from Nie et al. (2022) for the purpose of cleaner output (see the detailed explanation in Section 3.3). Next, we use the discrete-time noise estimator $\epsilon_\theta(x_n, n)$ to compute the continuous-time score estimator $s_\theta(x, t)$. By defining $\epsilon_\theta(x(t), t) := \epsilon_\theta(x(\frac{n}{N}), n) = \epsilon_\theta(x_n, n)$ with $t := \frac{n}{N}$, the score function in the reverse VP-SDE can be estimated as:

$$s_\theta(x, t) = -\frac{\epsilon_\theta(x, t)}{\sqrt{1-\bar\alpha(t)}} \approx \nabla_x \log p_t(x). \quad (12)$$
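To make the discrete schedule concrete, the following is a minimal sketch of the constants in Eq. 2 and the closed-form forward noising of Eqs. 3 and 9. The linear schedule shown here (N = 200, β1 = 1e-4, βN = 0.02) is an assumption for illustration; the pretrained DiffWave checkpoint may use a different schedule.

```python
import torch

# Constants of Eq. 2, assuming a linear beta schedule (illustrative values).
N = 200
betas = torch.linspace(1e-4, 0.02, N)                  # beta_1, ..., beta_N
alphas = 1.0 - betas                                   # alpha_n = 1 - beta_n
alphas_bar = torch.cumprod(alphas, dim=0)              # \bar{alpha}_n = prod_i alpha_i
betas_tilde = betas.clone()                            # \tilde{beta}_1 = beta_1
betas_tilde[1:] = (1 - alphas_bar[:-1]) / (1 - alphas_bar[1:]) * betas[1:]

def diffuse(x0: torch.Tensor, n_star: int) -> torch.Tensor:
    """Sample x_{n*} ~ q(x_{n*} | x_0) in closed form (Eq. 3 / Eq. 9)."""
    a_bar = alphas_bar[n_star - 1]
    z = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * z
```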
Accordingly, $\hat{x}(0)$, the purified output of the adversarial input $x(0) = x_{adv}$, is fed into the later stages of the acoustic system to make predictions. The whole purification operation can be denoted as a function $\text{Purifier}: \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^d$:

$$\text{Purifier}(x_{adv}, n^*) = \text{sdeint}\!\left(\sqrt{\bar\alpha(\tfrac{n^*}{N})}\, x_{adv} + \sqrt{1-\bar\alpha(\tfrac{n^*}{N})}\, z,\ f_{rev},\ g_{rev},\ \bar{w},\ \tfrac{n^*}{N},\ 0\right). \quad (13)$$

Acoustic systems are usually built on features extracted from the raw audio. For example, the system can extract the Mel spectrogram as the features: 1) it first applies the short-time Fourier transform (STFT) to the time-domain waveform to get a linear-scale spectrogram, and 2) it then rescales the frequency band to the Mel scale. We denote this process as $\text{Wave2Mel}: \mathbb{R}^d \to \mathbb{R}^{m \times n}$, which is a differentiable function. Then the classifier $F: \mathbb{R}^{m \times n} \to \mathbb{R}^c$ (usually a convolutional network) takes the Mel spectrogram as input and gives predictions. Since both the time-domain waveform and the time-frequency-domain spectrogram go through the pipeline, the purifier can be applied in either domain. If the purifier is applied in the time domain, the whole defended acoustic system $AS: \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^c$ can be formulated as:

$$AS(x_{adv}, n^*) = F\big(\text{Wave2Mel}(\text{Purifier}(x_{adv}, n^*))\big), \quad (14)$$

where the waveform Purifier is based on DiffWave. Meanwhile, if we want to purify the input adversarial examples in the time-frequency domain, we can choose a diffusion model used for image synthesis and apply it to the output spectrogram of Wave2Mel. We denote this purifier as $\text{Purifier}_{spec}: \mathbb{R}^{m \times n} \times \mathbb{R} \to \mathbb{R}^{m \times n}$. In this scenario, the whole defended acoustic system becomes:

$$AS(x_{adv}, n^*) = F\big(\text{Purifier}_{spec}(\text{Wave2Mel}(x_{adv}), n^*)\big). \quad (15)$$

The architecture of the whole pipeline is illustrated in Figure 1. For the purification of the time-frequency-domain spectrogram, we use an Improved DDPM (Nichol & Dhariwal, 2021) trained on the Mel spectrograms of audio data and denote it as DiffSpec. We compare these two purifiers and find that purification in the time-domain waveform is more effective at defending against adversarial audio; a code sketch of the waveform pipeline is given below. Detailed experimental results can be found in Sec. 4.2.
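The following is a minimal, hypothetical sketch of the waveform pipeline of Eqs. 13-14, written in the discrete DDPM form of the truncated reverse process and reusing the schedule constants and `diffuse` helper from the sketch above. The `eps_model` noise predictor (standing in for DiffWave) and `classifier` are assumed to exist; their call signatures are illustrative, not DiffWave's actual API.

```python
import torch
import torchaudio

# Wave2Mel: STFT + Mel rescaling (parameters such as n_mels are illustrative).
wave2mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)

@torch.no_grad()  # drop no_grad when gradients must flow for the adaptive attack
def purify(x_adv: torch.Tensor, n_star: int, eps_model) -> torch.Tensor:
    """Purifier(x_adv, n*): forward-diffuse n* steps, then run the truncated reverse process."""
    x = diffuse(x_adv, n_star)                              # forward process, Eq. 9
    for n in range(n_star, 0, -1):                          # truncated reverse process
        a, a_bar, b = alphas[n - 1], alphas_bar[n - 1], betas[n - 1]
        eps = eps_model(x, torch.tensor([n - 1]))           # hypothetical noise-predictor call
        mean = (x - b / (1 - a_bar).sqrt() * eps) / a.sqrt()  # mu_theta, Eq. 5
        x = mean + betas_tilde[n - 1].sqrt() * torch.randn_like(x) if n > 1 else mean
    return x

def defended_system(x_adv, n_star, eps_model, classifier):
    """AS(x_adv, n*) = F(Wave2Mel(Purifier(x_adv, n*))), Eq. 14."""
    return classifier(wave2mel(purify(x_adv, n_star, eps_model)))
```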
3.3 TOWARDS EVALUATING AUDIOPURE

Adaptive attack. For the forward diffusion process formulated as Eq. 9, the gradient of the output $x(t^*)$ w.r.t. the input $x(0)$ is a constant. For the reverse process formulated as Eq. 10, the adjoint method (Li et al., 2020) is applied to compute the full gradients of the objective function $L$ w.r.t. $x(t^*)$ without any out-of-memory issues, by solving another augmented SDE:

$$\begin{bmatrix} x(t^*) \\ \frac{\partial L}{\partial x(t^*)} \end{bmatrix} = \text{sdeint}\!\left( \begin{bmatrix} \hat{x}(0) \\ \frac{\partial L}{\partial \hat{x}(0)} \end{bmatrix},\ \begin{bmatrix} f_{rev} \\ \frac{\partial f_{rev}}{\partial x}\frac{\partial L}{\partial x} \end{bmatrix},\ \begin{bmatrix} g_{rev}\,\mathbf{1} \\ \mathbf{0} \end{bmatrix},\ \begin{bmatrix} \bar{w}(1-t) \\ \bar{w}(1-t) \end{bmatrix},\ 0,\ t^* \right), \quad (16)$$

where $\mathbf{1}$ and $\mathbf{0}$ represent the vectors of all ones and all zeros, respectively.

SDE modifications for clean output. We observe that directly applying the framework of Nie et al. (2022) to the audio domain causes performance degradation: when converting the discrete-time reverse process of DiffWave (Kong et al., 2021) to its corresponding reverse VP-SDE in Eq. 8, the output audio still contains much noise, resulting in lower classification accuracy. We identify two influencing factors and solve this problem by modifying the SDE formulation.

The first factor is the diffusion error due to the mismatch of the reverse variance between the discrete and continuous cases. Ho et al. (2020) observed that $\sigma^2_\theta = \tilde\beta_t$ and $\sigma^2_\theta = \beta_t$ give similar results experimentally in the image domain. However, we find that this is not the case when modeling audio with diffusion models. For audio synthesis using DiffWave trained with $\sigma^2_\theta = \tilde\beta_t$, if we switch the reverse variance schedule to $\sigma^2_\theta = \beta_t$, the output audio becomes noisy. Thus, in Sec. 3.2 we define $\tilde\beta(\frac{n}{N}) := \tilde\beta_n$ and use the diffusion coefficient $g_{rev} = \sqrt{\tilde\beta(t)}$ in Eq. 11 instead of $g_{rev} = \sqrt{\beta(t)}$, to match the variance $\tilde\beta_t$ in DiffWave.

The second factor is the inaccuracy of the continuous-time noise schedule $\beta(t) = \beta_0 + (\beta_N - \beta_0)t$ and $\bar\alpha(t) = e^{-\int_0^t \beta(s)ds}$ used by Nie et al. (2022). The difference between $\beta(t) = \beta_0 + (\beta_N - \beta_0)t$ and $\beta_{Nt}$ cannot be neglected, especially when $N$ is not large enough (e.g., $N = 200$ for the pretrained DiffWave model we use). Besides, when $t$ is close to 0, $\bar\alpha(t) = e^{-\int_0^t \beta(s)ds}$ is no longer a good approximation of $\bar\alpha_{Nt}$. Thus, we define the continuous-time noise schedule directly from the discrete schedule, namely $\beta(\frac{n}{N}) := \beta_n$ and $\bar\alpha(\frac{n}{N}) := \bar\alpha_n$, for the purpose of better denoised output and more accurate gradient computation.

4 EXPERIMENTS

In this section, we first introduce the detailed experimental settings. We then compare the performance of our method and other defenses under strong white-box adaptive attacks, where the attacker has full knowledge of the defense, as well as black-box attacks. To further demonstrate the robustness of our method, we also evaluate the certified accuracy via randomized smoothing (Cohen et al., 2019), which provides a provable guarantee of model robustness against L2-norm-bounded adversarial perturbations.

4.1 EXPERIMENTAL SETTINGS

Dataset. Our method is evaluated on the task of speech command recognition. We use the Speech Commands dataset (Warden, 2018), which consists of 85,511 training utterances, 10,102 validation utterances, and 4,890 test utterances. Following the setting of Kong et al. (2021), we choose the utterances that correspond to the digits 0-9 and denote this subset as SC09.

Models. We use DiffWave (Kong et al., 2021) and DiffSpec (based on Improved DDPM (Nichol & Dhariwal, 2021)) as our defensive purifiers; they are representative diffusion models in the waveform domain and the spectral domain, respectively. We use the unconditional version of DiffWave with the officially provided pretrained checkpoints. Since Improved DDPM does not provide a pretrained checkpoint for audio, we train it from scratch on the Mel spectrograms of audio from SC09. The training details and hyperparameters are in Appendix A. For the classifier, we use ResNeXt-29-8-64 (Xie et al., 2017) for the spectrogram representation and the M5 network (Dai et al., 2017) for the waveform, except in the ablation studies.

Table 1: Performance against adaptive attacks among different methods.

Defense | Clean | L∞ white-box PGD10/20/30/50/70/100 | L2 white-box PGD10/20/30/50/100 | Black-box FAKEBOB
None | 100 | 3 / 1 / 1 / 1 / 1 / 1 | 2 / 0 / 0 / 0 / 0 | 21
AS (Yang et al., 2019) | 100 | 4 / 2 / 1 / 1 / 1 / 1 | 1 / 0 / 0 / 0 / 0 | 24
MS (Yang et al., 2019) | 100 | 6 / 3 / 2 / 1 / 1 / 1 | 4 / 1 / 0 / 0 / 0 | 21
DS (Yang et al., 2019) | 99 | 2 / 1 / 1 / 1 / 1 / 1 | 0 / 0 / 0 / 0 / 0 | 16
LPF (Rajaratnam & Alshemali, 2018) | 100 | 5 / 2 / 1 / 1 / 1 / 1 | 2 / 0 / 0 / 0 / 0 | 20
BPF (Rajaratnam & Alshemali, 2018) | 99 | 5 / 1 / 1 / 1 / 1 / 1 | 1 / 1 / 0 / 0 / 0 | 18
AdvTr (Madry et al., 2018) | 100 | 86 / 79 / 78 / 74 / 72 / 71 | 73 / 70 / 68 / 65 / 65 | 92
AudioPure | 97 | 89 / 89 / 89 / 85 / 84 / 84 | 89 / 86 / 83 / 85 / 84 | 86

Attacks. For white-box attacks, we use PGD (Madry et al., 2018) with iteration steps ranging from 10 to 100 under the L∞ and L2 norms. The attack budget is set to ε = 0.002 for the L∞-norm constraint (except in the ablation study) and ε = 0.253 for the L2-norm constraint. For black-box attacks, we apply a query-based attack, FAKEBOB (Chen et al., 2021a), and set the number of iteration steps to 200, the number of NES samples to 200, and the confidence coefficient κ = 0.5.
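As a rough illustration of this attack setting, the sketch below runs L∞ PGD (with optional EOT averaging) against the full defended pipeline, assuming `defended_system(x)` is a closure over the pipeline sketched in Sec. 3.2 with a fixed n* and with gradients enabled inside the purifier. For brevity it backpropagates directly through the few discrete reverse steps rather than using the adjoint SDE of Eq. 16; the step size α and the assumed waveform range are illustrative.

```python
import torch
import torch.nn.functional as F

def pgd_linf_adaptive(x, y, defended_system, eps=0.002, alpha=4e-4, steps=10, eot=1):
    """L_inf PGD against the defended pipeline AS(.) of Eq. 14 (illustrative sketch)."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        # Average gradients over `eot` random draws of the purifier's noise (EOT).
        grad = torch.zeros_like(x_adv)
        for _ in range(eot):
            loss = F.cross_entropy(defended_system(x_adv), y)
            grad += torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project to the L_inf ball
            x_adv = x_adv.clamp(-1.0, 1.0)  # keep the waveform in a valid range (assumed [-1, 1])
    return x_adv.detach()
```

With steps = 10 and eps = 0.002 this corresponds to the L∞ PGD10 configuration above; larger step counts and EOT sizes reproduce the stronger settings evaluated in Sec. 4.3.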
Baselines. We compare our method with two types of baselines: (1) transformation-based defenses (Yang et al., 2019; Rajaratnam & Alshemali, 2018), including average smoothing (AS), median smoothing (MS), downsampling (DS), low-pass filtering (LPF), and band-pass filtering (BPF), and (2) an adversarial-training-based defense (AdvTr) (Madry et al., 2018). For adversarial training, we follow the setting of Chen et al. (2022), using L∞ PGD10 with ε = 0.002 and ratio = 0.5.

4.2 MAIN RESULTS

We evaluate AudioPure (n* = 3 by default) under adaptive attacks, assuming the attacker obtains full knowledge of our defense. We use the adaptive attack algorithm described in the previous section so that the attacker is able to accurately compute the full gradients for attacking. The results are shown in Table 1. We find that the baseline transformation-based defenses (Yang et al., 2019; Rajaratnam & Alshemali, 2018), including average smoothing (AS), median smoothing (MS), downsampling (DS), low-pass filtering (LPF), and band-pass filtering (BPF), are virtually broken by white-box attacks, retaining at most 4% robust accuracy. For the adversarial-training-based method (AdvTr) trained on L∞-norm adversarial examples, although it achieves 71% robust accuracy against L∞-based adversarial examples, it does not work as well on other types of adversarial examples (i.e., L2-based ones), achieving 65% robust accuracy under the L2-based PGD100 attack. Compared with all baselines, AudioPure obtains much higher robust accuracy, about 10% improvement on average, on L∞-based adversarial examples, and is equally effective against L2-based white-box attacks, achieving 84% robust accuracy.

We also evaluate AudioPure under black-box attacks, including FAKEBOB (Chen et al., 2021a) and transferability-based attacks. The results for FAKEBOB are shown in Table 1, indicating that our method remains effective under a query-based black-box attack. The results of the transferability-based attack are in Appendix B and lead to the same conclusion. These results further verify the effectiveness of our method. All results indicate that AudioPure can work under diverse attacks with different types of constraints, while adversarial training has to apply different training strategies and re-train the model, making it less effective against unseen attacks than our method. We report the actual inference time in Appendix J and compare our method with more existing methods in Appendices F, G, and H. Additionally, we conduct experiments on the Qualcomm Keyword Speech Dataset (Kim et al., 2019); the results and details are in Appendix E. On this dataset, our method is still effective against adversarial examples.

4.3 ABLATION STUDY

PGD steps. To ensure the effectiveness of the PGD attacks, we test different numbers of iteration steps from 10 to 150. As Figure 2a illustrates, the robust accuracy converges after roughly 70 iteration steps.
Effectiveness against the Expectation Over Transformation (EOT) attack. Since the diffusion and reverse processes of AudioPure involve many randomized sampling operations, we apply the expectation over transformation (EOT) attack to evaluate the effectiveness of AudioPure with different EOT sample sizes. Figure 2b shows the result: AudioPure remains effective across different EOT sizes.

Figure 2: The performance of the baseline (no defense, denoted as None), adversarial training (denoted as AdvTr), and AudioPure under attacks with different iteration steps and EOT sizes. (a) Robust accuracy with different PGD steps: the attack is almost optimal when iterating over 70 steps. (b) Robust accuracy with different EOT sizes: increasing the EOT size barely affects the robustness of our method.

Attack budget ε. We evaluate the effectiveness of our method for different budgets ε ∈ {0.002, 0.004, 0.008, 0.016}. Since the number of diffusion steps n* is a hyperparameter of AudioPure, we conduct experiments for different n*. As shown in Table 2, if n* is larger than 2, AudioPure shows strong effectiveness across different ε. When ε increases, a larger n* is required to achieve optimal robustness, since a larger adversarial perturbation requires larger noise from the forward process of the diffusion model to override it, together with correspondingly more reverse steps to recover purified audio. However, if n* is too large, the forward process overrides the original audio information as well, so the recovered audio loses the original content, contributing to a performance drop. Furthermore, we explore the limits of the diffusion model for purification in Appendix I.

Table 2: The robust accuracy under PGD10 with different attack budgets ε when using different reverse steps n*. A larger ε requires a larger n* to ensure better robustness.

Attack Budget | n* = 0 | n* = 1 | n* = 2 | n* = 3 | n* = 5 | n* = 7 | n* = 10
ε = 0.002 | 3 | 94 | 90 | 89 | 84 | 77 | 67
ε = 0.004 | 0 | 76 | 89 | 86 | 83 | 74 | 66
ε = 0.008 | 0 | 27 | 70 | 85 | 84 | 74 | 68
ε = 0.016 | 0 | 0 | 21 | 53 | 69 | 57 | 63

Architectures. Moreover, we apply AudioPure to different classifiers, including the spectrogram-based classifiers VGG-19-BN (Simonyan & Zisserman, 2015), ResNeXt-29-8-64 (Xie et al., 2017), WideResNet-28-10 (Zagoruyko & Komodakis, 2016), and DenseNet-BC-100-12 (Huang et al., 2017), and the waveform-based classifier M5 (Dai et al., 2017). Table 3 shows that our method is effective for various neural network classifiers.

Table 3: Ablation studies among different model architectures. The robust accuracy is evaluated under L∞ PGD70. Our method is effective on various models with different architectures.

Defense | ResNeXt-29-8-64 (Clean / Robust) | VGG-19-BN (Clean / Robust) | WideResNet-28-10 (Clean / Robust) | DenseNet-BC-100-12 (Clean / Robust) | M5 (Clean / Robust)
None | 100 / 1 | 100 / 2 | 100 / 1 | 100 / 5 | 94 / 12
AudioPure | 97 / 84 | 99 / 81 | 99 / 85 | 96 / 79 | 94 / 70

Audio representations. Audio has different types of representations, including raw waveforms and time-frequency representations (e.g., the Mel spectrogram). We conduct an ablation study on the effectiveness of diffusion models operating on different representations, comparing DiffWave, a diffusion model for waveforms (Kong et al., 2021), and DiffSpec, a diffusion model for spectrograms based on the original image model (Nichol & Dhariwal, 2021).
The results are shown in Table 4. We find that DiffWave consistently outperforms DiffSpec against both L2- and L∞-based adversarial examples. Moreover, although DiffSpec achieves higher clean accuracy, it only reaches 49% robust accuracy against L2-based adversarial examples, a significant 35% drop compared with DiffWave. We think the potential reason is that the short-time Fourier transform (STFT) is an operation of information compression: the spectrogram contains much less information than the raw audio waveform. This experiment shows that the working domain leads to significantly different results, and that directly applying the method from the image domain can lead to suboptimal performance for audio. It also verifies the crucial design choice of AudioPure for adversarial robustness.

Table 4: Ablation studies among different audio representations. We implement AudioPure using two different diffusion models as purifiers, DiffWave and DiffSpec, which respectively process the representations in the time domain and the time-frequency domain.

Defense | Clean | L∞ white-box PGD10/20/30/50/70/100 | L2 white-box PGD10/20/30/50/100
DiffWave | 97 | 89 / 89 / 89 / 85 / 84 / 84 | 89 / 86 / 83 / 85 / 84
DiffSpec | 99 | 92 / 84 / 78 / 75 / 72 / 71 | 74 / 62 / 58 / 54 / 49

4.4 CERTIFIED ROBUSTNESS

In this section, we evaluate the certified robustness of AudioPure via randomized smoothing (Cohen et al., 2019). We draw N = 100,000 noise samples and select noise levels σ ∈ {0.5, 1.0} for certification. Note that we follow the same setting as Carlini et al. (2022) and choose the one-shot denoising method. The detailed implementation of our method can be found in Appendix C. We compare our results with randomized smoothing using the vanilla classifier and the Gaussian-augmented classifier, denoted RS-Vanilla and RS-Gaussian respectively. The results are shown in Table 5. We also provide the certified robustness under different L2 perturbation budgets with Gaussian noise σ ∈ {0.5, 1.0} in Figure A of Appendix C. We find that our method outperforms the baselines in certified accuracy except at radius 0 with σ = 0.5. We also notice that the advantage of our method grows as the input noise gets larger. This may be because AudioPure can still recover clean audio under a large L2-based perturbation, while the Gaussian-augmented model may not even converge when trained with such large noise.

Table 5: Certified accuracy for different methods. For each noise level σ, we add the same level of noise to train the classifier used by RS-Gaussian.

Method | Noise level | Certified radius (L2): 0 / 0.25 / 0.50 / 0.75 / 1.0 / 1.25 / 1.50
RS-Vanilla | σ = 0.5 | 30 / 21 / 12 / 6 / 4 / 3 / 3
RS-Vanilla | σ = 1.0 | 8 / 8 / 8 / 7 / 4 / 3 / 3
RS-Gaussian | σ = 0.5 | 49 / 39 / 33 / 23 / 14 / 6 / 3
RS-Gaussian | σ = 1.0 | 18 / 15 / 11 / 10 / 5 / 5 / 4
AudioPure | σ = 0.5 | 45 / 40 / 35 / 27 / 21 / 17 / 13
AudioPure | σ = 1.0 | 27 / 22 / 16 / 15 / 12 / 11 / 8

5 CONCLUSION

In this paper, we propose an adversarial purification-based defense pipeline for acoustic systems. To evaluate the effectiveness of AudioPure, we design an adaptive attack method and evaluate our defense under adaptive attacks, EOT attacks, and black-box attacks. Comprehensive experiments indicate that our defense is more effective than existing methods (including adversarial training) across diverse types of adversarial examples. We also show that AudioPure achieves better certified robustness via randomized smoothing than other baselines.
Moreover, our defense can be a universal plug-and-play method for classifiers with different architectures. Limitations. Audio Pure introduces the diffusion model, which increases the time and computational cost. Thus, how to improve time and computational efficiency is an important future work. For example, it is interesting to investigate the distillation technique (Salimans & Ho, 2022) and fast sampling method (Kong & Ping, 2021) to reduce the computation complexity introduced by diffusion models. Published as a conference paper at ICLR 2023 ACKNOWLEDGMENT We thank Prof. Xiaolin Huang from Shanghai Jiao Tong University for the valuable discussions. Shutong Wu is partially supported by the National Natural Science Foundation of China (61977046), Shanghai Science and Technology Program (22511105600), and Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102). ETHICS STATEMENT Our work proposes a defense pipeline for protecting acoustic systems from adversarial audio examples. In particular, our study focuses on speech command recognition, which is closely related to keyword spotting systems. Such systems are well known to be vulnerable to adversarial attacks. Our pipeline will enhance the security aspect of such real-world acoustic systems and benefit the social beings. The Speech Commands dataset used in our study are released by others and has been publicly available for years. The dataset contains various voices from anonymous speakers. To the best of our knowledge, it does not contain any privacy-related information for these speakers. Hadi Abdullah, Muhammad Sajidur Rahman, Washington Garcia, Kevin Warren, Anurag Swarnim Yadav, Tom Shrimpton, and Patrick Traynor. Hear no evil , see kenansville *: Efficient and transferable black-box attacks on speech recognition and voice identification systems. In 2021 IEEE Symposium on Security and Privacy (SP), pp. 712 729. IEEE, 2021. Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-toend speech recognition in english and mandarin. In International conference on machine learning, pp. 173 182. PMLR, 2016. Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, and Matthias Hein. Square attack: a query-efficient black-box adversarial attack via random search. In European Conference on Computer Vision, pp. 484 501. Springer, 2020. Yulong Cao, Chaowei Xiao, Benjamin Cyr, Yimeng Zhou, Won Park, Sara Rampazzi, Qi Alfred Chen, Kevin Fu, and Z Morley Mao. Adversarial sensor attack on lidar-based perception in autonomous driving. CCS, 2019a. Yulong Cao, Chaowei Xiao, Dawei Yang, Jing Fang, Ruigang Yang, Mingyan Liu, and Bo Li. Adversarial objects against lidar-based autonomous driving systems. ar Xiv preprint ar Xiv:1907.05418, 2019b. Yulong Cao, Chaowei Xiao, Anima Anandkumar, Danfei Xu, and Marco Pavone. Advdo: Realistic adversarial attacks for trajectory prediction. In European Conference on Computer Vision, pp. 36 52. Springer, 2022a. Yulong Cao, Danfei Xu, Xinshuo Weng, Zhuoqing Mao, Anima Anandkumar, Chaowei Xiao, and Marco Pavone. Robust trajectory prediction against adversarial attacks. CORL (Oral), 2022b. Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp), pp. 39 57. Ieee, 2017. Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-totext. 
In 2018 IEEE security and privacy workshops (SPW), pp. 1 7. IEEE, 2018. Nicholas Carlini, Florian Tramer, J Zico Kolter, et al. (certified!!) adversarial robustness for free! ar Xiv preprint ar Xiv:2206.10550, 2022. Guangke Chen, Sen Chenb, Lingling Fan, Xiaoning Du, Zhe Zhao, Fu Song, and Yang Liu. Who is real bob? adversarial attacks on speaker recognition systems. In 2021 IEEE Symposium on Security and Privacy (SP), pp. 694 711. IEEE, 2021a. Published as a conference paper at ICLR 2023 Guangke Chen, Zhe Zhao, Fu Song, Sen Chen, Lingling Fan, Feng Wang, and Jiashui Wang. Towards understanding and mitigating audio adversarial examples for speaker recognition. IEEE Transactions on Dependable and Secure Computing, 2022. Guoguo Chen, Carolina Parada, and Georg Heigold. Small-footprint keyword spotting using deep neural networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4087 4091. IEEE, 2014. Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. In International Conference on Learning Representations, 2021b. Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310 1320. PMLR, 2019. Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International conference on machine learning, pp. 2206 2216. PMLR, 2020. Wei Dai, Chia Dai, Shuhui Qu, Juncheng Li, and Samarjit Das. Very deep convolutional neural networks for raw waveforms. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 421 425. IEEE, 2017. Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780 8794, 2021. Chris Donahue, Julian Mc Auley, and Miller Puckette. Adversarial audio synthesis. In International Conference on Learning Representations, 2018. Tianyu Du, Shouling Ji, Jinfeng Li, Qinchen Gu, Ting Wang, and Raheem Beyah. Sirenattack: Generating adversarial audio for end-to-end acoustic systems. In Proceedings of the 15th ACM Asia Conference on Computer and Communications Security, pp. 357 369, 2020. Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. ar Xiv preprint ar Xiv:1412.5068, 2014. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840 6851, 2020. Judy Hoffman, Daniel A Roberts, and Sho Yaida. Robust learning with jacobian regularization. ar Xiv preprint ar Xiv:1908.02729, 2019. Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700 4708, 2017. Sonal Joshi, Saurabh Kataria, Yiwen Shao, Piotr Zelasko, Jesus Villalba, Sanjeev Khudanpur, and Najim Dehak. 
Defense against adversarial attacks on hybrid speech recognition using joint adversarial fine-tuning with denoiser. ar Xiv preprint ar Xiv:2204.03851, 2022. Byeonggeun Kim, Mingu Lee, Jinkyu Lee, Yeonseok Kim, and Kyuwoong Hwang. Query-byexample on-device keyword spotting, 2019. Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. ar Xiv preprint ar Xiv:2106.00132, 2021. Published as a conference paper at ICLR 2023 Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In International Conference on Learning Representations, 2021. Juncheng Li, Shuhui Qu, Xinjian Li, Joseph Szurley, J Zico Kolter, and Florian Metze. Adversarial music: Real world audio adversary against wake-word detection system. Advances in Neural Information Processing Systems, 32, 2019. Xuechen Li, Ting-Kam Leonard Wong, Ricky TQ Chen, and David Duvenaud. Scalable gradients for stochastic differential equations. In International Conference on Artificial Intelligence and Statistics, pp. 3870 3882. PMLR, 2020. Yi Luo and Nima Mesgarani. Conv-tasnet: Surpassing ideal time frequency magnitude masking for speech separation. IEEE/ACM transactions on audio, speech, and language processing, 27(8): 1256 1266, 2019. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. Muzammal Naseer, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Fatih Porikli. A selfsupervised approach for adversarial robustness. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 262 271, 2020. Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162 8171. PMLR, 2021. Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Anima Anandkumar. Diffusion models for adversarial purification. In International conference on machine learning. PMLR, 2022. Yao Qin, Nicholas Carlini, Garrison Cottrell, Ian Goodfellow, and Colin Raffel. Imperceptible, robust, and targeted adversarial examples for automatic speech recognition. In International conference on machine learning, pp. 5231 5240. PMLR, 2019. Krishan Rajaratnam and Basemah Alshemali. Speech coding and audio preprocessing for mitigating and detecting audio adversarial examples on automatic speech recognition. http://cs.uccs. edu/ jkalita/work/reu/REU2018/07Rajaratnam.pdf, 2018. Mirco Ravanelli and Yoshua Bengio. Speaker recognition from raw waveform with sincnet. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 1021 1028. IEEE, 2018. Mirco Ravanelli, Titouan Parcollet, and Yoshua Bengio. The pytorch-kaldi speech recognition toolkit. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6465 6469. IEEE, 2019. Douglas A Reynolds, Thomas F Quatieri, and Robert B Dunn. Speaker verification using adapted gaussian mixture models. Digital signal processing, 10(1-3):19 41, 2000. Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. ar Xiv preprint ar Xiv:2202.00512, 2022. Hadi Salman, Mingjie Sun, Greg Yang, Ashish Kapoor, and J Zico Kolter. Denoised smoothing: A provable defense for pretrained classifiers. Advances in Neural Information Processing Systems, 33:21945 21957, 2020. 
Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-gan: Protecting classifiers against adversarial attacks using generative models. In International Conference on Learning Representations, 2018. Simo S arkk a and Arno Solin. Applied Stochastic Differential Equations, volume 10. Cambridge University Press, 2019. Published as a conference paper at ICLR 2023 Ali Shafahi, Mahyar Najibi, Mohammad Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! Advances in Neural Information Processing Systems, 32, 2019. Changhao Shan, Junbo Zhang, Yujun Wang, and Lei Xie. Attention-based end-to-end models for small-footprint keyword spotting. Proc. Interspeech 2018, pp. 2037 2041, 2018. Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, et al. Lingvo: a modular and scalable framework for sequence-to-sequence modeling. ar Xiv preprint ar Xiv:1902.08295, 2019. Changhao Shi, Chester Holtz, and Gal Mishne. Online adversarial purification based on selfsupervised learning. In International Conference on Learning Representations, 2021. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015. David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. Xvectors: Robust dnn embeddings for speaker recognition. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5329 5333. IEEE, 2018. Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021a. Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019. Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of scorebased diffusion models. Advances in Neural Information Processing Systems, 34:1415 1428, 2021b. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021c. Jiachen Sun, Yulong Cao, Christopher B Choy, Zhiding Yu, Anima Anandkumar, Zhuoqing Morley Mao, and Chaowei Xiao. Adversarially robust 3d point cloud recognition using self-supervisions. Neur IPS, 34:15498 15512, 2021. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. Florian Tramer and Dan Boneh. Adversarial training and robustness for multiple perturbations. Advances in Neural Information Processing Systems, 32, 2019. Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On adaptive attacks to adversarial example defenses. Advances in Neural Information Processing Systems, 33:1633 1645, 2020. Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. ar Xiv preprint ar Xiv:1804.03209, 2018. Eric Wong, Leslie Rice, and J Zico Kolter. Fast is better than free: Revisiting adversarial training. In International Conference on Learning Representations, 2020. Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. 
Generating adversarial examples with adversarial networks. In IJCAI, 2018a. Chaowei Xiao, Jun-Yan Zhu, Bo Li, Warren He, Mingyan Liu, and Dawn Song. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018b. Chaowei Xiao, Dawei Yang, Bo Li, Jia Deng, and Mingyan Liu. Meshadv: Adversarial meshes for visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6898 6907, 2019. Published as a conference paper at ICLR 2023 Chaowei Xiao, Zhongzhu Chen, Kun Jin, Jiongxiao Wang, Weili Nie, Mingyan Liu, Anima Anandkumar, Bo Li, and Dawn Song. Densepure: Understanding diffusion models towards adversarial robustness. ar Xiv preprint ar Xiv:2211.00322, 2022a. Chaowei Xiao, Xinlei Pan, Warren He, Jian Peng, Mingjie Sun, Jinfeng Yi, Mingyan Liu, Bo Li, and Dawn Song. Characterizing attacks on deep reinforcement learning. AAMAS, 2022b. Saining Xie, Ross Girshick, Piotr Doll ar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492 1500, 2017. Zhuolin Yang, Bo Li, Pin-Yu Chen, and Dawn Song. Characterizing audio adversarial examples using temporal dependency. In International Conference on Learning Representations, 2019. Jongmin Yoon, Sung Ju Hwang, and Juho Lee. Adversarial purification with score-based generative models. In International Conference on Machine Learning, pp. 12062 12072. PMLR, 2021. Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference 2016. British Machine Vision Association, 2016. Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, and Bin Dong. You only propagate once: Accelerating adversarial training via maximal principle. Advances in Neural Information Processing Systems, 32, 2019a. Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning, pp. 7472 7482. PMLR, 2019b. Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Duane Boning, and Cho-Jui Hsieh. Towards stable and efficient training of verifiably robust neural networks. ICLR, 2019c. Published as a conference paper at ICLR 2023 A DETAILS ON TRAINING THE IMPROVE DDPM. We train an Improved DDPM using the official repository (https://github.com/openai/improveddiffusion). For the UNet model, we set image size = 32, num channels = 3, and num res blocks = 128. For diffusion flags, we set N = 200, 尾1 = 0.0001, 尾N = 0.02 and use the linear variance schedule. For the model training, we set the learning rate to 1e 4 and the batch size to 230. The training loss has converged after 80,000 training steps, and we use this checkpoint to build our purifier. B ADDITIONAL EXPERIMENTS OF TRANSFER-BASED ATTACK We additionally evaluate our method under transfer-based attack, where we assume the attacker can only get the output logits of the acoustic system but have no knowledge about the used defense. We use model functional stealing to train a surrogate model. Specifically, we first feed input examples into the acoustic system consisting of Diff Wave and a Res Ne Xt classifier and get the output logits. Then we use these output logits of the acoustic system as labels and train a new surrogate Res Ne Xt model, which has the same architecture as the classifier in the acoustic system. The results are shown in Table A. The Stealing Acc. 
denotes the accuracy of the surrogate classifier when using the predictions of the defended acoustic system as ground truth. Transfer to Vanilla and Transfer to Defended refer to the undefended vanilla classifier and the defended acoustic system, respectively. The surrogate classifier is attacked to generate adversarial examples, and these adversarial examples are then transferred to evaluate the robustness of the undefended vanilla classifier and the defended acoustic system.

Table A: Transfer-based attack via model functional stealing. We train a surrogate model using the outputs of the defended acoustic system as labels. Adversarial examples are then generated by attacking the surrogate model and transferred to the undefended vanilla classifier and the defended acoustic system.

Stealing Target | Stealing Acc. | Transfer to Vanilla (Clean / Robust) | Transfer to Defended (Clean / Robust)
AudioPure (n* = 1) | 100 | 100 / 22 | 100 / 99
AudioPure (n* = 5) | 98 | 100 / 58 | 96 / 94

C DETAILS ABOUT CERTIFIED ROBUSTNESS

Randomized smoothing (Cohen et al., 2019) provides a provable robustness guarantee in the L2-norm by evaluating models under noise. Usually, the performance of the vanilla classifier degrades when it is fed Gaussian-perturbed inputs. To alleviate this problem, one can re-train a new network or fine-tune a pretrained network on Gaussian-augmented data, but both options require a lot of training time. Another way is to apply a denoiser before the vanilla classifier, named denoised smoothing (Salman et al., 2020). Since the reverse process of a diffusion model can be seen as a good denoiser, we can use a pretrained diffusion model as a plug-and-play method to make any model certifiably robust. For a given noise level σ, we can compute the corresponding diffusion step that adds the same level of noise to the input examples. The diffusion process can be reformulated as:

$$x_n = \sqrt{\bar\alpha_n}\, x_0 + \sqrt{1-\bar\alpha_n}\, z = \sqrt{\bar\alpha_n}\left(x_0 + \sqrt{\tfrac{1-\bar\alpha_n}{\bar\alpha_n}}\, z\right), \quad z \sim \mathcal{N}(0, I), \quad (S17)$$

while the noisy input $\hat{x}$ of randomized smoothing is

$$\hat{x} = x_0 + \sigma z, \quad z \sim \mathcal{N}(0, I). \quad (S18)$$

So we can obtain $n$ such that $\sqrt{\frac{1-\bar\alpha_n}{\bar\alpha_n}} = \sigma$ after multiplying the input $\hat{x}$ by the rescaling coefficient $\sqrt{\bar\alpha_n}$. According to Carlini et al. (2022), a single reverse step is able to recover an image with high accuracy for the classifier and largely saves computational time by directly recovering the data through

$$x_0 = \frac{1}{\sqrt{\bar\alpha_n}}\left(x_n - \sqrt{1-\bar\alpha_n}\,\epsilon_\theta\!\left(\sqrt{\bar\alpha_n}\,\hat{x},\ n\right)\right).$$

So we can just apply one-shot denoising instead of running full steps in our reverse process; a sketch is given below. Figure A shows the certified accuracy of AudioPure compared with RS-Gaussian and RS-Vanilla. The results show that the certified robustness of our method is consistently better than the baselines except at small certified radii when σ = 0.5.

Figure A: Certified robustness (L2) with different input noise levels σ (σ = 0.5 and σ = 1.0, N = 100,000). Larger σ ensures better robustness under larger perturbations, but the performance on benign inputs degrades.
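A minimal sketch of this one-shot denoising, reusing the `alphas_bar` schedule constants from the sketch in Sec. 3.1; the noise predictor `eps_model` and its call signature are again assumptions.

```python
import torch

def sigma_to_step(sigma: float) -> int:
    """Find the diffusion step n whose noise level matches sigma,
    i.e. sqrt((1 - abar_n) / abar_n) ~= sigma."""
    ratios = ((1 - alphas_bar) / alphas_bar).sqrt()
    return int(torch.argmin((ratios - sigma).abs()).item()) + 1

def one_shot_denoise(x_hat: torch.Tensor, sigma: float, eps_model) -> torch.Tensor:
    """One-shot denoising for randomized smoothing: rescale the sigma-noised input
    to the diffusion scale, predict the noise once, and recover x_0 directly."""
    n = sigma_to_step(sigma)
    a_bar = alphas_bar[n - 1]
    x_n = a_bar.sqrt() * x_hat                          # rescale x_hat = x_0 + sigma*z
    eps = eps_model(x_n, torch.tensor([n - 1]))         # hypothetical noise-predictor call
    return (x_n - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
```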
D THEORETICAL ANALYSIS ON THE PURIFICATION ABILITY

Theorem D.1. Assume that $p(x)$ and $q(x)$ are the data distributions of clean examples and of adversarial examples, respectively. We use $p_t$ and $q_t$ to denote the distributions of $x(t)$ when $x(0) \sim p(x)$ and when $x(0) \sim q(x)$, respectively. Then we have

$$\frac{\partial D_{KL}(p_t \| q_t)}{\partial t} \le 0, \quad (S19)$$

where the equality holds only if $p_t = q_t$.

This inequality indicates that as $t$ increases from 0 to 1, the KL divergence between $p_t$ and $q_t$ monotonically decreases. In other words, as the number of diffusion steps $n^*$ increases, more of the adversarial perturbation is removed. Considering that the original semantic information is also removed if $n^*$ is too large, which affects the clean accuracy, there is a trade-off when setting $n^*$ for the diffusion-model purifier.

Proof: Following Nie et al. (2022); Song et al. (2021b), we first formulate the Fokker-Planck equation (Särkkä & Solin, 2019) of the forward SDE in Eq. 7, where we define $f(x, t) := -\frac{1}{2}\beta(t)x$ and $g(t) := \sqrt{\beta(t)}$:

$$\frac{\partial p_t(x)}{\partial t} = -\nabla_x \cdot \left[ f(x, t)p_t(x) - \tfrac{1}{2}g^2(t)\nabla_x p_t(x) \right] = -\nabla_x \cdot \left[ f(x, t)p_t(x) - \tfrac{1}{2}g^2(t)p_t(x)\nabla_x \log p_t(x) \right] = \nabla_x \cdot \big(h_p(x, t)\,p_t(x)\big), \quad (S20)$$

where $h_p(x, t) := \tfrac{1}{2}g^2(t)\nabla_x \log p_t(x) - f(x, t)$. Assume $p_t$ and $q_t$ are smooth and fast-decaying, i.e., for any $i = 1, \ldots, d$ we have

$$\lim_{x_i \to \infty} p_t(x)\,\partial_{x_i} \log p_t(x) = 0, \qquad \lim_{x_i \to \infty} q_t(x)\,\partial_{x_i} \log q_t(x) = 0, \quad (S21)$$

for $x_i$, the $i$-th dimension of $x \in \mathbb{R}^d$. We can then differentiate the KL divergence:

$$\frac{\partial D_{KL}(p_t\|q_t)}{\partial t} = \frac{\partial}{\partial t}\int p_t(x)\log\frac{p_t(x)}{q_t(x)}\,dx = \int \nabla_x\cdot\big(h_p(x,t)p_t(x)\big)\log\frac{p_t(x)}{q_t(x)}\,dx - \int \frac{p_t(x)}{q_t(x)}\,\nabla_x\cdot\big(h_q(x,t)q_t(x)\big)\,dx$$
$$= -\int p_t(x)\big[h_p(x,t) - h_q(x,t)\big]^\top\big[\nabla_x\log p_t(x) - \nabla_x\log q_t(x)\big]\,dx = -\tfrac{1}{2}g^2(t)\int p_t(x)\big\|\nabla_x\log p_t(x) - \nabla_x\log q_t(x)\big\|_2^2\,dx = -\tfrac{1}{2}g^2(t)\,D_F(p_t\|q_t), \quad (S22)$$

where $D_F(p_t\|q_t)$ is the Fisher divergence. Since $g^2(t) = \beta(t) > 0$ and the Fisher divergence satisfies $D_F(p_t\|q_t) \ge 0$ with equality only if $p_t = q_t$, we obtain Eq. S19, where the equality holds only if $p_t = q_t$.

E EXPERIMENTS ON THE QUALCOMM KEYWORD SPEECH DATASET

In addition to the commonly used SC09, for a more comprehensive evaluation we also conduct experiments on the Qualcomm Keyword Speech Dataset (Kim et al., 2019), denoted as QKW in the following. QKW consists of 4,270 utterances belonging to four classes, with variable durations from 0.48s to 1.92s. We split them into a training set (3,770 utterances), a validation set (400 utterances), and a test set (100 utterances). To handle the variable-sized input, we train an Attention Recurrent Convolutional Network (Shan et al., 2018) and save the checkpoint with the highest accuracy on the validation set. We then fine-tune the DiffWave model on QKW for 50,000 steps, with a learning rate of 2e-4 and a per-GPU batch size of 2 on 3 GPUs. The results under L∞ PGD10 with ε = 0.002 are shown in Table B. We observe that AudioPure still achieves non-trivial robustness and handles audio with variable time duration well.

Table B: We apply AudioPure to the Qualcomm Keyword Speech Dataset. The number of diffusion steps n* is set to 2.

Defense | Clean | Robust
None | 100 | 0
AudioPure | 91 | 61

F FINE-TUNING ON ADVERSARIAL EXAMPLES

AudioPure takes advantage of pretrained diffusion models, and we wonder whether the purification performance would improve if the diffusion model were fine-tuned on adversarial examples. We therefore fine-tune the DiffWave model with self-supervised perturbations (SSP) (Naseer et al., 2020). Specifically, we use the STFT (rescaled to the Mel scale) as our feature extractor and maximize the following objective to generate perturbed examples:

$$\arg\max_{x'} \ \Delta(x, x') = \big\|\mathrm{STFT}(x) - \mathrm{STFT}(x')\big\|_2, \quad \text{s.t.}\ \|x - x'\|_\infty \le \epsilon, \quad (S23)$$

where $x$ is the clean example and $x'$ is the perturbed example. We then use projected gradient ascent to optimize the perturbed example:

$$x'_{t+1} = \mathrm{clip}\big(x'_t + \alpha\,\mathrm{sign}(\nabla_{x'}\Delta(x, x'_t)),\ x-\epsilon,\ x+\epsilon\big), \quad (S24)$$

for $t = 1, \ldots, T$. Here we use $T = 100$, $\epsilon = 0.002$, and $\alpha = 0.0004$.
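A rough sketch of this SSP generation step is given below, using the magnitude STFT from torch as the feature extractor; the FFT size and the small random initialization are assumptions made for illustration.

```python
import torch

def ssp_perturb(x: torch.Tensor, eps: float = 0.002, alpha: float = 0.0004,
                steps: int = 100, n_fft: int = 400) -> torch.Tensor:
    """Self-supervised perturbation (Eqs. S23-S24): maximize the STFT feature
    distortion under an L_inf budget, without using labels."""
    feat = lambda w: torch.stft(w, n_fft=n_fft, return_complex=True).abs()
    # Small random start so the first gradient of the distortion is non-zero (an assumption).
    x_p = (x + torch.empty_like(x).uniform_(-eps, eps)).detach()
    for _ in range(steps):
        x_p.requires_grad_(True)
        delta = (feat(x_p) - feat(x)).norm()          # feature distortion Delta(x, x')
        grad = torch.autograd.grad(delta, x_p)[0]
        with torch.no_grad():
            x_p = x_p + alpha * grad.sign()           # ascent step on Delta
            x_p = torch.min(torch.max(x_p, x - eps), x + eps)  # clip to the L_inf ball
    return x_p.detach()
```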
Next, we fine-tune the pretrained DiffWave model on the SSP examples, minimizing the following loss:

$L_{\mathrm{tuning}} = L_{\mathrm{audio}} + \lambda L_{\mathrm{feat}},$  (S25)
$L_{\mathrm{audio}} = \mathrm{MSE}\big(x, \mathrm{Purifier}(x', n)\big),$  (S26)
$L_{\mathrm{feat}} = \mathrm{MSE}\big(\mathrm{STFT}(x), \mathrm{STFT}(\mathrm{Purifier}(x', n))\big).$  (S27)

We choose λ = 0.1 and use SGD to optimize $L_{\mathrm{tuning}}$ with a learning rate of 1e-5. The results are shown in Table C. Fine-tuning does not improve the performance of AudioPure (with n = 3) under L∞ PGD-10 and PGD-70 with ε = 0.002. These results further verify the effectiveness of using pretrained models.

Table C: We fine-tune the pretrained DiffWave model on adversarial examples generated by SSP. After fine-tuning, the performance is not improved.

Defense                 Clean   PGD-10   PGD-70
None                    100     3        1
AudioPure               97      89       84
SSP-Tuned AudioPure     97      89       82

G COMPARISON WITH OTHER DENOISER-BASED DEFENSES

We compare AudioPure with Defense-GAN (Samangouei et al., 2018) and Joint Adversarial Fine-tuning (Joshi et al., 2022). Defense-GAN was originally designed to defend against adversarial images by searching for the latent noise that generates the image most similar to the adversarial counterpart; we adapt it to the audio domain, choosing WaveGAN (Donahue et al., 2018) as the GAN model in this pipeline. We train a WaveGAN on the SC09 dataset for 100 epochs, using the Adam optimizer with lr = 1e-3, β1 = 0.5, and β2 = 0.9. For Joint Adversarial Fine-tuning, we follow the setting of Joshi et al. (2022) and use a Conv-TasNet (Luo & Mesgarani, 2019) as the denoiser. Like Joshi et al. (2022), we craft an offline adversarial SC09 dataset against the pretrained classifier using L∞ PGD-100 attacks with ε = 0.002 (denoted as OffAdv-SC09), and train a Conv-TasNet model on OffAdv-SC09 for 30 epochs to obtain the pretrained denoiser. We denote the defense using the pretrained Conv-TasNet as CTN Baseline. Based on the adversarial examples generated by attacking the whole acoustic system, we then update only the Conv-TasNet denoiser while keeping the classifier frozen, and denote this method as CTN Adv-Finetune-Joint-frozen. During the adversarial tuning, we use the L∞ PGD-10 attack with ε = 0.002. After tuning for 1000 steps with a batch size of 20, we compute the clean and robust accuracy (under L∞ PGD-10 and PGD-70 with ε = 0.002) on the same test set used in our paper. We report the results in Table D. Defense-GAN based on WaveGAN does not work well in the audio domain, which shows the impact of domain differences on the final results and verifies the importance of our pipeline design. Moreover, the Conv-TasNet denoiser is less effective than the diffusion model against adaptive attacks, even after fine-tuning.

Table D: We compare AudioPure with different denoiser-based defenses. DiffWave proves to be a more effective purifier.

Defense                          Clean   PGD-10   PGD-70
None                             100     3        1
AudioPure                        97      89       84
Defense-GAN                      8       0        0
CTN Baseline                     98      13       1
CTN Adv-Finetune-Joint-frozen    90      52       41

H COMPARISON WITH THE REGULARIZATION-BASED DEFENSE

Gu & Rigazio (2014) and Hoffman et al. (2019) introduce the input-output Jacobian matrix of the network as a regularization term in the optimization objective, formulated as

$L(x_i, y_i) + \lambda \left\| \frac{\partial f(x_i)}{\partial x_i} \right\|_F^2,$

where $x_i \in \mathbb{R}^d$ is the input data, $y_i \in \mathbb{R}^n$ is the label, $L : \mathbb{R}^d \times \mathbb{R}^n \to \mathbb{R}$ is the original loss function, and $f : \mathbb{R}^d \to \mathbb{R}^n$ is the neural network. By minimizing the Frobenius norm of the Jacobian matrix, the adversarial robustness of the network is improved.
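A minimal sketch of training with such a Jacobian penalty is shown below; for simplicity it computes the exact per-example Jacobian via autograd, which is only practical for small models, whereas Hoffman et al. (2019) use a random-projection estimator of the Frobenius norm. The names `model`, `x`, `y`, and `lam` are placeholders.

```python
import torch
import torch.nn.functional as F
from torch.autograd.functional import jacobian

def jacobian_regularized_loss(model, x, y, lam):
    # Cross-entropy plus lam * ||J_f(x)||_F^2, where J_f is the input-output Jacobian.
    logits = model(x)
    ce = F.cross_entropy(logits, y)

    def f_single(inp):
        return model(inp.unsqueeze(0)).squeeze(0)

    jac_norm_sq = x.new_zeros(())
    for xi in x:  # exact per-example Jacobian; expensive, for illustration only
        J = jacobian(f_single, xi, create_graph=True)
        jac_norm_sq = jac_norm_sq + (J ** 2).sum()
    return ce + lam * jac_norm_sq / x.shape[0]
```

During training, this loss replaces the plain cross-entropy and is minimized with the usual optimizer; larger λ penalizes input sensitivity more strongly, generally at the cost of clean accuracy (cf. Table E below).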
For a more comprehensive study, we also compare AudioPure with this regularization-based method using different λ. The results are shown in Table E, where we denote the regularization-based defense as Jacobian-Reg.

Table E: We compare AudioPure with the regularization-based defense, using different λ.

Defense                       Clean   PGD-10   PGD-70
None                          100     3        1
AudioPure                     97      89       84
Jacobian-Reg (λ = 1e-8)       45      9        5
Jacobian-Reg (λ = 1e-9)       84      27       15
Jacobian-Reg (λ = 1e-10)      91      31       18
Jacobian-Reg (λ = 1e-11)      96      19       4

I EXPERIMENTS ON LARGER ATTACK BUDGETS

Besides the results for different ε in Table 2, we conduct additional experiments to explore the potential of the diffusion model for purification. We select ε ∈ {0.01, 0.02, 0.03, 0.04, 0.05} and set the diffusion steps n to 5. The results are shown in Table F. Our method still achieves 42% robust accuracy at ε = 0.03, which already introduces significant distortion to the audio, and it retains some ability to purify adversarial perturbations until ε reaches 0.05. We also visualize the audio waveforms under attacks with different ε in Figure B, where significant noise is easy to observe.

Table F: We explore the potential of DiffWave under larger attack budgets. The number of diffusion steps n is set to 5.

Attack Budget   ε = 0.01   ε = 0.02   ε = 0.03   ε = 0.04   ε = 0.05
Robust Acc.     82         67         42         14         0

J ADDITIONAL INFERENCE TIME COST

Due to the introduction of diffusion models, AudioPure brings additional time cost during inference. As shown in Table G, we compute the time cost per audio, averaged over 100 examples, where each example is around one second long. We evaluate it on an NVIDIA RTX 3090 GPU with an Intel Core i9-10920X CPU @ 3.50GHz and 64 GB RAM.

Table G: The inference time cost when using different diffusion steps n.

Diffusion Steps   n = 0    n = 1    n = 2    n = 3    n = 5    n = 7    n = 10
Time Cost (s)     0.0967   0.5522   0.7876   1.0162   1.4795   2.0125   2.6839

Figure B: Visualizations of the clean audio and adversarial audio with different attack budgets.
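As a side note, per-example latency numbers such as those in Table G can be measured with a simple GPU-synchronized timer like the sketch below; the wrapper name `audiopure` (purifier followed by classifier) and the warm-up count are illustrative assumptions.

```python
import time
import torch

@torch.no_grad()
def time_per_audio(audiopure, audios, warmup=5):
    # audios: list of ~1-second waveform tensors already on the GPU.
    for x in audios[:warmup]:      # warm-up runs so CUDA initialization is not timed
        audiopure(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for x in audios:
        audiopure(x)
    torch.cuda.synchronize()       # wait for all kernels before stopping the clock
    return (time.perf_counter() - start) / len(audios)
```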