# Diffusion Models for Adversarial Purification

Weili Nie 1, Brandon Guo 2, Yujia Huang 2, Chaowei Xiao 1 3, Arash Vahdat 1, Anima Anandkumar 1 2

1NVIDIA, 2Caltech, 3ASU. Correspondence to: Weili Nie. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

## Abstract

Adversarial purification refers to a class of defense methods that remove adversarial perturbations using a generative model. These methods do not make assumptions on the form of attack and the classification model, and thus can defend pre-existing classifiers against unseen threats. However, their performance currently falls behind adversarial training methods. In this work, we propose DiffPure, which uses diffusion models for adversarial purification: Given an adversarial example, we first diffuse it with a small amount of noise following a forward diffusion process, and then recover the clean image through a reverse generative process. To evaluate our method against strong adaptive attacks in an efficient and scalable way, we propose to use the adjoint method to compute full gradients of the reverse generative process. Extensive experiments on three image datasets (CIFAR-10, ImageNet and CelebA-HQ) with three classifier architectures (ResNet, WideResNet and ViT) demonstrate that our method achieves state-of-the-art results, outperforming current adversarial training and adversarial purification methods, often by a large margin. Project page: https://diffpure.github.io.

## 1. Introduction

Neural networks are vulnerable to adversarial attacks: adding imperceptible perturbations to the input can mislead trained neural networks into predicting incorrect classes (Szegedy et al., 2014; Goodfellow et al., 2015). There have been many works on defending neural networks against such adversarial attacks (Madry et al., 2018; Song et al., 2018; Gowal et al., 2020). Among them, adversarial training (Madry et al., 2018), which trains neural networks on adversarial examples, has become a standard defense form due to its
effectiveness (Zhang et al., 2019; Gowal et al., 2021). However, most adversarial training methods can only defend against the specific attack that they are trained with. Recent works on defending against unseen threats add a carefully designed threat model into the adversarial training pipeline, but they suffer from a significant performance drop (Laidlaw et al., 2021; Dolatabadi et al., 2021). Additionally, the computational complexity of adversarial training is usually higher than that of standard training (Wong et al., 2020).

In contrast, another class of defense methods, often termed adversarial purification (Shi et al., 2021; Yoon et al., 2021), relies on generative models to purify adversarially perturbed images before classification (Samangouei et al., 2018; Hill et al., 2021). Compared to adversarial training methods, adversarial purification can defend against unseen threats in a plug-and-play manner without re-training the classifiers. This is because the generative purification models are trained independently from both threat models and classifiers. Despite these advantages, their performance usually falls behind current adversarial training methods (Croce & Hein, 2020), in particular against adaptive attacks where the attacker has full knowledge of the defense method (Athalye et al., 2018; Tramer et al., 2020). This is usually attributed to the shortcomings of the generative models currently used for purification, such as mode collapse in GANs (Goodfellow et al., 2014), low sample quality in energy-based models (EBMs) (LeCun et al., 2006), and the lack of proper randomness (Pinot et al., 2020).

Recently, diffusion models have emerged as powerful generative models (Ho et al., 2020; Song et al., 2021b). These models have demonstrated strong sample quality, beating GANs in image generation (Dhariwal & Nichol, 2021; Vahdat et al., 2021). They have also exhibited strong mode coverage, indicated by high test likelihood (Song et al., 2021a). Diffusion models consist of two processes: (i) a forward diffusion process that converts data to noise by gradually adding noise to the input, and (ii) a reverse generative process that starts from noise and generates data by denoising one step at a time. Intuitively, in the generative process, diffusion models purify noisy samples, playing a role similar to that of a purification model. Their good generation quality and diversity ensure that the purified images closely follow the original distribution of clean data. Moreover, the stochasticity in diffusion models can make a powerful stochastic defense (He et al., 2019). These properties make diffusion models an ideal candidate for generative adversarial purification.

In this work, we propose a new adversarial purification method, termed DiffPure, that uses the forward and reverse processes of diffusion models to purify adversarial images, as illustrated in Figure 1. Specifically, given a pre-trained diffusion model, our method consists of two steps: (i) we first add noise to adversarial examples by following the forward process with a small diffusion timestep, and (ii) we then solve the reverse stochastic differential equation (SDE) to recover clean images from the diffused adversarial examples. An important design parameter in our method is the choice of diffusion timestep, since it controls the amount of noise added during the forward process. Our theoretical analysis reveals that the noise needs to be large enough to remove adversarial perturbations but not too large, as it would then destroy the label semantics of the purified images. Furthermore, strong adaptive attacks require gradient backpropagation through the SDE solver in our method, which suffers from a memory issue if implemented naively. Thus, we propose to use the adjoint method to efficiently calculate full gradients of the reverse SDE with a constant memory cost. We empirically compare our method against the latest adversarial training and adversarial purification methods on various strong adaptive attack benchmarks.
Extensive experiments on three datasets (i.e., CIFAR-10, ImageNet and CelebA-HQ) across multiple classifier architectures (i.e., ResNet, WideResNet and ViT) demonstrate the state-of-the-art performance of our method. For instance, compared to adversarial training methods against AutoAttack ℓ∞ (Croce & Hein, 2020), our method shows absolute improvements of up to +5.44% on CIFAR-10 and up to +7.68% on ImageNet in robust accuracy. Moreover, compared to the latest adversarial training methods against unseen threats, our method exhibits an even larger absolute improvement (up to +36% in robust accuracy). In comparison to adversarial purification methods against the BPDA+EOT attack (Hill et al., 2021), we obtain absolute improvements of +11.31% on CIFAR-10 and +15.63% on CelebA-HQ in robust accuracy. Finally, our ablation studies show that the optimal diffusion timestep remains small for most attacks, confirm the importance of proper noise injection in the forward and reverse processes for adversarial robustness, and demonstrate the better performance obtained by combining DiffPure with existing adversarial training methods.

We summarize our main contributions as follows:
- We propose DiffPure, the first adversarial purification method that uses the forward and reverse processes of pre-trained diffusion models to purify adversarial images.
- We provide a theoretical analysis of the amount of noise added in the forward process such that it removes adversarial perturbations without destroying label semantics.
- We propose to use the adjoint method to efficiently compute full gradients of the reverse generative process in our method for evaluation against strong adaptive attacks.
- We perform extensive experiments to demonstrate that our method achieves the new state-of-the-art on various adaptive attack benchmarks.

[Figure 1: Forward SDE takes the adversarial image at t = 0 to a diffused image at t = t*; the Reverse SDE recovers the purified image; the adversarial attack backpropagates through the SDE.]

Figure 1. An illustration of DiffPure. Given a pre-trained diffusion model, we add noise to adversarial images following the forward diffusion process with a small diffusion timestep t* to get diffused images, from which we recover clean images through the reverse denoising process before classification. Adaptive attacks backpropagate through the SDE to get full gradients of our defense system.

## 2. Background

In this section, we briefly review continuous-time diffusion models (Song et al., 2021b). Denote by p(x) the unknown data distribution, from which each data point x ∈ R^d is sampled. Diffusion models diffuse p(x) towards a noise distribution. The forward diffusion process {x(t)}_{t∈[0,1]} is defined by an SDE with positive time increments over a fixed time horizon [0, 1]:

$$dx = f(x, t)\,dt + g(t)\,dw \qquad (1)$$

where the initial value x(0) := x ∼ p(x), f : R^d × R → R^d is the drift coefficient, g : R → R is the diffusion coefficient, and w(t) ∈ R^d is a standard Wiener process. Denote by p_t(x) the marginal distribution of x(t) with p_0(x) := p(x). In particular, f(x, t) and g(t) can be properly designed such that at the end of the diffusion process, x(1) follows the standard Gaussian distribution, i.e., p_1(x) ≈ N(0, I_d). Throughout the paper, we consider the VP-SDE (Song et al., 2021b) as our diffusion model, where f(x, t) := −(1/2)β(t)x and g(t) := √β(t), with β(t) representing a time-dependent noise scale. By default, we use the linear noise schedule, i.e., β(t) := β_min + (β_max − β_min)t. Sample generation is done using the reverse-time SDE:

$$d\hat{x} = \big[f(\hat{x}, t) - g(t)^2 \nabla_{\hat{x}} \log p_t(\hat{x})\big]\,dt + g(t)\,d\bar{w} \qquad (2)$$

where dt is an infinitesimal negative time step, and w̄(t) is a standard reverse-time Wiener process.
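As a concrete reference for the VP-SDE used throughout the paper, the sketch below writes out the linear noise schedule and the drift and diffusion coefficients f(x, t) and g(t) defined above. The schedule constants β_min = 0.1 and β_max = 20 are common VP-SDE defaults and are an assumption of this sketch, not values stated in this section.

```python
import torch

# Linear noise schedule and VP-SDE coefficients (sketch).
# beta_min = 0.1 and beta_max = 20 are assumed defaults, not values from this section.
BETA_MIN, BETA_MAX = 0.1, 20.0

def beta(t: torch.Tensor) -> torch.Tensor:
    """Linear schedule: beta(t) = beta_min + (beta_max - beta_min) * t."""
    return BETA_MIN + (BETA_MAX - BETA_MIN) * t

def f(x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Forward drift of the VP-SDE: f(x, t) = -1/2 * beta(t) * x."""
    return -0.5 * beta(t) * x

def g(t: torch.Tensor) -> torch.Tensor:
    """Forward diffusion coefficient of the VP-SDE: g(t) = sqrt(beta(t))."""
    return torch.sqrt(beta(t))
```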
Sampling x̂(1) ∼ N(0, I_d) as the initial value and solving the above SDE from t=1 to t=0 gradually produces less-noisy data x̂(t) until we draw samples from the data distribution, i.e., x̂(0) ∼ p_0(x). Ideally, the resulting denoising process {x̂(t)}_{t∈[0,1]} from Eq. (2) has the same distribution as the forward process {x(t)}_{t∈[0,1]} obtained from Eq. (1).

The reverse-time SDE in Eq. (2) requires knowledge of the time-dependent score function ∇_x log p_t(x). One popular approach is to estimate ∇_x log p_t(x) with a parameterized neural network s_θ(x, t) (Song et al., 2021b; Kingma et al., 2021). Accordingly, diffusion models are trained with a weighted combination of denoising score matching (DSM) losses across multiple time steps (Vincent, 2011):

$$\int_0^1 \mathbb{E}_{p(x)\,p_{0t}(\tilde{x}|x)}\Big[\lambda(t)\,\big\|\nabla_{\tilde{x}} \log p_{0t}(\tilde{x}\,|\,x) - s_\theta(\tilde{x}, t)\big\|_2^2\Big]\, dt$$

where λ(t) is the weighting coefficient, and p_{0t}(x̃|x) is the transition probability from x(0) := x to x(t) := x̃, which has a closed form through the forward SDE in Eq. (1).

## 3. Method

We first propose diffusion purification (or DiffPure for short), which adds noise to adversarial images following the forward process of diffusion models to get diffused images, from which clean images are recovered through the reverse process. We also introduce theoretical justifications of our method (Section 3.1). Next, we apply the adjoint method to backpropagate through the SDE for efficient gradient evaluation with strong adaptive attacks (Section 3.2).

### 3.1. Diffusion Purification

Since the role of the forward SDE in Eq. (1) is to gradually remove the local structures of data by adding noise, we hypothesize that, given an adversarial example x_a, if we start the forward process with x(0) = x_a, the adversarial perturbations, a form of small local structures added to the data, will also be gradually smoothed. The following theorem confirms that the clean data distribution p(x) and the adversarially perturbed data distribution q(x) get closer over the forward diffusion process, implying that the adversarial perturbations will indeed be washed out by the increasingly added noise.

Theorem 3.1. Let {x(t)}_{t∈[0,1]} be the diffusion process defined by the forward SDE in Eq. (1). If we denote by p_t and q_t the respective distributions of x(t) when x(0) ∼ p(x) (i.e., the clean data distribution) and x(0) ∼ q(x) (i.e., the adversarial sample distribution), we then have

$$\frac{\partial D_{\mathrm{KL}}(p_t \,\|\, q_t)}{\partial t} \le 0,$$

where the equality holds only when p_t = q_t. That is, the KL divergence between p_t and q_t monotonically decreases when moving from t=0 to t=1 through the forward SDE.

The proof follows (Song et al., 2021a; Lyu, 2009), which build connections between the Fisher divergence and the rate of change of the KL divergence by generalizing de Bruijn's identity (Barron, 1986); we defer it to Appendix A.1. From the above theorem, there exists a minimum timestep t* ∈ [0, 1] such that D_KL(p_{t*}‖q_{t*}) ≤ ε. However, the diffused adversarial sample x(t*) ∼ q_{t*} at timestep t=t* contains additional noise and cannot be directly classified. Hence, starting from x(t*), we can stochastically recover the clean data at t=0 through the SDE in Eq. (2).

Diffusion purification: Inspired by the observation above, we propose a two-step adversarial purification method using diffusion models. Given an adversarial example x_a at timestep t=0, i.e., x(0) = x_a, we first diffuse it by solving the forward SDE in Eq. (1) from t=0 to t=t*. For the VP-SDE, the diffused adversarial sample at the diffusion timestep t* ∈ [0, 1] can be sampled efficiently using:

$$x(t^*) = \sqrt{\alpha(t^*)}\, x_a + \sqrt{1 - \alpha(t^*)}\, \epsilon \qquad (3)$$

where α(t) = e^{−∫_0^t β(s) ds} and ϵ ∼ N(0, I_d).
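Before moving to the reverse step described next, here is a minimal sketch of the forward diffusion step in Eq. (3) under the linear schedule, where ∫_0^t β(s) ds has the closed form β_min t + ½(β_max − β_min)t²; the schedule constants are assumed defaults of this sketch.

```python
import math
import torch

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear-schedule constants

def alpha(t: float) -> float:
    """alpha(t) = exp(-int_0^t beta(s) ds) for the linear schedule."""
    return math.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2))

def diffuse(x_adv: torch.Tensor, t_star: float) -> torch.Tensor:
    """Forward diffusion step of Eq. (3):
    x(t*) = sqrt(alpha(t*)) * x_adv + sqrt(1 - alpha(t*)) * eps, eps ~ N(0, I)."""
    a = alpha(t_star)
    eps = torch.randn_like(x_adv)
    return (a ** 0.5) * x_adv + ((1.0 - a) ** 0.5) * eps

# Example: diffuse an adversarial batch with a small timestep.
# x_t = diffuse(x_adv, t_star=0.1)
```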
Second, we solve the reverse-time SDE in Eq. (2) from the timestep t=t*, using the diffused adversarial sample x(t*) given by Eq. (3) as the initial value, to get the final solution x̂(0). As x̂(0) does not have a closed-form solution, we resort to an SDE solver, termed sdeint (usually with the Euler–Maruyama discretization (Kloeden & Platen, 1992)). That is,

$$\hat{x}(0) = \mathrm{sdeint}\big(x(t^*), f_{\mathrm{rev}}, g_{\mathrm{rev}}, \bar{w}, t^*, 0\big) \qquad (4)$$

where sdeint is defined to sequentially take in six inputs: initial value, drift coefficient, diffusion coefficient, Wiener process, initial time, and end time. The above drift and diffusion coefficients are given by

$$f_{\mathrm{rev}}(x, t) := -\tfrac{1}{2}\beta(t)\big[x + 2 s_\theta(x, t)\big], \qquad g_{\mathrm{rev}}(t) := \sqrt{\beta(t)}. \qquad (5)$$

The resulting purified data x̂(0) is then passed to an external standard classifier to make predictions. An illustration of our method is shown in Figure 1.

Choosing the diffusion timestep t*: From Theorem 3.1, t* should be large enough to remove local adversarial perturbations. However, t* cannot be arbitrarily large because the global label semantics will also be removed by the diffusion process if t* keeps increasing. As a result, the purified sample x̂(0) could no longer be classified correctly. Formally, the following theorem characterizes how the diffusion timestep t* affects the difference between the clean image x and the purified image x̂(0) obtained by our method.

Theorem 3.2. Assume the score function satisfies ‖s_θ(x, t)‖ ≤ (1/2)C_s. Then, with probability at least 1 − δ, the L2 distance between the clean data x and the purified data x̂(0) given by Eq. (4) satisfies

$$\|\hat{x}(0) - x\| \le \|\epsilon_a\| + \sqrt{e^{2\gamma(t^*)} - 1}\, C_\delta + \gamma(t^*)\, C_s$$

where ϵ_a denotes the adversarial perturbation satisfying x_a = x + ϵ_a, γ(t*) := ∫_0^{t*} (1/2)β(s) ds, and the constant C_δ := √(2d + 4√(d log(1/δ)) + 4 log(1/δ)).

See Appendix A.2 for the proof. Since γ(t*) monotonically increases with t* and γ(t*) ≥ 0 for all t*, the last two terms in the above upper bound both increase with t*. Thus, to make ‖x̂(0) − x‖ as small as possible, t* needs to be sufficiently small. In the extreme case where t*=0, we have the equality ‖x̂(0) − x‖ = ‖ϵ_a‖, which means x̂(0) reduces to x_a if we do not perform diffusion purification. Due to the trade-off between purifying the local perturbations (with a larger t*) and preserving the global structures (with a smaller t*) of adversarial examples, there exists a sweet spot for the diffusion timestep t* that obtains a high robust classification accuracy. Since adversarial perturbations are usually small and can be removed with a small t*, the best t* in most adversarial robustness tasks also remains relatively small. As a proof of concept, we provide visual examples in Figure 2 to show how our method purifies the adversarial perturbations while maintaining the global semantic structures. See Appendix C.5 for more results.

### 3.2. Adaptive Attack to Diffusion Purification

Strong adaptive attacks (Athalye et al., 2018; Tramer et al., 2020) require computing full gradients of our defense system. However, simply backpropagating through the SDE solver in Eq. (4) scales poorly in computational memory. In particular, if we denote by N the number of function evaluations in solving the SDE, the required memory grows as O(N). This issue makes it challenging to effectively evaluate our method with strong adaptive attacks. Prior adversarial purification methods (Shi et al., 2021; Yoon et al., 2021) suffer from the same memory issue with strong adaptive attacks.
Thus, they either evaluate only with black-box attacks or change the evaluation strategy to circumvent the full gradient computation (e.g., using approximate gradients). This makes them difficult to compare with adversarial training methods under the more standard evaluation protocols (e.g., AutoAttack). To overcome this, we propose to use the adjoint method (Li et al., 2020) to efficiently compute full gradients of the SDE without the memory issue. The intuition is that the gradient through an SDE can be obtained by solving another, augmented SDE. The following proposition provides the augmented SDE for calculating the gradient of an objective L w.r.t. the input x(t*) of the SDE in Eq. (4).

[Figure 2: columns show Adversarial, t=0.3, t=0.15, t=0, and Clean images for (a) Smiling and (b) Eyeglasses.]

Figure 2. Our method purifies adversarial examples (first column) produced by attacking attribute classifiers using PGD ℓ∞ (ϵ = 16/255), where t* = 0.3. The middle three columns show the results of the SDE in Eq. (4) at different timesteps, and we observe that the purified images at t=0 match the clean images (last column). Zoom in to see how we remove adversarial perturbations.

Proposition 3.3. For the SDE in Eq. (4), the augmented SDE that computes the gradient ∂L/∂x(t*) of backpropagating through it is given by

$$\begin{bmatrix} x(t^*) \\ \frac{\partial L}{\partial x(t^*)} \end{bmatrix} = \mathrm{sdeint}\left( \begin{bmatrix} \hat{x}(0) \\ \frac{\partial L}{\partial \hat{x}(0)} \end{bmatrix}, \tilde{f}, \tilde{g}, \tilde{w}, 0, t^* \right) \qquad (6)$$

where ∂L/∂x̂(0) is the gradient of the objective L w.r.t. the output x̂(0) of the SDE in Eq. (4), and

$$\tilde{f}([x; z], t) = \begin{bmatrix} f_{\mathrm{rev}}(x, t) \\ -\frac{\partial f_{\mathrm{rev}}(x, t)}{\partial x} z \end{bmatrix}, \quad \tilde{g}(t) = \begin{bmatrix} g_{\mathrm{rev}}(t)\,\mathbf{1}_d \\ \mathbf{0}_d \end{bmatrix}, \quad \tilde{w}(t) = \begin{bmatrix} w(1-t) \\ w(1-t) \end{bmatrix}$$

with 1_d and 0_d representing the d-dimensional vectors of all ones and all zeros, respectively.

The proof is deferred to Appendix A.3. Ideally, if the SDE solver has a small numerical error, the gradient obtained from this proposition will closely match its true value (see Appendix B.5). As the gradient computation has been converted to solving the augmented SDE in Eq. (6), we do not need to store intermediate operations and thus end up with an O(1) memory cost (Li et al., 2020). That is, the adjoint method described above turns the reverse-time SDE in Eq. (4) into a differentiable operation (without the memory issue). Since the forward diffusion step in Eq. (3) is also differentiable using the reparameterization trick, we can easily compute full gradients of a loss function with respect to the adversarial images for strong adaptive attacks.

## 4. Related Work

Adversarial training. Adversarial training learns a robust classifier by training on adversarial examples created during every weight update. After being first introduced by Madry et al. (2018), adversarial training has become one of the most successful defense methods for neural networks against adversarial attacks (Gowal et al., 2020; Rebuffi et al., 2021). Despite the difference in the defense form, some variants of adversarial training share similarities with our method. He et al. (2019) inject Gaussian noise into each network layer for better robustness via stochastic effects. Kang et al. (2021) train neural ODEs with Lyapunov-stable equilibrium points for adversarial defense. Gowal et al. (2021) use generative models for data augmentation to improve adversarial training, where diffusion models work the best.

Adversarial purification. Using generative models to purify adversarial images before classification, adversarial purification has become a promising counterpart of adversarial training. In particular, Samangouei et al. (2018) propose Defense-GAN, using GANs as the purification model, and Song et al.
(2018) propose PixelDefend, relying on autoregressive generative models. More recently, Du & Mordatch (2019); Grathwohl et al. (2020); Hill et al. (2021) show the improved robustness of using EBMs to purify attacked images via Langevin dynamics (LD). Closest to our work, Yoon et al. (2021) use a denoising score-based model (Song & Ermon, 2019) for purification, but its sampling is still a variant of LD that does not rely on the forward diffusion and backward denoising processes. We empirically compare our method against these previous works and largely outperform them.

Image editing with diffusion models. As probabilistic generative models for unsupervised modeling (Ho et al., 2020), diffusion models have shown strong sample quality and diversity in image synthesis (Dhariwal & Nichol, 2021; Song et al., 2021a). Since then, they have been used in many image editing tasks, such as image-to-image translation (Meng et al., 2021; Choi et al., 2021; Saharia et al., 2021) and text-guided image editing (Kim & Ye, 2021; Nichol et al., 2021; Ramesh et al., 2022). Although adversarial purification can be considered a special image editing task, and DiffPure in particular shares a similar procedure with SDEdit (Meng et al., 2021), none of these works apply diffusion models to improve model robustness. Besides, evaluating our method with strong adaptive attacks poses a new challenge of backpropagating through the denoising process that previous works do not deal with.

## 5. Experiments

In this section, we first provide the experimental settings (Section 5.1). On various strong adaptive attack benchmarks, we then compare our method with the state-of-the-art adversarial training and adversarial purification methods (Sections 5.2 to 5.4). Note that we defer the results of our method against the standard attack (i.e., non-adaptive) and the black-box attack, suggested by Croce et al. (2022), to Appendix C.1 for completeness. Finally, we perform various ablation studies to provide better insights into our method (Section 5.5).

### 5.1. Experimental Settings

Datasets and network architectures. We consider three datasets for evaluation: CIFAR-10 (Krizhevsky, 2009), CelebA-HQ (Karras et al., 2018), and ImageNet (Deng et al., 2009). In particular, we compare with the state-of-the-art defense methods reported by the standardized benchmark RobustBench (Croce et al., 2020) on CIFAR-10 and ImageNet, while comparing with other adversarial purification methods on CIFAR-10 and CelebA-HQ following their settings. For classifiers, we consider three widely used architectures: ResNet (He et al., 2016), WideResNet (Zagoruyko & Komodakis, 2016) and ViT (Dosovitskiy et al., 2021).

Adversarial attacks. We evaluate our method with strong adaptive attacks. We use the commonly used AutoAttack ℓ∞ and ℓ2 threat models (Croce & Hein, 2020) to compare with adversarial training methods. To show the broader applicability of our method beyond ℓp-norm attacks, we also evaluate with spatially transformed adversarial examples (StAdv) (Xiao et al., 2018). Due to the stochasticity introduced by the diffusion and denoising processes (Section 3.1), we apply Expectation Over Transformation (EOT) (Athalye et al., 2018) to these adaptive attacks, where we use EOT=20 (see Figure 6 for more details). Besides, we apply the BPDA+EOT attack (Hill et al., 2021) to make a fair comparison with other adversarial purification methods.
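As a minimal illustration of how EOT is combined with gradient-based adaptive attacks, the sketch below averages input gradients over repeated stochastic runs of the defense; `defense` and `classifier` are hypothetical placeholders, and the loop count follows the EOT=20 setting above.

```python
import torch
import torch.nn.functional as F

def eot_gradient(x_adv, y, defense, classifier, n_eot=20):
    """Expectation Over Transformation: average the input gradient over
    n_eot independent runs of the stochastic purification defense."""
    grad = torch.zeros_like(x_adv)
    for _ in range(n_eot):
        x = x_adv.clone().detach().requires_grad_(True)
        logits = classifier(defense(x))      # defense is stochastic (diffuse + denoise)
        loss = F.cross_entropy(logits, y)
        grad += torch.autograd.grad(loss, x)[0]
    return grad / n_eot

# A PGD-style attack would then take a signed step, e.g.:
# x_adv = (x_adv + step_size * eot_gradient(x_adv, y, defense, classifier).sign()).clamp(0, 1)
```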
Evaluation metrics. We consider two metrics to evaluate the performance of defense approaches: standard accuracy and robust accuracy. Standard accuracy measures the performance of the defense method on clean data and is evaluated on the whole test set of each dataset. Robust accuracy measures the performance on adversarial examples generated by adaptive attacks. Due to the high computational cost of applying adaptive attacks to our method, unless stated otherwise, we evaluate robust accuracy for our method and previous works on a fixed subset of 512 images randomly sampled from the test set. Notably, the robust accuracies of most baselines do not change much on the sampled subset compared to the whole test set (see Appendix C.2). We defer more details of the above experimental settings and the baselines that we compare with to Appendix B.

Table 1. Standard accuracy and robust accuracy against AutoAttack ℓ∞ (ϵ = 8/255) on CIFAR-10, obtained by different classifier architectures. In our method, the diffusion timestep is t* = 0.1.

| Method | Standard Acc | Robust Acc |
|---|---|---|
| **WideResNet-28-10** | | |
| (Zhang et al., 2020) | 89.36 | 59.96 |
| (Wu et al., 2020) | 88.25 | 62.11 |
| (Gowal et al., 2020) | 89.48 | 62.70 |
| (Wu et al., 2020) | 85.36 | 59.18 |
| (Rebuffi et al., 2021) | 87.33 | 61.72 |
| (Gowal et al., 2021) | 87.50 | 65.24 |
| Ours | 89.02±0.21 | 70.64±0.39 |
| **WideResNet-70-16** | | |
| (Gowal et al., 2020) | 91.10 | 66.02 |
| (Rebuffi et al., 2021) | 92.23 | 68.56 |
| (Gowal et al., 2020) | 85.29 | 59.57 |
| (Rebuffi et al., 2021) | 88.54 | 64.46 |
| (Gowal et al., 2021) | 88.74 | 66.60 |
| Ours | 90.07±0.97 | 71.29±0.55 |

Table 2. Standard accuracy and robust accuracy against AutoAttack ℓ2 (ϵ = 0.5) on CIFAR-10, obtained by different classifier architectures. In our method, the diffusion timestep is t* = 0.075. (Methods marked use WideResNet-34-10, with the same width but more layers than the default one.)

| Method | Standard Acc | Robust Acc |
|---|---|---|
| **WideResNet-28-10** | | |
| (Augustin et al., 2020) | 92.23 | 77.93 |
| (Rony et al., 2019) | 89.05 | 66.41 |
| (Ding et al., 2020) | 88.02 | 67.77 |
| (Wu et al., 2020) | 88.51 | 72.85 |
| (Sehwag et al., 2021) | 90.31 | 75.39 |
| (Rebuffi et al., 2021) | 91.79 | 78.32 |
| Ours | 91.03±0.35 | 78.58±0.40 |
| **WideResNet-70-16** | | |
| (Gowal et al., 2020) | 94.74 | 79.88 |
| (Rebuffi et al., 2021) | 95.74 | 81.44 |
| (Gowal et al., 2020) | 90.90 | 74.03 |
| (Rebuffi et al., 2021) | 92.41 | 80.86 |
| Ours | 92.68±0.56 | 80.60±0.57 |

Table 3. Standard accuracy and robust accuracy against AutoAttack ℓ∞ (ϵ = 4/255) on ImageNet, obtained by different classifier architectures. In our method, the diffusion timestep is t* = 0.15. (Robust accuracy is directly reported from the respective paper.)

| Method | Standard Acc | Robust Acc |
|---|---|---|
| **ResNet-50** | | |
| (Engstrom et al., 2019) | 62.56 | 31.06 |
| (Wong et al., 2020) | 55.62 | 26.95 |
| (Salman et al., 2020) | 64.02 | 37.89 |
| (Bai et al., 2021) | 67.38 | 35.51 |
| Ours | 67.79±0.43 | 40.93±1.96 |
| **WideResNet-50-2** | | |
| (Salman et al., 2020) | 68.46 | 39.25 |
| Ours | 71.16±0.75 | 44.39±0.95 |
| **DeiT-S** | | |
| (Bai et al., 2021) | 66.50 | 35.50 |
| Ours | 73.63±0.62 | 43.18±1.27 |

### 5.2. Comparison with the State-of-the-Art

We first compare DiffPure with the state-of-the-art adversarial training methods reported by RobustBench (Croce et al., 2020), against the ℓ∞ and ℓ2 threat models, respectively.

CIFAR-10. Table 1 shows the robustness performance against the ℓ∞ threat model (ϵ = 8/255) with AutoAttack on CIFAR-10. We can see that our method achieves both better standard accuracy and better robust accuracy than previous state-of-the-art methods that do not use extra data, across different classifier architectures.
Specifically, our method improves robust accuracy by 5.44% on WideResNet-28-10 and by 4.69% on WideResNet-70-16, respectively. Furthermore, our method even largely outperforms baselines trained with extra data in robust accuracy, with comparable standard accuracy across classifiers.

Table 2 shows the robustness performance against the ℓ2 threat model (ϵ = 0.5) with AutoAttack on CIFAR-10. We can see that our method outperforms most defense methods without using extra data while being on par with the best performing method (Rebuffi et al., 2021) in both standard and robust accuracy. A gap remains between our method and (Rebuffi et al., 2021) trained with extra data, but it can be closed by replacing the standard classifier in our method with an adversarially trained one, as shown in Figure 4.

These results demonstrate the effectiveness of our method in defending against ℓ∞ and ℓ2 threat models on CIFAR-10. It is worth noting that, in contrast to the competing methods that are trained for the specific ℓp-norm attack used in evaluation, our method is agnostic to the threat model.

ImageNet. Table 3 shows the robustness performance against the ℓ∞ threat model (ϵ = 4/255) with AutoAttack on ImageNet. We evaluate our method on two CNN architectures, ResNet-50 and WideResNet-50-2, and one ViT architecture, DeiT-S (Touvron et al., 2021). We can see that our method largely outperforms the state-of-the-art baselines in both standard and robust accuracy. Moreover, the advantages of our method over the baselines become more significant on the ViT architecture. Specifically, our method improves robust accuracy by 3.04% and 5.14% on ResNet-50 and WideResNet-50-2, respectively, and by 7.68% on DeiT-S. For standard accuracy on DeiT-S, our method also largely improves over the baseline, by 7.13%. These results clearly demonstrate the effectiveness of our method in defending against ℓ∞ threat models on ImageNet. Note that for the adversarial training baselines, the training recipes for CNNs cannot be directly applied to ViTs due to the over-regularization issue (Bai et al., 2021), whereas our method is agnostic to the classifier architecture.

Table 4. Standard accuracy and robust accuracies against unseen threat models on ResNet-50 for CIFAR-10. We keep the same evaluation settings as (Laidlaw et al., 2021), where the attack bounds are ϵ = 8/255 for AutoAttack ℓ∞, ϵ = 1 for AutoAttack ℓ2, and ϵ = 0.05 for StAdv. The baseline results are reported from the respective papers. For our method, the diffusion timestep is t* = 0.125.

| Method | Standard Acc | Robust Acc ℓ∞ | Robust Acc ℓ2 | Robust Acc StAdv |
|---|---|---|---|---|
| Adv. Training with ℓ∞ (Laidlaw et al., 2021) | 86.8 | 49.0 | 19.2 | 4.8 |
| Adv. Training with ℓ2 (Laidlaw et al., 2021) | 85.0 | 39.5 | 47.8 | 7.8 |
| Adv. Training with StAdv (Laidlaw et al., 2021) | 86.2 | 0.1 | 0.2 | 53.9 |
| PAT-self (Laidlaw et al., 2021) | 82.4 | 30.2 | 34.9 | 46.4 |
| ADV. CRAIG (Dolatabadi et al., 2021) | 83.2 | 40.0 | 33.9 | 49.6 |
| ADV. GRADMATCH (Dolatabadi et al., 2021) | 83.1 | 39.2 | 34.1 | 48.9 |
| Ours | 88.2±0.8 | 70.0±1.2 | 70.9±0.6 | 55.0±0.7 |

Table 5. Comparison with other adversarial purification methods using the BPDA+EOT attack with ℓ∞ perturbations. (a) We evaluate on the eyeglasses attribute classifier for CelebA-HQ, where ϵ = 16/255. See Appendix C.3 for similar results on the smiling attribute. Note that OPT and ENC denote the optimization-based and encoder-based GAN inversions, respectively, and ENC+OPT implies a combination of OPT and ENC. (b) We evaluate on WideResNet-28-10 for CIFAR-10, and keep the experimental settings the same as (Hill et al., 2021), where ϵ = 8/255. (The purification is actually a variant of LD sampling.)
(a) CelebA-HQ

| Method | Purification | Standard Acc | Robust Acc |
|---|---|---|---|
| (Vahdat & Kautz, 2020) | VAE | 99.43 | 0.00 |
| (Karras et al., 2020) | GAN+OPT | 97.76 | 10.80 |
| (Chai et al., 2021) | GAN+ENC+OPT | 99.37 | 26.37 |
| (Richardson et al., 2021) | GAN+ENC | 93.95 | 75.00 |
| Ours (t* = 0.4) | Diffusion | 93.87±0.18 | 89.47±1.18 |
| Ours (t* = 0.5) | Diffusion | 93.77±0.30 | 90.63±1.10 |

(b) CIFAR-10

| Method | Purification | Standard Acc | Robust Acc |
|---|---|---|---|
| (Song et al., 2018) | Gibbs Update | 95.00 | 9.00 |
| (Yang et al., 2019) | Mask+Recon. | 94.00 | 15.00 |
| (Hill et al., 2021) | EBM+LD | 84.12 | 54.90 |
| (Yoon et al., 2021) | DSM+LD | 86.14 | 70.01 |
| Ours (t* = 0.075) | Diffusion | 91.03±0.35 | 77.43±0.19 |
| Ours (t* = 0.1) | Diffusion | 89.02±0.21 | 81.40±0.16 |

### 5.3. Defense Against Unseen Threats

The main drawback of the adversarial training baselines is their poor generalization to unseen attacks: even if models are robust against a specific threat model, they remain fragile against other threat models. To see this, we evaluate each method with three attacks, ℓ∞, ℓ2 and StAdv, as shown in Table 4. Note that for the plain adversarial training methods with a specific attack objective (e.g., Adv. Training with ℓ∞), only the other threat models (e.g., ℓ2 and StAdv) are considered unseen. We thus mark the seen threats in gray. We can see that our method is robust to all three unseen threat models, while the performance of the plain adversarial training baselines drops significantly against unseen attacks. Compared with the state-of-the-art defense methods against unseen threat models (Laidlaw et al., 2021; Dolatabadi et al., 2021), our method achieves significantly better standard accuracy and robust accuracies across all three attacks. For instance, the robust accuracy of our method improves over the previously best performance by +30%, +36% and +5.4% on ℓ∞, ℓ2 and StAdv, respectively.

### 5.4. Comparison with Other Purification Methods

Because most prior adversarial purification methods have an optimization or sampling loop in their defense process (Hill et al., 2021), they cannot be evaluated directly with the strongest white-box adaptive attacks, such as AutoAttack. To this end, we use the BPDA+EOT attack (Tramer et al., 2020; Hill et al., 2021), an adaptive attack designed specifically for purification methods (with stochasticity), to evaluate our method and the baselines for a fair comparison.

CelebA-HQ. We compare with other strong generative models, such as NVAE (Vahdat & Kautz, 2020) and StyleGAN2 (Karras et al., 2020), that can be used to purify adversarial examples. The basic idea is to first encode adversarial images into latent codes, from which purified images are synthesized by the decoder (see Appendix B.3 for implementation details). We choose CelebA-HQ for this comparison because both models perform well on it. In Table 5a, we use the eyeglasses attribute to show that our method has much better robust accuracy (+15.63%) than the best performing baseline while also maintaining a relatively high standard accuracy. We defer similar results on the smiling attribute to Appendix C.3. These results demonstrate the superior performance of diffusion models over other generative models as purification models for adversarial robustness.
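Both comparisons in this subsection rely on the BPDA+EOT attack, so a minimal sketch of its gradient estimate is given below: the purifier is approximated by the identity on the backward pass (BPDA), and the gradient is averaged over the purifier's randomness (EOT). Here `purify`, `classifier`, and the number of EOT samples are placeholders, not the exact settings of (Hill et al., 2021).

```python
import torch
import torch.nn.functional as F

def bpda_eot_gradient(x_adv, y, purify, classifier, n_eot=20):
    """BPDA+EOT sketch: run the purifier forward without tracking gradients,
    backpropagate through the classifier only (the purifier's Jacobian is
    approximated by the identity), and average over the purifier's randomness."""
    grad = torch.zeros_like(x_adv)
    for _ in range(n_eot):
        with torch.no_grad():
            x_pur = purify(x_adv)                     # forward pass through the purifier
        x_pur = x_pur.clone().requires_grad_(True)
        loss = F.cross_entropy(classifier(x_pur), y)
        grad += torch.autograd.grad(loss, x_pur)[0]   # BPDA: treat purify as identity
    return grad / n_eot
```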
CIFAR-10. In Table 5b, we compare our method with other adversarial purification methods on CIFAR-10, where the methods based on LD sampling for purification are the state-of-the-art (Hill et al., 2021; Yoon et al., 2021). We observe that our method largely outperforms previous methods against the BPDA+EOT attack, with an absolute improvement of at least +11.31% in robust accuracy. Meanwhile, we can slightly trade off robust accuracy for better standard accuracy by decreasing t*, making it comparable to the best reported standard accuracy (i.e., 91.03% vs. 95.00%). These results show that our method becomes the new state-of-the-art in adversarial purification.

[Figure 3: standard accuracy and robust accuracies (ℓ∞, ℓ2, StAdv) plotted against the diffusion timestep t*, for t* from 0.06 to 0.18.]

Figure 3. Impact of the diffusion timestep t* in our method on standard accuracy and robust accuracies against the AutoAttack ℓ∞ (ϵ = 8/255), ℓ2 (ϵ = 0.5) and StAdv (ϵ = 0.05) threat models, respectively, where we evaluate on WideResNet-28-10 for CIFAR-10.

### 5.5. Ablation Studies

Impact of the diffusion timestep t*. We first show how the diffusion timestep t* affects the robustness performance of our method against different threat models in Figure 3. We can see that (i) the standard accuracy monotonically decreases with t*, since more label semantics are lost with a larger diffusion timestep, and (ii) all the robust accuracies first increase and then decrease as t* becomes larger, due to the trade-off discussed in Section 3.1. Notably, the optimal timestep t* for the best robust accuracy remains small but varies across threat models (e.g., ℓ∞: t* = 0.075, ℓ2: t* = 0.1, and StAdv: t* = 0.15). Since stronger perturbations need a larger diffusion timestep to be smoothed, this implies that StAdv (ϵ = 0.05) perturbs the input images the most while ℓ2 (ϵ = 0.5) does the least.

Impact of the sampling strategy. Given the pre-trained diffusion models, besides relying on the VP-SDE, there are other ways of recovering clean images from adversarial examples. Here we consider two other sampling strategies: (i) LD-SDE (i.e., an SDE formulation of LD sampling that samples from an EBM, formed by our score function at t = 0, instead of introducing forward and reverse diffusion processes), and (ii) VP-ODE (i.e., an equivalent ODE sampling derived from the VP-SDE that solves the reverse generative process using the probability flow ODE (Song et al., 2021b) while keeping the forward diffusion unchanged). Please see Appendix B.4 for more details about these sampling variants.

Table 6. We compare different sampling strategies by evaluating on WideResNet-28-10 with AutoAttack ℓ∞ (ϵ = 8/255) for CIFAR-10. We use t* = 0.1 for both VP-ODE and VP-SDE (Ours), while using the best hyperparameters after a grid search for LD-SDE.

| Sampling | Standard Acc | Robust Acc |
|---|---|---|
| LD-SDE | 87.36±0.09 | 38.54±1.55 |
| VP-ODE | 90.79±0.12 | 39.86±0.98 |
| VP-SDE (Ours) | 89.02±0.21 | 70.64±0.39 |

[Figure 4: comparison of AdvTrain (w/ extra data), Ours (w/o extra data), and AdvTrain + Ours.]

Figure 4. Combination of our method with adversarial training, where we evaluate on WideResNet-70-16 for CIFAR-10 with the AutoAttack ℓ∞ (ϵ = 8/255) and ℓ2 (ϵ = 0.5) threat models, respectively. Regarding adversarial training, we use the model in (Rebuffi et al., 2021) that is adversarially trained with extra data.

In Table 6, we compare the different sampling strategies with the same diffusion model.
Although each sampling strategy has comparable standard accuracy, our method achieves significantly better robust accuracy. To explain this, we hypothesize that (i) LD sampling only uses the score function of clean images at timestep t = 0, making it less robust to noisy (or perturbed) input images, while our method considers score functions at various noise levels, and (ii) the ODE sampling introduces much less randomness into the defense model, due to its deterministic trajectories, and is thus more vulnerable to adaptive attacks from the randomized smoothing perspective (Cohen et al., 2019; Pinot et al., 2020). Inspired by this observation, we study adding more stochasticity via a randomized diffusion timestep t* in Appendix C.4, where we find that more randomness in the purification method does not always lead to better robustness, implying the significance of introducing proper randomness in a principled way through the forward and reverse processes.

Combination with adversarial training. Since our proposed DiffPure is an orthogonal defense method to adversarial training, we can also combine our method with adversarial training (i.e., feeding the purified images from our method to adversarially trained classifiers). Figure 4 shows that this combination (i.e., "AdvTrain + Ours") can improve the robust accuracies against the AutoAttack ℓ∞ and ℓ2 threat models, respectively. Besides, by comparing the results against the ℓ∞ and ℓ2 threat models, the improvement of the combination over our method with the standard classifier (i.e., "Ours") becomes more significant when the adversarial training method with extra data (i.e., "AdvTrain") is already on par with our method. Therefore, we can apply our method to pre-existing adversarially trained classifiers to further improve performance.

## 6. Conclusions

We proposed a new defense method, called DiffPure, that applies diffusion models to purify adversarial examples before feeding them into classifiers. We also applied the adjoint method to compute full gradients of the SDE solver for evaluation with strong white-box adaptive attacks. To show the robustness performance of our method, we conducted extensive experiments on CIFAR-10, ImageNet and CelebA-HQ with different classifier architectures, including ResNet, WideResNet and ViT, to compare with the state-of-the-art adversarial training and adversarial purification methods. In defending against various strong adaptive attacks such as AutoAttack, StAdv and BPDA+EOT, our method largely outperforms previous approaches. Despite the large improvements, our method has two major limitations: (i) the purification process takes a long time (proportional to the diffusion timestep, see Appendix C.6), making our method inapplicable to real-time tasks, and (ii) diffusion models are sensitive to image colors, making our method incapable of defending against color-related corruptions. It would be interesting to either apply recent works on accelerating diffusion models or design new diffusion models specifically for model robustness to overcome these two limitations.

## Acknowledgement

We would like to thank the AIALGO team at NVIDIA and Anima Anandkumar's research group at Caltech for reading the paper and providing fruitful suggestions. We also thank the anonymous reviewers for their helpful comments.

## References

Athalye, A., Carlini, N., and Wagner, D.
Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International conference on machine learning, pp. 274 283. PMLR, 2018. Augustin, M., Meinke, A., and Hein, M. Adversarial robust- ness on in-and out-distribution improves explainability. In European Conference on Computer Vision, pp. 228 245. Springer, 2020. Bai, Y., Mei, J., Yuille, A., and Xie, C. Are transformers more robust than cnns? In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. Barron, A. R. Entropy and the central limit theorem. The Annals of probability, pp. 336 342, 1986. Boucheron, S., Lugosi, G., and Massart, P. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013. Chai, L., Zhu, J.-Y., Shechtman, E., Isola, P., and Zhang, R. Ensembling with deep generative views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021. Choi, J., Kim, S., Jeong, Y., Gwon, Y., and Yoon, S. Ilvr: Conditioning method for denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14367 14376, 2021. Cohen, J., Rosenfeld, E., and Kolter, Z. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, 2019. Croce, F. and Hein, M. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In ICML, 2020. Croce, F., Andriushchenko, M., Sehwag, V., Debenedetti, E., Flammarion, N., Chiang, M., Mittal, P., and Hein, M. Robustbench: a standardized adversarial robustness benchmark. ar Xiv preprint ar Xiv:2010.09670, 2020. Croce, F., Gowal, S., Brunner, T., Shelhamer, E., Hein, M., and Cemgil, T. Evaluating the adversarial robustness of adaptive test-time defenses. ar Xiv preprint ar Xiv:2202.13711, 2022. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248 255. Ieee, 2009. Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. In Neural Information Processing Systems (Neur IPS), 2021. Ding, G. W., Sharma, Y., Lui, K. Y. C., and Huang, R. Mma training: Direct input space margin maximization through adversarial training. In International Conference on Learning Representations, 2020. Diffusion Models for Adversarial Purification Dolatabadi, H. M., Erfani, S., and Leckie, C. ℓ -robustness and beyond: Unleashing efficient adversarial training. ar Xiv preprint ar Xiv:2112.00378, 2021. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. Du, Y. and Mordatch, I. Implicit generation and modeling with energy based models. Advances in Neural Information Processing Systems, 2019. Engstrom, L., Ilyas, A., Salman, H., Santurkar, S., and Tsipras, D. Robustness (python library), 2019. URL https://github.com/Madry Lab/ robustness. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. Advances in neural information processing systems, 2014. Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. 
Gowal, S., Qin, C., Uesato, J., Mann, T., and Kohli, P. Uncovering the limits of adversarial training against norm-bounded adversarial examples. ar Xiv preprint ar Xiv:2010.03593, 2020. Gowal, S., Rebuffi, S.-A., Wiles, O., Stimberg, F., Calian, D. A., and Mann, T. A. Improving robustness using generated data. Advances in Neural Information Processing Systems, 34, 2021. Grathwohl, W., Wang, K.-C., Jacobsen, J.-H., Duvenaud, D., Norouzi, M., and Swersky, K. Your classifier is secretly an energy based model and you should treat it like one. In International Conference on Learning Representations, 2020. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016. He, Z., Rakin, A. S., and Fan, D. Parametric noise injection: Trainable randomness to improve deep neural network robustness against adversarial attack. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 588 597, 2019. Hill, M., Mitchell, J. C., and Zhu, S.-C. Stochastic security: Adversarial defense using long-run dynamics of energybased models. In International Conference on Learning Representations, 2021. Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Neural Information Processing Systems (Neur IPS), 2020. Kang, Q., Song, Y., Ding, Q., and Tay, W. P. Stable neural ode with lyapunov-stable equilibrium points for defending against adversarial attacks. Neural Information Processing Systems (Neur IPS), 2021. Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations, 2018. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. Kim, G. and Ye, J. C. Diffusionclip: Text-guided image manipulation using diffusion models. ar Xiv preprint ar Xiv:2110.02711, 2021. Kingma, D. P., Salimans, T., Poole, B., and Ho, J. Variational diffusion models. Advances in Neural Information Processing Systems, 2021. Kloeden, P. E. and Platen, E. Stochastic differential equations. In Numerical Solution of Stochastic Differential Equations, pp. 103 160. Springer, 1992. Krizhevsky, A. Learning multiple layers of features from tiny images. (Technical Report) University of Toronto., 2009. Laidlaw, C., Singla, S., and Feizi, S. Perceptual adversarial robustness: Defense against unseen threat models. In International Conference on Learning Representations, 2021. Le Cun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006. Li, X., Wong, T.-K. L., Chen, R. T., and Duvenaud, D. Scalable gradients for stochastic differential equations. In International Conference on Artificial Intelligence and Statistics, pp. 3870 3882. PMLR, 2020. Lyu, S. Interpretation and generalization of score matching. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 359 366, 2009. Diffusion Models for Adversarial Purification Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. Meng, C., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. 
Sdedit: Image synthesis and editing with stochastic differential equations, 2021. Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., Mc Grew, B., Sutskever, I., and Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. ar Xiv preprint ar Xiv:2112.10741, 2021. Pinot, R., Ettedgui, R., Rizk, G., Chevaleyre, Y., and Atif, J. Randomization matters how to defend against strong adversarial attacks. In International Conference on Machine Learning, 2020. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. ar Xiv preprint ar Xiv:2204.06125, 2022. Rebuffi, S.-A., Gowal, S., Calian, D. A., Stimberg, F., Wiles, O., and Mann, T. Fixing data augmentation to improve adversarial robustness. ar Xiv preprint ar Xiv:2103.01946, 2021. Richardson, E., Alaluf, Y., Patashnik, O., Nitzan, Y., Azar, Y., Shapiro, S., and Cohen-Or, D. Encoding in style: a stylegan encoder for image-to-image translation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. Rony, J., Hafemann, L. G., Oliveira, L. S., Ayed, I. B., Sabourin, R., and Granger, E. Decoupling direction and norm for efficient gradient-based l2 adversarial attacks and defenses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4322 4330, 2019. Saharia, C., Chan, W., Chang, H., Lee, C. A., Ho, J., Salimans, T., Fleet, D. J., and Norouzi, M. Palette: Image-to-image diffusion models. ar Xiv preprint ar Xiv:2111.05826, 2021. Salman, H., Ilyas, A., Engstrom, L., Kapoor, A., and Madry, A. Do adversarially robust imagenet models transfer better? In Advances in Neural Information Processing Systems, 2020. Samangouei, P., Kabkab, M., and Chellappa, R. Defensegan: Protecting classifiers against adversarial attacks using generative models. In International Conference on Learning Representations, 2018. S arkk a, S. and Solin, A. Applied stochastic differential equations, volume 10. Cambridge University Press, 2019. Sehwag, V., Mahloujifar, S., Handina, T., Dai, S., Xiang, C., Chiang, M., and Mittal, P. Robust learning meets generative models: Can proxy distributions improve adversarial robustness? ar Xiv preprint ar Xiv:2104.09425, 2021. Shi, C., Holtz, C., and Mishne, G. Online adversarial purification based on self-supervised learning. In International Conference on Learning Representations, 2021. Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, 2019. Song, Y., Kim, T., Nowozin, S., Ermon, S., and Kushman, N. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. In International Conference on Learning Representations, 2018. Song, Y., Durkan, C., Murray, I., and Ermon, S. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 2021a. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jegou, H. 
Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 2021.

Tramer, F., Carlini, N., Brendel, W., and Madry, A. On adaptive attacks to adversarial example defenses. Advances in Neural Information Processing Systems, 33, 2020.

Vahdat, A. and Kautz, J. NVAE: A deep hierarchical variational autoencoder. In Neural Information Processing Systems (NeurIPS), 2020.

Vahdat, A., Kreis, K., and Kautz, J. Score-based generative modeling in latent space. In Neural Information Processing Systems (NeurIPS), 2021.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661-1674, 2011.

Wong, E., Rice, L., and Kolter, J. Z. Fast is better than free: Revisiting adversarial training. In International Conference on Learning Representations, 2020.

Wu, D., Xia, S.-T., and Wang, Y. Adversarial weight perturbation helps robust generalization. arXiv preprint arXiv:2004.05884, 2020.

Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.

Yang, Y., Zhang, G., Katabi, D., and Xu, Z. ME-Net: Towards effective adversarial robustness with matrix estimation. In International Conference on Machine Learning, 2019.

Yoon, J., Hwang, S. J., and Lee, J. Adversarial purification with score-based generative models. In International Conference on Machine Learning, 2021.

Zagoruyko, S. and Komodakis, N. Wide residual networks. In British Machine Vision Conference, 2016.

Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., and Jordan, M. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pp. 7472-7482. PMLR, 2019.

Zhang, J., Zhu, J., Niu, G., Han, B., Sugiyama, M., and Kankanhalli, M. Geometry-aware instance-reweighted adversarial training. In International Conference on Learning Representations, 2020.

## A. Proofs in Section 3

### A.1. Proof of Theorem 3.1

Theorem A.1. Let {x(t)}_{t∈[0,1]} be the diffusion process defined by the forward SDE in Eq. (1). If we denote by p_t and q_t the respective distributions of x(t) when x(0) ∼ p(x) (i.e., the clean data distribution) and x(0) ∼ q(x) (i.e., the adversarial sample distribution), we then have ∂D_KL(p_t‖q_t)/∂t ≤ 0, where the equality holds only when p_t = q_t. That is, the KL divergence between p_t and q_t monotonically decreases when moving from t=0 to t=1 through the forward SDE.

Proof: The proof follows (Song et al., 2021a). First, the Fokker-Planck equation (Särkkä & Solin, 2019) for the forward SDE in Eq. (1) is given by

$$\frac{\partial p_t(x)}{\partial t} = \nabla_x \cdot \Big( -f(x,t) p_t(x) + \tfrac{1}{2} g^2(t) \nabla_x p_t(x) \Big) = \nabla_x \cdot \Big( p_t(x)\big[ \tfrac{1}{2} g^2(t) \nabla_x \log p_t(x) - f(x,t) \big] \Big) = \nabla_x \cdot \big( h_p(x,t)\, p_t(x) \big) \qquad (8)$$

where we define h_p(x, t) := (1/2)g²(t)∇_x log p_t(x) − f(x, t). Then, if we assume p_t(x) and q_t(x) are smooth and fast decaying, i.e.,

$$\lim_{x_i \to \infty} p_t(x)\, \partial_{x_i} \log p_t(x) = 0 \quad \text{and} \quad \lim_{x_i \to \infty} q_t(x)\, \partial_{x_i} \log q_t(x) = 0, \qquad (9)$$

for any i = 1, ..., d, we can evaluate

$$\begin{aligned}
\frac{\partial D_{\mathrm{KL}}(p_t \| q_t)}{\partial t} &= \int \frac{\partial p_t(x)}{\partial t} \log \frac{p_t(x)}{q_t(x)}\, dx + \underbrace{\int \frac{\partial p_t(x)}{\partial t}\, dx}_{=0} - \int \frac{p_t(x)}{q_t(x)} \frac{\partial q_t(x)}{\partial t}\, dx \\
&\overset{(a)}{=} \int \nabla_x \cdot \big( h_p(x,t) p_t(x) \big) \log \frac{p_t(x)}{q_t(x)}\, dx - \int \frac{p_t(x)}{q_t(x)} \nabla_x \cdot \big( h_q(x,t) q_t(x) \big)\, dx \\
&\overset{(b)}{=} -\int p_t(x) \big[ h_p(x,t) - h_q(x,t) \big]^\top \big[ \nabla_x \log p_t(x) - \nabla_x \log q_t(x) \big]\, dx \\
&\overset{(c)}{=} -\tfrac{1}{2} g^2(t) \int p_t(x) \big\| \nabla_x \log p_t(x) - \nabla_x \log q_t(x) \big\|_2^2\, dx = -\tfrac{1}{2} g^2(t)\, D_F(p_t \| q_t)
\end{aligned}$$

where (a) follows by plugging in Eq. (8), (b) follows from integration by parts and the assumption in Eq.
(9), and (c) follows from the definition of the Fisher divergence D_F(p_t‖q_t) := ∫ p_t(x)‖∇_x log p_t(x) − ∇_x log q_t(x)‖²₂ dx. Since g²(t) > 0, and the Fisher divergence satisfies D_F(p_t‖q_t) ≥ 0 with D_F(p_t‖q_t) = 0 if and only if p_t = q_t, we have ∂D_KL(p_t‖q_t)/∂t ≤ 0, where the equality holds only when p_t = q_t.

### A.2. Proof of Theorem 3.2

Theorem A.2. Assume the score function satisfies ‖s_θ(x, t)‖ ≤ (1/2)C_s. Then the L2 distance between the clean data x and the purified data x̂(0) given by Eq. (4) satisfies, with probability at least 1 − δ,

$$\|\hat{x}(0) - x\| \le \|\epsilon_a\| + \sqrt{e^{2\gamma(t^*)} - 1}\, C_\delta + \gamma(t^*)\, C_s \qquad (10)$$

where γ(t*) := ∫_0^{t*} (1/2)β(s) ds and the constant C_δ := √(2d + 4√(d log(1/δ)) + 4 log(1/δ)).

Proof: Denote by ϵ_a the adversarial perturbation; we have the adversarial example x_a = x + ϵ_a, where x represents the clean image. Because the diffused adversarial example x(t*) from the forward diffusion process satisfies

$$x(t^*) = \sqrt{\alpha(t^*)}\, x_a + \sqrt{1 - \alpha(t^*)}\, \epsilon_1 \qquad (11)$$

where α(t) = e^{−∫_0^t β(s)ds} and ϵ_1 ∼ N(0, I_d), the L2 distance between the clean data x and the purified data x̂(0) can be bounded as

$$\begin{aligned}
\|\hat{x}(0) - x\| &= \Big\| x(t^*) + \big(\hat{x}(0) - x(t^*)\big) - x \Big\| \\
&= \Big\| x(t^*) + \int_{t^*}^{0} -\tfrac{1}{2}\beta(t)\big[x + 2 s_\theta(x, t)\big]\, dt + \int_{t^*}^{0} \sqrt{\beta(t)}\, d\bar{w} - x \Big\| \\
&\le \underbrace{\Big\| x(t^*) + \int_{t^*}^{0} -\tfrac{1}{2}\beta(t)\, x\, dt + \int_{t^*}^{0} \sqrt{\beta(t)}\, d\bar{w} - x \Big\|}_{\text{integration of the linear SDE}} + \Big\| \int_{t^*}^{0} \beta(t)\, s_\theta(x, t)\, dt \Big\| \qquad (12)
\end{aligned}$$

where the second equality follows from the integration of the reverse-time SDE defined in Eq. (4), and in the last line we have separated the integration of the linear SDE from the non-linear part involving the score function s_θ(x, t) by using the triangle inequality.

The above linear SDE is a time-varying Ornstein-Uhlenbeck process with a negative time increment that starts at t=t* and ends at t=0, with the initial value set to x(t*). Denote by x′(0) its solution; from (Särkkä & Solin, 2019) we know x′(0) follows a Gaussian distribution, where its mean μ(t) and covariance matrix Σ(t) are the solutions of the following two differential equations, respectively:

$$\frac{d\mu}{dt} = -\tfrac{1}{2}\beta(t)\mu \qquad (13)$$

$$\frac{d\Sigma}{dt} = -\beta(t)\Sigma - \beta(t) I_d \qquad (14)$$

with the initial conditions μ(t*) = x(t*) and Σ(t*) = 0. By solving these two differential equations, we have that, conditioned on x(t*), x′(0) ∼ N(e^{γ(t*)}x(t*), (e^{2γ(t*)} − 1)I_d), where γ(t*) := ∫_0^{t*} (1/2)β(s) ds. Using the reparameterization trick, we have:

$$\begin{aligned}
x'(0) - x &= e^{\gamma(t^*)} x(t^*) + \sqrt{e^{2\gamma(t^*)} - 1}\, \epsilon_2 - x \\
&= e^{\gamma(t^*)} \Big[ e^{-\gamma(t^*)}(x + \epsilon_a) + \sqrt{1 - e^{-2\gamma(t^*)}}\, \epsilon_1 \Big] + \sqrt{e^{2\gamma(t^*)} - 1}\, \epsilon_2 - x \\
&= \sqrt{e^{2\gamma(t^*)} - 1}\, (\epsilon_1 + \epsilon_2) + \epsilon_a \qquad (15)
\end{aligned}$$

where the second equality follows by substituting Eq. (11). Since ϵ_1 and ϵ_2 are independent, the first term can be represented as a single zero-mean normal variable with variance 2(e^{2γ(t*)} − 1). Assuming that the norm of the score function s_θ(x, t) is bounded by the constant (1/2)C_s and ϵ ∼ N(0, I_d), we have:

$$\|\hat{x}(0) - x\| \le \sqrt{2\big(e^{2\gamma(t^*)} - 1\big)}\, \|\epsilon\| + \|\epsilon_a\| + \gamma(t^*) C_s \qquad (16)$$

Since ‖ϵ‖² ∼ χ²(d), from the concentration inequality (Boucheron et al., 2013), we have

$$\Pr\Big( \|\epsilon\|^2 \ge d + 2\sqrt{d\sigma} + 2\sigma \Big) \le e^{-\sigma} \qquad (17)$$

Letting e^{−σ} = δ, we get

$$\|\epsilon\| \le \sqrt{ d + 2\sqrt{d \log(1/\delta)} + 2 \log(1/\delta) } \qquad (18)$$

Therefore, with probability at least 1 − δ, we have

$$\|\hat{x}(0) - x\| \le \|\epsilon_a\| + \sqrt{e^{2\gamma(t^*)} - 1}\, C_\delta + \gamma(t^*) C_s \qquad (19)$$

where the constant C_δ := √(2d + 4√(d log(1/δ)) + 4 log(1/δ)).

### A.3. Proof of Proposition 3.3

Proposition A.3. For the SDE in Eq. (4), the augmented SDE that computes the gradient ∂L/∂x(t*) of backpropagating through it is given by

$$\begin{bmatrix} x(t^*) \\ \frac{\partial L}{\partial x(t^*)} \end{bmatrix} = \mathrm{sdeint}\left( \begin{bmatrix} \hat{x}(0) \\ \frac{\partial L}{\partial \hat{x}(0)} \end{bmatrix}, \tilde{f}, \tilde{g}, \tilde{w}, 0, t^* \right) \qquad (20)$$

where ∂L/∂x̂(0) is the gradient of the objective L w.r.t. the output x̂(0) of the SDE in Eq. (4), and

$$\tilde{f}([x; z], t) = \begin{bmatrix} f_{\mathrm{rev}}(x, t) \\ -\frac{\partial f_{\mathrm{rev}}(x, t)}{\partial x} z \end{bmatrix}, \quad \tilde{g}(t) = \begin{bmatrix} g_{\mathrm{rev}}(t)\,\mathbf{1}_d \\ \mathbf{0}_d \end{bmatrix}, \quad \tilde{w}(t) = \begin{bmatrix} w(1-t) \\ w(1-t) \end{bmatrix}$$

with 1_d and 0_d representing the d-dimensional vectors of all ones and all zeros, respectively.

Proof: Before applying the adjoint method, we first transform the reverse-time SDE in Eq.
A.3. Proof of Proposition 3.3

Proposition A.3. For the SDE in Eq. (4), the augmented SDE that computes the gradient $\frac{\partial L}{\partial x(t^*)}$ of backpropagating through it is given by
$$\begin{bmatrix} x(t^*) \\ \frac{\partial L}{\partial x(t^*)} \end{bmatrix} = \mathrm{sdeint}\Big(\begin{bmatrix} \hat{x}(0) \\ \frac{\partial L}{\partial \hat{x}(0)} \end{bmatrix},\ \tilde{f},\ \tilde{g},\ \tilde{w},\ 0,\ t^*\Big),$$
where $\frac{\partial L}{\partial \hat{x}(0)}$ is the gradient of the objective $L$ w.r.t. the output $\hat{x}(0)$ of the SDE in Eq. (4), and
$$\tilde{f}([x;z], t) = \begin{bmatrix} f_{\mathrm{rev}}(x,t) \\ -\frac{\partial f_{\mathrm{rev}}(x,t)}{\partial x}^{\!\top} z \end{bmatrix}, \qquad \tilde{g}(t) = \begin{bmatrix} -g_{\mathrm{rev}}(t)\mathbf{1}_d \\ \mathbf{0}_d \end{bmatrix}, \qquad \tilde{w}(t) = \begin{bmatrix} -w(1-t) \\ -w(1-t) \end{bmatrix},$$
with $\mathbf{1}_d$ and $\mathbf{0}_d$ representing the $d$-dimensional vectors of all ones and all zeros, respectively.

Proof: Before applying the adjoint method, we first transform the reverse-time SDE in Eq. (4) into a forward SDE by the change of variable $t' := 1 - t$, so that $t' \in [1-t^*, 1] \subset [0,1]$. With this, the equivalent forward SDE with positive time increments from $t'=1-t^*$ to $t'=1$ becomes
$$\hat{x}(0) = \mathrm{sdeint}\big(x(t^*),\ f_{\mathrm{fwd}},\ g_{\mathrm{fwd}},\ w,\ 1-t^*,\ 1\big), \quad (21)$$
where the drift and diffusion coefficients are
$$f_{\mathrm{fwd}}(x, t) = -f_{\mathrm{rev}}(x, 1-t), \qquad g_{\mathrm{fwd}}(t) = g_{\mathrm{rev}}(1-t).$$
Following the stochastic adjoint method proposed in (Li et al., 2020), the augmented SDE that computes the gradient $\frac{\partial L}{\partial x(t^*)}$ of the objective $L$ w.r.t. the input $x(t^*)$ of the SDE in Eq. (21) is given by
$$\begin{bmatrix} x(t^*) \\ \frac{\partial L}{\partial x(t^*)} \end{bmatrix} = \mathrm{sdeint}\Big(\begin{bmatrix} \hat{x}(0) \\ \frac{\partial L}{\partial \hat{x}(0)} \end{bmatrix},\ \tilde{f}_{\mathrm{fwd}},\ \tilde{g}_{\mathrm{fwd}},\ \tilde{w}_{\mathrm{fwd}},\ -1,\ t^*-1\Big), \quad (22)$$
where $\frac{\partial L}{\partial \hat{x}(0)}$ is the gradient of the objective $L$ w.r.t. the output $\hat{x}(0)$ of the SDE in Eq. (21), and the augmented drift coefficient $\tilde{f}_{\mathrm{fwd}}: \mathbb{R}^{2d}\times\mathbb{R} \to \mathbb{R}^{2d}$, the augmented diffusion coefficient $\tilde{g}_{\mathrm{fwd}}: \mathbb{R} \to \mathbb{R}^{2d}$ and the augmented Wiener process $\tilde{w}_{\mathrm{fwd}}(t) \in \mathbb{R}^{2d}$ are given by
$$\tilde{f}_{\mathrm{fwd}}([x;z], t) = \begin{bmatrix} -f_{\mathrm{fwd}}(x,-t) \\ \frac{\partial f_{\mathrm{fwd}}(x,-t)}{\partial x}^{\!\top} z \end{bmatrix} = \begin{bmatrix} f_{\mathrm{rev}}(x, 1+t) \\ -\frac{\partial f_{\mathrm{rev}}(x, 1+t)}{\partial x}^{\!\top} z \end{bmatrix},$$
$$\tilde{g}_{\mathrm{fwd}}(t) = \begin{bmatrix} -g_{\mathrm{fwd}}(-t)\mathbf{1}_d \\ \mathbf{0}_d \end{bmatrix} = \begin{bmatrix} -g_{\mathrm{rev}}(1+t)\mathbf{1}_d \\ \mathbf{0}_d \end{bmatrix}, \qquad \tilde{w}_{\mathrm{fwd}}(t) = \begin{bmatrix} -w(-t) \\ -w(-t) \end{bmatrix},$$
with $\mathbf{1}_d$ and $\mathbf{0}_d$ representing the $d$-dimensional vectors of all ones and all zeros, respectively. Note that the augmented SDE in Eq. (22) moves from $t = -1$ to $t = t^* - 1$. Finally, with the change of variable $t'' := 1 + t$, so that $t'' \in [0, t^*]$, we can rewrite the augmented SDE as
$$\begin{bmatrix} x(t^*) \\ \frac{\partial L}{\partial x(t^*)} \end{bmatrix} = \mathrm{sdeint}\Big(\begin{bmatrix} \hat{x}(0) \\ \frac{\partial L}{\partial \hat{x}(0)} \end{bmatrix},\ \tilde{f},\ \tilde{g},\ \tilde{w},\ 0,\ t^*\Big),$$
where
$$\tilde{f}([x;z], t) = \tilde{f}_{\mathrm{fwd}}([x;z], t-1) = \begin{bmatrix} f_{\mathrm{rev}}(x,t) \\ -\frac{\partial f_{\mathrm{rev}}(x,t)}{\partial x}^{\!\top} z \end{bmatrix}, \quad \tilde{g}(t) = \tilde{g}_{\mathrm{fwd}}(t-1) = \begin{bmatrix} -g_{\mathrm{rev}}(t)\mathbf{1}_d \\ \mathbf{0}_d \end{bmatrix}, \quad \tilde{w}(t) = \tilde{w}_{\mathrm{fwd}}(t-1) = \begin{bmatrix} -w(1-t) \\ -w(1-t) \end{bmatrix},$$
and $f_{\mathrm{rev}}$ and $g_{\mathrm{rev}}$ are given by Eq. (5).

B. More Details of Experimental Settings

B.1. Implementation Details of Our Method

First, our method requires solving two SDEs: the reverse-time denoising SDE in Eq. (4) to obtain purified images, and the augmented SDE in Eq. (6) to compute gradients through the SDE in Eq. (4). In experiments, we use the adjoint SDE solver sdeint_adjoint from the torchsde library (https://github.com/google-research/torchsde) for both adversarial purification and gradient evaluation. We use the simple Euler-Maruyama method to solve both SDEs with a fixed step size dt = $10^{-3}$. Ideally, the step size should be as small as possible so that the gradient computation has a vanishingly small numerical error. However, small step sizes come with a high computational cost due to the increased number of neural network evaluations. In a sanity check, we empirically observe that the robust accuracy of our method barely changes when we further reduce the step size from $10^{-3}$ to $10^{-4}$. Hence, we use a step size of $10^{-3}$ for all experiments to save time in the purification process through the SDE solver. Note that this step size is also commonly used in denoising diffusion models (Ho et al., 2020; Song et al., 2021b).

Second, our method requires pre-trained diffusion models. In experiments, we use a different pre-trained model for each of the three datasets: Score SDE (Song et al., 2021b) for CIFAR-10, Guided Diffusion (Dhariwal & Nichol, 2021) for ImageNet, and DDPM (Ho et al., 2020) for CelebA-HQ. Specifically, we use the vp/cifar10_ddpmpp_deep_continuous checkpoint from the score_sde library (https://github.com/yang-song/score_sde) for the CIFAR-10 experiments, and the 256x256 diffusion (unconditional) checkpoint from the guided-diffusion library (https://github.com/openai/guided-diffusion) for the ImageNet experiments. Finally, for the CelebA-HQ experiments, we use the CelebA-HQ checkpoint from the SDEdit library (https://github.com/ermongroup/SDEdit).
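For concreteness, here is a minimal sketch (our own, not the authors' released code) of how the two steps of DiffPure can be mapped onto torchsde's sdeint_adjoint: the reverse VP-SDE is rewritten in forward time s = t* - t so the solver integrates from 0 to t*, and gradients w.r.t. the diffused input (and the score network) are obtained with constant memory. The callable score_model(x, t) is an assumed interface for the pre-trained score network $s_\theta$ (assumed to be an nn.Module); batching details, conditioning and checkpoint loading are omitted.

```python
import torch
import torchsde

BETA_MIN, BETA_MAX = 0.1, 20.0                       # assumed linear VP-SDE schedule

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def alpha(t):                                        # alpha(t) = exp(-int_0^t beta(s) ds)
    return torch.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2))

class ReverseVPSDE(torch.nn.Module):
    # Reverse generative SDE of Eq. (4), rewritten in forward time s = t* - t
    # so that torchsde can integrate it from s=0 to s=t*.
    noise_type, sde_type = "diagonal", "ito"

    def __init__(self, score_model, img_shape, t_star):
        super().__init__()
        self.score_model, self.img_shape, self.t_star = score_model, img_shape, t_star

    def f(self, s, y):                               # drift
        t = self.t_star - s
        x = y.view(self.img_shape)
        drift = 0.5 * beta(t) * (x + 2.0 * self.score_model(x, t))
        return drift.flatten(1)

    def g(self, s, y):                               # diffusion sqrt(beta(t)), diagonal noise
        t = self.t_star - s
        return torch.sqrt(beta(t)) * torch.ones_like(y)

def diffpure(x_adv, score_model, t_star=0.1, dt=1e-3):
    # Step (i): forward diffusion, x(t*) = sqrt(alpha) x_a + sqrt(1 - alpha) eps
    a = alpha(torch.as_tensor(t_star))
    x_t = torch.sqrt(a) * x_adv + torch.sqrt(1.0 - a) * torch.randn_like(x_adv)
    # Step (ii): solve the reverse SDE from t* to 0; sdeint_adjoint backpropagates
    # to x(t*) and the score network with constant memory via the adjoint method.
    sde = ReverseVPSDE(score_model, x_adv.shape, t_star)
    ts = torch.tensor([0.0, t_star])
    y = torchsde.sdeint_adjoint(sde, x_t.flatten(1), ts, method="euler", dt=dt)
    return y[-1].view_as(x_adv)                      # purified image x_hat(0)
```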
B.2. Implementation Details of Adversarial Attacks

AutoAttack. We use AutoAttack to compare with the state-of-the-art adversarial training methods reported in the RobustBench benchmark. To make a fair comparison, we use their codebase (https://github.com/RobustBench/robustbench) with default hyperparameters for evaluation. Similarly, we set ϵ = 8/255 and ϵ = 0.5 for AutoAttack ℓ∞ and AutoAttack ℓ2, respectively, on CIFAR-10. For AutoAttack ℓ∞ on ImageNet, we set ϵ = 4/255. There are two versions of AutoAttack: (i) the STANDARD version, which contains four attacks (APGD-CE, APGD-T, FAB-T and Square) and is mainly used for evaluating deterministic defense methods, and (ii) the RAND version, which contains two attacks (APGD-CE and APGD-DLR) and is used for evaluating stochastic defense methods. Because our method is stochastic, we consider the RAND version and choose the default EOT=20 for both ℓ∞ and ℓ2, after searching for the minimum EOT at which the robust accuracy no longer decreases (see Figure 6). In practice, we find that in a few cases the STANDARD version actually mounts a stronger attack on our method (indicated by a lower robust accuracy) than the RAND version. Therefore, to measure the worst-case defense performance of our method, we run both the STANDARD and the RAND versions of AutoAttack and report the minimum robust accuracy of the two as our final robust accuracy.

StAdv. We use the StAdv attack to demonstrate that our method can defend against unseen threats beyond ℓp-norm attacks. We closely follow the codebase of PAT (Laidlaw et al., 2021) (https://github.com/cassidylaidlaw/perceptual-advex) with default hyperparameters for evaluation. Moreover, we add EOT to average out the stochasticity of gradients in our defense method. Similarly, we use EOT=20 by default for the StAdv attack, after searching for the minimum EOT at which the robust accuracy saturates (see Figure 6).

BPDA+EOT. For many adversarial purification methods that contain an optimization loop or non-differentiable operations, the BPDA attack is known as the strongest attack (Tramer et al., 2020). Taking stochastic defense methods into account, the BPDA+EOT attack has become the default choice when evaluating state-of-the-art adversarial purification methods (Hill et al., 2021; Yoon et al., 2021). We therefore use the BPDA+EOT implementation of (Hill et al., 2021) (https://github.com/point0bar1/ebm-defense) with default hyperparameters for evaluation.
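The snippet below is a minimal sketch (assumed callables, not the attack codebases above) of the two ingredients just described: a BPDA-style wrapper that approximates the purifier by the identity in the backward pass, and EOT averaging of input gradients over the stochastic defense. Here purify and classifier are placeholders for DiffPure and the downstream classifier; when full gradients of the reverse SDE are available via the adjoint method, the BPDA wrapper is unnecessary and defense can simply be classifier(purify(x)).

```python
import torch
import torch.nn.functional as F

class BPDAIdentity(torch.autograd.Function):
    # Backward-pass differentiable approximation: the forward pass runs the
    # purifier, while the backward pass pretends purify(x) = x.
    @staticmethod
    def forward(ctx, x, purify):
        with torch.no_grad():
            return purify(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None      # identity approximation of d purify / dx

def eot_gradient(x, y, defense, n_eot=20):
    # Average input gradients over n_eot stochastic runs of the defense.
    grad = torch.zeros_like(x)
    for _ in range(n_eot):
        x_in = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(defense(x_in), y)
        grad += torch.autograd.grad(loss, x_in)[0]
    return grad / n_eot

# Example (placeholders): BPDA+EOT gradient for a purifier + classifier pipeline.
# defense = lambda x: classifier(BPDAIdentity.apply(x, purify))
# g = eot_gradient(x_adv, labels, defense)
```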
B.3. Implementation Details of Baselines

Purification models on CelebA-HQ. We mainly consider state-of-the-art VAEs and GANs as purification models for comparison; in particular, we use NVAE (Vahdat & Kautz, 2020) and StyleGAN2 (Karras et al., 2020) in our experiments. To use NVAE as a purification model, we directly pass the adversarial images to its encoder and take the purified images from its decoder. To use StyleGAN2 as a purification model, we consider three GAN inversion methods that first invert adversarial images into the latent space of StyleGAN2 and then obtain the purified images from the StyleGAN2 generator. The three GAN inversion methods used in our experiments are as follows (a sketch of the optimization-based variant is given after this list):

GAN+OPT, an optimization-based GAN inversion method that minimizes the perceptual distance between the output image and the input image w.r.t. the w+ latent code. We use the codebase https://github.com/rosinality/stylegan2-pytorch/blob/master/projector.py, which closely follows the idea of (Karras et al., 2020), for the GAN+OPT implementation. The only difference is that we use n = 500 optimization iterations to save computation time, whereas the original number of optimization iterations is n = 1000. We find that for n ≥ 500, the recovered images of GAN+OPT do not change much as n increases further.

GAN+ENC, an encoder-based GAN inversion method that uses an extra encoder to map the input image to the w+ latent code. We use the codebase https://github.com/eladrich/pixel2style2pixel, corresponding to the idea of pixel2style2pixel (pSp) (Richardson et al., 2021), for the GAN+ENC implementation.

GAN+ENC+OPT, which combines the optimization-based and encoder-based GAN inversion methods. We use the codebase https://github.com/chail/gan-ensembling, corresponding to the idea of (Chai et al., 2021), which uses the w+ latent code from the encoder as the initial point for the optimization with n = 500 iterations.
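For reference, a minimal sketch (assumed interfaces, not the baselines' actual scripts) of the optimization-based inversion behind GAN+OPT: generator(w) stands in for a StyleGAN2-style generator that takes a w+ code and perceptual for an LPIPS-like distance; the real projector additionally uses noise regularization and a learning-rate schedule. GAN+ENC+OPT differs only in that w_init comes from the pSp-style encoder.

```python
import torch

def gan_opt_purify(x_adv, generator, perceptual, w_init, n_iters=500, lr=0.1):
    # Optimize the w+ latent code so the generated image matches the input
    # perceptually, then return the reconstruction as the purified image.
    w = w_init.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        loss = perceptual(generator(w), x_adv).mean()   # perceptual reconstruction loss
        loss.backward()
        opt.step()
    return generator(w).detach()                        # purified image
```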
B.4. Other Sampling Strategies

Here we provide more details of the other sampling strategies based on the same pre-trained diffusion models.

LD-SDE. We denote an adversarial image by $x_a$ and the corresponding clean image by $x$. With Langevin dynamics (LD) sampling for purification, given $x_a$ one searches for $x$ freely (Hill et al., 2021; Yoon et al., 2021). Our approach can instead be considered conditional sampling of the clean image given the adversarial image using $p(x|x_a) = \int p(x(t^*)|x_a)\,p(x|x(t^*))\,dx(t^*)$, where $p(x(t^*)|x_a)$ is first sampled by following the forward diffusion process and $p(x|x(t^*))$ is then sampled by following the reverse diffusion process. However, one can also formulate this conditional sampling without introducing forward and reverse diffusion processes. We can write $p(x|x_a) \propto p(x)\,p(x_a|x)$, where $p(x)$ represents the distribution of clean images and $p(x_a|x)$ is the distribution of the adversarial image given the clean image, which we approximate by a simple Gaussian $p(x_a|x) = \mathcal{N}(x_a; x, \sigma^2 I_d)$. We also do not have access to the true clean data distribution, but we assume $p(x) \propto e^{h(x)}$ is given by a trained generator (more on this below). As a result, $p(x|x_a) \propto e^{-E(x|x_a)}$ with the energy function
$$E(x|x_a) = -h(x) + \frac{\|x_a - x\|_2^2}{2\sigma^2}.$$
Sampling from $p(x|x_a)$ can then be done by running the overdamped LD:
$$x_{t+1} = x_t - \frac{\lambda\Delta t}{2}\nabla_{x_t}E(x_t|x_a) + \eta\sqrt{\lambda\Delta t}\,\epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I_d),$$
where $\lambda\Delta t$ is the learning rate and $\eta$ denotes the damping coefficient. When $\Delta t$ is infinitesimally small, this corresponds to solving the following forward SDE (termed LD-SDE):
$$dx = -\frac{\lambda}{2}\nabla_x E(x|x_a)\,dt + \eta\sqrt{\lambda}\,dw,$$
for $t \in [0,1]$, where $w$ is the standard Wiener process. Note that the SDE above does not involve any diffusion or denoising; it only uses LD to sample from a fixed energy function. Recall that $\nabla_x h(x)$ is exactly what the score function gives at timestep $t=0$ in diffusion models, i.e., $\nabla_x h(x) \approx s_\theta(x, 0)$. Thus, the LD-SDE formulation is
$$dx = \frac{\lambda}{2}\Big(s_\theta(x,0) + \frac{x_a - x}{\sigma^2}\Big)dt + \eta\sqrt{\lambda}\,dw. \quad (26)$$
Note that there are three hyperparameters $(\sigma^2, \lambda, \eta)$ to tune for the best performance. In particular, $\sigma^2$ controls the balance between the attraction term $x_a - x$ (which keeps $x$ close to $x_a$) and the score function $s_\theta(x,0)$ (which makes $x$ follow the clean data distribution). When $\sigma^2$ becomes infinitely large, the attraction term $x_a - x$ vanishes, and the LD-SDE defined in Eq. (26) reduces to the SDE formulation of standard LD sampling (Grathwohl et al., 2020; Hill et al., 2021; Yoon et al., 2021). Note that with this SDE formulation of LD sampling, we can use the adjoint method discussed in Section 3.2 for evaluation against strong adaptive attacks, such as AutoAttack. In experiments, to find the best set of hyperparameters $(\sigma^2, \lambda, \eta)$, we perform a grid search over $\sigma^2 \in \{0.001, 0.01, 0.1, 1, 10, 100\}$, $\lambda \in \{0.01, 0.1, 1, 10\}$ and $\eta \in \{0.1, 1, 5, 10\}$. We find that the best-performing configuration is $\sigma^2 = 100$, $\lambda = 0.1$, $\eta = 1.0$. Since $\sigma^2 = 100$ works best, it implies that LD-SDE performs better without the attraction term $x_a - x$ in Eq. (26).

Table 7. Robust accuracies of baselines obtained from RobustBench vs. from our experiments against the ℓ∞ threat model (ϵ = 8/255) with AutoAttack on CIFAR-10, for different classifier architectures.

| Method | Extra Data | Robust Acc (from RobustBench) | Robust Acc (from our experiments) |
|---|---|---|---|
| WideResNet-28-10 | | | |
| (Zhang et al., 2020) | | 59.64 | 59.96 |
| (Wu et al., 2020) | | 60.04 | 62.11 |
| (Gowal et al., 2020) | | 62.80 | 62.70 |
| (Wu et al., 2020) | | 56.17 | 59.18 |
| (Rebuffi et al., 2021) | | 60.75 | 61.72 |
| (Gowal et al., 2021) | | 63.44 | 65.24 |
| WideResNet-70-16 | | | |
| (Gowal et al., 2020) | | 65.88 | 66.02 |
| (Rebuffi et al., 2021) | | 66.58 | 68.56 |
| (Gowal et al., 2020) | | 57.20 | 59.57 |
| (Rebuffi et al., 2021) | | 64.25 | 64.46 |
| (Gowal et al., 2021) | | 66.11 | 66.60 |

VP-ODE. For the reverse generative VP-SDE defined by
$$dx = -\frac{1}{2}\beta(t)\big[x + 2\nabla_x\log p_t(x)\big]dt + \sqrt{\beta(t)}\,d\bar{w}, \quad (27)$$
where time flows backward from $t=1$ to $t=0$ and $\bar{w}$ is the reverse-time standard Wiener process, Song et al. (2021b) show that there exists an equivalent ODE whose trajectories share the same marginal probability densities $p_t(x(t))$:
$$dx = -\frac{1}{2}\beta(t)\big[x + \nabla_x\log p_t(x)\big]dt, \quad (28)$$
where the idea is to use the Fokker-Planck equation (Särkkä & Solin, 2019) to transform an SDE into an ODE (see Appendix D.1 of (Song et al., 2021b) for more details). Therefore, we can use the above ODE, termed VP-ODE, to replace the reverse generative VP-SDE in our method for purification. Similarly, with VP-ODE sampling we can also use the adjoint method for evaluation against strong adaptive attacks, such as AutoAttack.
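A brief sketch (ours, not the paper's implementation) of purification with the VP-ODE in Eq. (28): a plain Euler integration from t* down to 0, reusing the assumed score_model interface from the earlier sketch. Because the update is deterministic, gradients for adaptive attacks can be obtained either by backpropagating through the unrolled loop or with an ODE adjoint solver.

```python
import torch

def beta(t, beta_min=0.1, beta_max=20.0):             # assumed linear schedule
    return beta_min + t * (beta_max - beta_min)

def vp_ode_purify(x_t, score_model, t_star=0.1, dt=1e-3):
    # Euler integration of dx = -0.5 * beta(t) * [x + score(x, t)] dt from t* to 0.
    x, t = x_t, t_star
    while t > 0:
        drift = -0.5 * beta(t) * (x + score_model(x, t))
        x = x - drift * dt                             # step with a negative time increment
        t -= dt
    return x
```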
B.5. Gradient Computation in an Analytic Example

Here we provide a simple example showing that the gradient obtained from the adjoint method closely matches its ground-truth value when the SDE solver has a small numerical error. In this example, we know the analytic solution of the gradient through a reverse-time SDE, so we can compare the gradient from solving the augmented SDE in Eq. (6) with its analytic value. Specifically, we assume the data follow a Gaussian distribution, i.e., $x \sim p_0(x) = \mathcal{N}(\mu_0, \sigma_0^2)$. A nice property of the diffusion process in Eq. (3) is that if $p_0(x)$ is Gaussian, then $p_t(x)$ is Gaussian for all $t$. From Eq. (3) for the VP-SDE, we have $p_t(x) = \mathcal{N}(\mu_t, \sigma_t^2)$, where $\mu_t = \mu_0\sqrt{\alpha_t}$ and $\sigma_t^2 = 1 - (1-\sigma_0^2)\alpha_t$, with $\alpha_t := e^{-\int_0^t\beta(s)ds}$. Therefore, if we fix $\mu_0$ and $\sigma_0$ to particular values, we can easily evaluate $p_{t^*}(x)$ at the diffusion timestep $t^*$. Recall that the reverse process can be described as
$$dx = -\frac{1}{2}\beta(t)\big[x + 2\nabla_x\log p_t(x)\big]dt + \sqrt{\beta(t)}\,d\bar{w},$$
where $\bar{w}(t)$ is a standard reverse-time Wiener process. Since we know $p_t(x)$ analytically, we can write $\nabla_x\log p_t(x) = -\frac{x-\mu_t}{\sigma_t^2}$. So the reverse process is
$$x(0) = \mathrm{sdeint}\big(x(t^*),\ f_{\mathrm{rev}},\ g_{\mathrm{rev}},\ \bar{w},\ t^*,\ 0\big), \quad (29)$$
where the drift and diffusion coefficients are given by
$$f_{\mathrm{rev}}(x,t) := -\frac{\beta(t)}{2\sigma_t^2}\big[(\sigma_t^2 - 2)x + 2\mu_t\big], \qquad g_{\mathrm{rev}}(t) := \sqrt{\beta(t)}.$$
Then, we can compute the gradient (denoted by $\phi_{\mathrm{adj}}$) of $x(0)$ w.r.t. $x(t^*)$ through the SDE in Eq. (29) using the adjoint method in Eq. (6), where we use the objective $L := x(0)$ for simplicity. On the other hand, letting $x(t) := x_t$, we can also evaluate $p(x_0|x_t)$:
$$p(x_0|x_t) \propto p(x_0)\,p(x_t|x_0) \propto \exp\Big(-\frac{(x_0-\mu_0)^2}{2\sigma_0^2} - \frac{(x_t - x_0\sqrt{\alpha_t})^2}{2(1-\alpha_t)}\Big) \propto \exp\Big(-\frac{(1-\alpha_t+\sigma_0^2\alpha_t)x_0^2 - 2\big((1-\alpha_t)\mu_0 + \sigma_0^2\sqrt{\alpha_t}\,x_t\big)x_0}{2(1-\alpha_t)\sigma_0^2}\Big),$$
so that
$$p(x_0|x_t) = \mathcal{N}\Big(\frac{(1-\alpha_t)\mu_0 + \sigma_0^2\sqrt{\alpha_t}\,x_t}{1-\alpha_t+\sigma_0^2\alpha_t},\ \frac{(1-\alpha_t)\sigma_0^2}{1-\alpha_t+\sigma_0^2\alpha_t}\Big),$$
where we use $p(x_t|x_0) = \mathcal{N}(x_0\sqrt{\alpha_t},\, 1-\alpha_t)$ following Eq. (3). Thus, the analytic solution (denoted by $\phi_{\mathrm{ana}}$) of the gradient of $x(0)$ w.r.t. $x(t^*)$ is
$$\phi_{\mathrm{ana}} := \frac{\partial x(0)}{\partial x(t^*)} = \frac{\sigma_0^2\sqrt{\alpha_{t^*}}}{1-\alpha_{t^*}+\sigma_0^2\alpha_{t^*}}. \quad (31)$$
In experiments, we set $\mu_0 = 0$ and $\sigma_0^2 \in \{0.01, 0.05, 0.1, 0.5, 1.0\}$, and we use the Euler-Maruyama method to solve our SDEs with step sizes of different scales. The difference between the numeric gradient and the analytic gradient vs. the step size is shown in Figure 5, where the numeric error of gradients is measured by $|\phi_{\mathrm{ana}} - \phi_{\mathrm{adj}}|/|\phi_{\mathrm{ana}}|$. As the step size gets smaller, the numeric error monotonically decreases at the same rate across settings. This implies that the gradient obtained from the adjoint method closely matches its ground-truth value when the step size of the SDE solver is small.

[Figure 5. Impact of the step size (from $10^{-5}$ to $10^{-1}$) of the Euler-Maruyama method used to solve our SDEs on the numeric error of gradients, for different $\sigma_0^2$ values. As the step size gets smaller, the numeric error monotonically decreases at the same rate in different settings.]
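The toy check below (our own sketch, not the paper's script) mirrors this setup for the 1-D Gaussian case: it differentiates through a plain Euler-Maruyama discretization of Eq. (29) with autograd, instead of the adjoint solver, and compares the result against the analytic value in Eq. (31); the two agree up to the discretization error studied in Figure 5. The schedule constants and the chosen $\sigma_0^2$ and $t^*$ are assumed values.

```python
import torch

BETA_MIN, BETA_MAX = 0.1, 20.0
beta = lambda t: BETA_MIN + (BETA_MAX - BETA_MIN) * t
alpha = lambda t: torch.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2))

def reverse_sde_x0(x_tstar, t_star, sigma0_sq, mu0=0.0, dt=1e-3):
    # Euler-Maruyama integration of Eq. (29) from t* down to 0 with the
    # analytic Gaussian score, keeping the autograd graph for the gradient.
    x, t = x_tstar, torch.tensor(float(t_star))
    while t > 0:
        a = alpha(t)
        mu_t = mu0 * torch.sqrt(a)
        var_t = 1.0 - (1.0 - sigma0_sq) * a
        score = -(x - mu_t) / var_t                       # analytic score of p_t
        drift = -0.5 * beta(t) * (x + 2.0 * score)        # reverse-SDE drift f_rev
        x = x + drift * (-dt) + torch.sqrt(beta(t) * dt) * torch.randn_like(x)
        t = t - dt
    return x

t_star, sigma0_sq = 0.5, 0.1
x_t = torch.randn((), requires_grad=True)                 # a sample of x(t*)
phi_num = torch.autograd.grad(reverse_sde_x0(x_t, t_star, sigma0_sq), x_t)[0]
a = alpha(torch.tensor(t_star))
phi_ana = sigma0_sq * torch.sqrt(a) / (1.0 - a + sigma0_sq * a)
print(f"numeric {float(phi_num):.4f} vs analytic {float(phi_ana):.4f}")
```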
C. More Experimental Results

C.1. Robust Accuracies of Our Method for Standard Attack and Black-box Attack

In general, adaptive attacks are considered to be stronger than standard (i.e., non-adaptive) attacks. Following the checklist of Croce et al. (2022), we report the performance of DiffPure against standard attacks in Table 8. We can see that (1) AutoAttack is effective on the static model, as its robust accuracies are zero, and (2) standard attacks are not effective on our method, as our robust accuracies against standard attacks are much better than those against adaptive attacks (cf. Tables 1-3).

The AutoAttack (STANDARD version) we have considered includes the black-box Square Attack, but evaluating with Square Attack separately is a more direct way to show insensitivity to gradient masking. We thus show the performance of DiffPure against Square Attack in Table 9. We can see that (1) our method has much higher robust accuracies against Square Attack than the static model, and (2) our robust accuracies against Square Attack are higher than those against AutoAttack (cf. Tables 1-3). These results directly show DiffPure's insensitivity to gradient masking.

Table 8. Robust accuracies of our method against standard attacks, where we transfer AutoAttack from the static model (i.e., the classifier) to our method DiffPure; the other evaluation settings and hyperparameters are the same as those in Tables 1-3.

| Dataset | Network | ℓp-norm | Static Model | Ours |
|---|---|---|---|---|
| CIFAR-10 | WRN-28-10 | ℓ∞ | 0.00 | 89.58 ± 0.49 |
| CIFAR-10 | WRN-28-10 | ℓ2 | 0.00 | 90.37 ± 0.24 |
| ImageNet | ResNet-50 | ℓ∞ | 0.00 | 67.01 ± 0.97 |

Table 9. Robust accuracies of the static model (i.e., the classifier) and our method DiffPure against Square Attack, where the other evaluation settings and hyperparameters are the same as those in Tables 1-3.

| Dataset | Network | ℓp-norm | Static Model | Ours |
|---|---|---|---|---|
| CIFAR-10 | WRN-28-10 | ℓ∞ | 0.33 | 85.42 ± 0.65 |
| CIFAR-10 | WRN-28-10 | ℓ2 | 21.42 | 88.02 ± 0.23 |
| ImageNet | ResNet-50 | ℓ∞ | 9.25 | 62.88 ± 0.65 |

Table 10. Robust accuracies of baselines obtained from RobustBench vs. from our experiments against the ℓ2 threat model (ϵ = 0.5) with AutoAttack on CIFAR-10, for different classifier architectures. Methods marked in the original table use WideResNet-34-10, with the same width but more layers than the default one.

| Method | Extra Data | Robust Acc (from RobustBench) | Robust Acc (from our experiments) |
|---|---|---|---|
| WideResNet-28-10 | | | |
| (Augustin et al., 2020) | | 76.25 | 77.93 |
| (Rony et al., 2019) | | 66.44 | 66.41 |
| (Ding et al., 2020) | | 66.09 | 67.77 |
| (Wu et al., 2020) | | 73.66 | 72.85 |
| (Sehwag et al., 2021) | | 76.12 | 75.39 |
| (Rebuffi et al., 2021) | | 78.80 | 78.32 |
| WideResNet-70-16 | | | |
| (Gowal et al., 2020) | | 80.53 | 79.88 |
| (Rebuffi et al., 2021) | | 82.32 | 81.44 |
| (Gowal et al., 2020) | | 74.50 | 74.03 |
| (Rebuffi et al., 2021) | | 80.42 | 80.86 |

C.2. Robust Accuracies of Baselines Obtained from RobustBench vs. from Our Experiments

When we compare with the state-of-the-art adversarial training methods in the RobustBench benchmark, we use the default hyperparameters for the AutoAttack evaluation. However, since the computational cost of evaluating our method with AutoAttack is high (usually taking 50-150 function evaluations per attack iteration), we compare the robust accuracy of our method with baselines on a fixed subset of 512 images randomly sampled from the test set. To show the validity of the results on this subset, we compare the robust accuracies of baselines reported by RobustBench (on the whole test set) vs. from our experiments (on the sampled subset), shown in Tables 7-11. We can see that for different datasets (CIFAR-10 and ImageNet) and network architectures (ResNet and WideResNet), the gap in robust accuracies of most baselines is small (i.e., less than 1.5% discrepancy). Furthermore, the relative performance of different methods remains the same. These results demonstrate that it is both efficient and effective to evaluate on the fixed subset.

Table 11. Robust accuracies of baselines obtained from RobustBench vs. from our experiments against the ℓ∞ threat model (ϵ = 4/255) with AutoAttack on ImageNet, for different classifier architectures.

| Method | Extra Data | Robust Acc (from RobustBench) | Robust Acc (from our experiments) |
|---|---|---|---|
| ResNet-50 | | | |
| (Engstrom et al., 2019) | | 29.22 | 31.06 |
| (Wong et al., 2020) | | 26.24 | 26.95 |
| (Salman et al., 2020) | | 34.96 | 37.89 |
| WideResNet-50-2 | | | |
| (Salman et al., 2020) | | 38.14 | 39.25 |

C.3. More Results of Comparison within Adversarial Purification on CelebA-HQ

Here we compare with other adversarial purification methods using the BPDA+EOT attack with ℓ∞ perturbations on the smiling attribute classifier for CelebA-HQ. The results are shown in Table 12. We can see that our method still largely outperforms all the baselines, with an absolute improvement of at least +18.78% in robust accuracy. Compared with the eyeglasses attribute, the smiling attribute is more difficult to classify, posing a bigger challenge to the defense method. Thus, the robust accuracies of most defense methods are much worse than those with the eyeglasses attribute classifier.
Table 12. Evaluation with the BPDA+EOT attack on the smiling attribute classifier for CelebA-HQ, where ϵ = 16/255 for the ℓ∞ perturbations. Note that OPT and ENC denote the optimization-based and encoder-based GAN inversions, respectively, and ENC+OPT denotes a combination of OPT and ENC.

| Method | Purification | Standard Acc | Robust Acc |
|---|---|---|---|
| (Vahdat & Kautz, 2020) | VAE | 93.55 | 0.00 |
| (Karras et al., 2020) | GAN+OPT | 93.49 | 3.41 |
| (Chai et al., 2021) | GAN+ENC+OPT | 93.68 | 0.78 |
| (Richardson et al., 2021) | GAN+ENC | 90.55 | 40.40 |
| Ours (t* = 0.4) | Diffusion | 89.78 ± 0.14 | 55.73 ± 0.97 |
| Ours (t* = 0.5) | Diffusion | 87.62 ± 0.22 | 59.12 ± 0.37 |

C.4. More Results of Ablation Studies

Impact of EOT. Because of the stochasticity in our purification process, we seek the EOT value that is sufficient for each threat model when evaluating our method. In Figure 6, we present the robust accuracies of our method against three threat models (ℓ∞, ℓ2 and StAdv) for different numbers of EOT samples. These threat models behave differently as EOT grows: the robust accuracy against the ℓ2 threat model appears unaffected by EOT, whereas the robust accuracies against the ℓ∞ and StAdv threat models first decrease and then saturate as the number of EOT samples increases. In particular, the robust accuracy saturates at EOT=5 for ℓ∞ and at EOT=20 for StAdv. Therefore, we use EOT=20 by default for all our experiments unless stated otherwise, which should be sufficient for these threat models.

[Figure 6. Impact of EOT (from 0 to 40) on robust accuracies against the ℓ∞ (ϵ = 8/255), ℓ2 (ϵ = 0.5) and StAdv (ϵ = 0.05) threat models, respectively, where we evaluate on WideResNet-28-10 for CIFAR-10.]

Randomizing the diffusion timestep t*. Since randomness matters for adversarial purification methods (Hill et al., 2021; Yoon et al., 2021), we consider introducing another source of randomness by randomizing the diffusion timestep t*: instead of using a fixed t*, we uniformly sample t* from the range [t* − Δt, t* + Δt] for every diffusion process. Table 13 shows the robustness performance for different Δt, where a larger Δt means stronger randomness introduced by perturbing t*. We can see that the mean of the standard accuracy monotonically decreases and the variance of the robust accuracy monotonically increases with Δt due to the stronger randomness. Besides, a slightly small Δt may improve the robust accuracy on average, while a large Δt may hurt it. This implies that there may also exist a sweet spot for the perturbation strength Δt of the diffusion timestep t* to obtain the best robustness performance.

Table 13. Uniformly sampling the diffusion timestep t* from [t* − Δt, t* + Δt], where t* = 0.1. We evaluate with AutoAttack (the RAND version with EOT=20) ℓ∞ (ϵ = 8/255) on WideResNet-28-10 for CIFAR-10. Note that Δt = 0 reduces to our method with a fixed t*.

| Δt | Standard Acc | Robust Acc |
|---|---|---|
| 0 | 89.02 ± 0.21 | 70.64 ± 0.39 |
| 0.015 | 88.86 ± 0.22 | 72.14 ± 1.45 |
| 0.025 | 88.04 ± 0.33 | 69.21 ± 2.74 |

C.5. Purifying Adversarial Examples of Standard Attribute Classifiers

In Figure 7, we provide more visual examples of how our method purifies the adversarial examples of standard classifiers.

C.6. Inference Time with and without DiffPure

Inference time (in seconds) for varying diffusion timestep t* is reported in Table 14, where the inference time increases linearly with t*. We believe our inference time can be reduced by recent fast sampling methods for diffusion models, but we leave this as future work.
Table 14. Inference time with DiffPure (t* > 0) and without DiffPure (t* = 0) for a single image on an NVIDIA V100 GPU, where the time increase over t* = 0 is given in parentheses.

| Dataset | Network | t* = 0 | t* = 0.05 | t* = 0.1 | t* = 0.15 |
|---|---|---|---|---|---|
| CIFAR-10 | WRN-28-10 | 0.055 | 5.12 (×93) | 10.56 (×190) | 15.36 (×278) |
| ImageNet | ResNet-50 | 0.062 | 5.58 (×90) | 11.13 (×179) | 17.14 (×276) |

C.7. Crafting Examples Just for the Diffusion Model

It is interesting to see whether the adversary can craft examples just for the diffusion model, such that the images recovered by DiffPure differ from the original clean images, which may also result in misclassification. To this end, we use APGD (EOT=20) to attack the diffusion model only, by maximizing the mean squared error (MSE) between the diffusion model's outputs and the input images. The results are given in Table 15. We see that attacking the diffusion model alone is less effective than attacking the whole defense system.

Table 15. Robust accuracies of attacking the diffusion model only ("Diffusion only") and the whole defense system ("Diffusion+Clf") using APGD ℓ∞ and ℓ2 on CIFAR-10.

| ℓp-norm | Network | t* | Diffusion only | Diffusion+Clf |
|---|---|---|---|---|
| ℓ∞ (ϵ = 8/255) | WRN-28-10 | 0.1 | 85.04 ± 0.86 | 75.91 ± 0.74 |
| ℓ2 (ϵ = 0.5) | WRN-28-10 | 0.075 | 90.82 ± 0.42 | 84.83 ± 0.09 |

[Figure 7. Our method purifies adversarial examples (first column) produced by attacking attribute classifiers using PGD ℓ∞ (ϵ = 16/255), where t* = 0.3. The middle three columns show the results of the SDE in Eq. (4) at timesteps t = 0.3, 0.15 and 0; the purified images at t = 0 match the clean images (last column). Panels: (a) Smiling; (b) Eyeglasses. Zoom in to see how the adversarial perturbations are removed.]