# DiffAttack: Evasion Attacks Against Diffusion-Based Adversarial Purification

Mintong Kang (UIUC, mintong2@illinois.edu), Dawn Song (UC Berkeley, dawnsong@berkeley.edu), Bo Li (UIUC, lbo@illinois.edu)

Abstract

Diffusion-based purification defenses leverage diffusion models to remove crafted perturbations of adversarial examples and achieve state-of-the-art robustness. Recent studies show that even advanced attacks cannot break such defenses effectively, since the purification process induces an extremely deep computational graph which poses the potential problems of vanishing/exploding gradients, high memory cost, and unbounded randomness. In this paper, we propose an attack technique, DiffAttack, to perform effective and efficient attacks against diffusion-based purification defenses, including both DDPM-based and score-based approaches. In particular, we propose a deviated-reconstruction loss at intermediate diffusion steps to induce inaccurate density gradient estimation and thus tackle the problem of vanishing/exploding gradients. We also provide a segment-wise forwarding-backwarding algorithm, which leads to memory-efficient gradient backpropagation. We validate the attack effectiveness of DiffAttack compared with existing adaptive attacks on CIFAR-10 and ImageNet. We show that DiffAttack decreases the robust accuracy of models compared with SOTA attacks by over 20% on CIFAR-10 under the ℓ∞ attack (ϵ = 8/255), and by over 10% on ImageNet under the ℓ∞ attack (ϵ = 4/255). We conduct a series of ablation studies, and we find that 1) DiffAttack with the deviated-reconstruction loss added over uniformly sampled time steps is more effective than that added over only initial/final steps, and 2) diffusion-based purification with a moderate diffusion length is more robust under DiffAttack.

1 Introduction

Since deep neural networks (DNNs) are found to be vulnerable to adversarial perturbations [52, 20], improving the robustness of neural networks against such crafted perturbations has become important, especially in safety-critical applications [18, 5, 54]. In recent years, many defenses have been proposed, but they have been broken again by more advanced adaptive attacks [7, 30, 11, 12]. One recent line of defense (diffusion-based purification) leverages diffusion models to purify the input images and achieves state-of-the-art robustness. Based on the type of diffusion model the defense utilizes, diffusion-based purification can be categorized into score-based purification [34], which uses the score-based diffusion model [49], and DDPM-based purification [4, 62, 57, 51, 55, 56], which uses the denoising diffusion probabilistic model (DDPM) [25]. Recent studies show that even the most advanced attacks [12, 34] cannot break these defenses due to the challenges of vanishing/exploding gradients, high memory cost, and large randomness.

In this paper, we aim to explore the vulnerabilities of such diffusion-based purification defenses and design a more effective and efficient adaptive attack against them, which will help to better understand the properties of the diffusion process and motivate future defenses. In particular, diffusion-based purification defenses utilize diffusion models to first diffuse the adversarial examples with Gaussian noise and then perform sampling to remove the noise.
In this way, the hope is that the crafted adversarial perturbations can also be removed, since the training distribution of diffusion models is clean [49, 25]. The diffusion length (i.e., the total number of diffusion time steps) is usually large, and at each time step a deep neural network is used to estimate the gradient of the data distribution. This results in an extremely deep computational graph that poses great challenges for attacking it: vanishing/exploding gradients, unaffordable memory cost, and large randomness.

To tackle these challenges, we propose a deviated-reconstruction loss and a segment-wise forwarding-backwarding algorithm and integrate them into an effective and efficient attack technique, DiffAttack. Essentially, our deviated-reconstruction loss pushes the reconstructed samples away from the diffused samples at corresponding time steps. It is added at multiple intermediate time steps to relieve the problem of vanishing/exploding gradients. We also theoretically analyze its connection to the score-matching loss [26], and we prove that maximizing the deviated-reconstruction loss induces an inaccurate estimation of the density gradient of the data distribution, leading to a higher chance of successful attacks. To overcome the problem of large memory cost, we propose a segment-wise forwarding-backwarding algorithm to backpropagate the gradients through a long path. Concretely, we first perform a forward pass and store intermediate samples, and then iteratively simulate the forward pass of one segment and backpropagate the gradient following the chain rule. Ignoring the memory cost induced by storing samples (small compared with the computational graph), our approach achieves O(1) memory cost.

Finally, we integrate the deviated-reconstruction loss and the segment-wise forwarding-backwarding algorithm into DiffAttack and empirically validate its effectiveness on CIFAR-10 and ImageNet. We find that (1) DiffAttack outperforms existing attack methods [34, 60, 53, 1, 2] by a large margin for both score-based and DDPM-based purification defenses, especially under large perturbation radii; (2) the memory cost of our efficient segment-wise forwarding-backwarding algorithm does not scale up with the diffusion length and saves more than 10x memory compared with the baseline [4]; (3) a moderate diffusion length benefits the robustness of diffusion-based purification, since a longer length hurts the benign accuracy while a shorter length makes the defense easier to attack; (4) attacks with the deviated-reconstruction loss added over uniformly sampled time steps outperform those added over only initial/final time steps. The effectiveness of DiffAttack and these findings motivate us to better understand and rethink the robustness of diffusion-based purification defenses. We summarize the main technical contributions as follows:

- We propose DiffAttack, a strong evasion attack against diffusion-based adversarial purification defenses, including score-based and DDPM-based purification.
- We propose a deviated-reconstruction loss to tackle the problem of vanishing/exploding gradients, and theoretically analyze its connection with data density estimation.
- We propose a segment-wise forwarding-backwarding algorithm to tackle the high memory cost challenge, and we are the first to adaptively attack the DDPM-based purification defense, which is hard to attack due to the high memory cost.
- We empirically demonstrate that DiffAttack outperforms existing attacks by a large margin on CIFAR-10 and ImageNet. In particular, DiffAttack decreases the model robust accuracy by over 20% under the ℓ∞ attack (ϵ = 8/255) on CIFAR-10, and by over 10% on ImageNet under the ℓ∞ attack (ϵ = 4/255).
- We conduct a series of ablation studies and show that (1) a moderate diffusion length benefits the model robustness, and (2) attacks with the deviated-reconstruction loss added over uniformly sampled time steps outperform those added over only initial/final time steps.

2 Preliminary

There are two types of diffusion-based purification defenses, DDPM-based purification and score-based purification, which leverage DDPM [46, 25] and score-based diffusion models [49] to purify the adversarial examples, respectively. Next, we introduce the basic concepts of DDPM and score-based diffusion models. Denote the diffusion process indexed by time step $t$ with diffusion length $T$ by $\{x_t\}_{t=0}^{T}$.

DDPM constructs a discrete Markov chain $\{x_t\}_{t=0}^{T}$ with discrete time variable $t$ following $p(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$, where $\{\beta_t\}$ is a sequence of positive noise scales (e.g., linear scheduling, cosine scheduling [33]). Letting $\alpha_t := 1-\beta_t$, $\bar\alpha_t := \Pi_{s=1}^{t}\alpha_s$, and $\sigma_t^2 := \beta_t(1-\bar\alpha_{t-1})/(1-\bar\alpha_t)$, the reverse process (i.e., sampling process) can be formulated as:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z \tag{1}$$

where $z$ is drawn from $\mathcal{N}(0, I)$. $\epsilon_\theta$, parameterized with $\theta$, is the model that approximates the perturbation $\epsilon$ in the diffusion process and is trained via the density gradient loss $\mathcal{L}_d$:

$$\mathcal{L}_d = \mathbb{E}_{t, x_0, \epsilon}\left[\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t(1-\bar\alpha_t)}\,\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\big)\big\|_2^2\right] \tag{2}$$

where $\epsilon$ is drawn from $\mathcal{N}(0, I)$ and $t$ is uniformly sampled from $[T] := \{1, 2, \dots, T\}$.

Score-based diffusion models formulate diffusion with stochastic differential equations (SDEs). The diffusion process $\{x_t\}_{t=0}^{T}$ is indexed by a continuous time variable $t \in [0, 1]$ and can be formulated as:

$$dx = f(x, t)\,dt + g(t)\,dw \tag{3}$$

where $f(\cdot, t): \mathbb{R}^n \to \mathbb{R}^n$ is the drift coefficient characterizing the shift of the distribution, $g(t)$ is the diffusion coefficient controlling the noise scales, and $w$ is the standard Wiener process. The reverse process is characterized via the reverse-time SDE of Equation (3):

$$dx = \big[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\big]\,dt + g(t)\,d\bar{w} \tag{4}$$

where $\nabla_x \log p_t(x)$ is the time-dependent score function, which can be approximated with a neural network $s_\theta$ parameterized with $\theta$ and trained via the score-matching loss $\mathcal{L}_s$ [26, 47]:

$$\mathcal{L}_s = \mathbb{E}_t\Big[\lambda(t)\,\mathbb{E}_{x_t \mid x_0}\big\|s_\theta(x_t, t) - \nabla_{x_t}\log p(x_t \mid x_0)\big\|_2^2\Big] \tag{5}$$

where $\lambda: [0, 1] \to \mathbb{R}$ is a weighting function and $t$ is uniformly sampled over $[0, 1]$.
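To make the sampling step in Equation (1) concrete, a minimal PyTorch-style sketch of one DDPM reverse step is given below. The function name, the β schedule handling, and the time-step encoding passed to `eps_model` are illustrative assumptions, not the exact implementation used by the evaluated defenses.

```python
import torch

def ddpm_reverse_step(x_t, t, eps_model, betas):
    """One DDPM reverse step (Eq. (1)): sample x_{t-1} from x_t.

    x_t:       (B, C, H, W) noisy images at step t
    t:         integer time step (1-indexed as in the paper)
    eps_model: network epsilon_theta(x_t, t) predicting the added noise
    betas:     (T,) noise schedule beta_1, ..., beta_T
    """
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    alpha_t = alphas[t - 1]
    alpha_bar_t = alpha_bars[t - 1]
    alpha_bar_prev = alpha_bars[t - 2] if t > 1 else torch.tensor(1.0)

    # the time-step encoding convention depends on the specific model
    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
    eps = eps_model(x_t, t_batch)

    mean = (x_t - (1.0 - alpha_t) / torch.sqrt(1.0 - alpha_bar_t) * eps) / torch.sqrt(alpha_t)

    # sigma_t^2 = beta_t * (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t)
    sigma_t = torch.sqrt(betas[t - 1] * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t))
    z = torch.randn_like(x_t) if t > 1 else torch.zeros_like(x_t)
    return mean + sigma_t * z
```

Running hundreds of such steps back to back is what produces the very deep, stochastic computational graph that the attacks in this paper have to differentiate through.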
3 DiffAttack

3.1 Evasion attacks against diffusion-based purification

A class of defenses leverages generative models for adversarial purification [43, 48, 45, 60]. The adversarial images are transformed into latent representations, and then the purified images are sampled starting from the latent space using the generative models. The process is expected to remove the crafted perturbations since the training distribution of the generative models is assumed to be clean. With diffusion models recently showing their power in image generation [15, 39], diffusion-based adversarial purification has achieved SOTA defense performance [34, 4].

We first formulate the problem of evasion attacks against diffusion-based purification defenses. Suppose that the process of diffusion-based purification, including the diffusion and reverse processes, is denoted by $P: \mathbb{R}^n \to \mathbb{R}^n$, where $n$ is the dimension of the input $x_0$, and the classifier is denoted by $F: \mathbb{R}^n \to [K]$, where $K$ is the number of classes. Given an input pair $(x_0, y)$, the adversarial example $\tilde{x}_0$ satisfies:

$$\arg\max_{i \in [K]} F_i\big(P(\tilde{x}_0)\big) \neq y \quad \text{s.t.}\quad d(\tilde{x}_0, x_0) \le \delta_{\max} \tag{6}$$

where $F_i(\cdot)$ is the $i$-th element of the output, $d: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is the distance function in the input space, and $\delta_{\max}$ is the perturbation budget. Since directly searching for the adversarial instance $\tilde{x}_0$ based on Equation (6) is challenging, we often use a surrogate loss $\mathcal{L}$ and solve an optimization problem:

$$\max_{\tilde{x}_0}\ \mathcal{L}\big(F(P(\tilde{x}_0)), y\big) \quad \text{s.t.}\quad d(\tilde{x}_0, x_0) \le \delta_{\max} \tag{7}$$

where $P(\cdot)$ is the purification process with DDPM (Equation (1)) or score-based diffusion (Equations (3) and (4)), and the surrogate loss $\mathcal{L}$ is often selected as a classification-guided loss, such as the CW loss [7], the cross-entropy loss, or the difference-of-logits-ratio (DLR) loss [12]. Existing adaptive attack methods such as PGD [30] and APGD [12] approximately solve the optimization problem in Equation (7) by computing the gradients of the loss $\mathcal{L}$ with respect to the decision variable $\tilde{x}_0$ and iteratively updating $\tilde{x}_0$ with the gradients.

Figure 1: DiffAttack against diffusion-based adversarial purification defenses. DiffAttack features the deviated-reconstruction loss that addresses vanishing/exploding gradients and the segment-wise forwarding-backwarding algorithm that leads to memory-efficient gradient backpropagation.

However, we observe that the gradient computation for the diffusion-based purification process is challenging for three reasons: 1) the long sampling process of the diffusion model induces an extremely deep computational graph, which poses the problem of vanishing/exploding gradients [2]; 2) the deep computational graph impedes gradient backpropagation, which requires a high memory cost [60, 4]; and 3) the diffusion and sampling processes introduce large randomness, which makes the calculated gradients unstable and noisy. To address these challenges, we propose a deviated-reconstruction loss (in Section 3.2) and a segment-wise forwarding-backwarding algorithm (in Section 3.3), and design an effective attack algorithm, DiffAttack, by integrating them (in Section 3.4).
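For reference, a minimal sketch of the baseline adaptive attack of Equation (7) is shown below: a PGD-style loop that differentiates through the purification process and the classifier. The names `purify` and `classifier`, as well as the hyper-parameters, are placeholders and not the exact implementation of any particular baseline; a naive loop like this is exactly where the three issues above arise, since the backward pass traverses the full graph of P.

```python
import torch
import torch.nn.functional as F

def pgd_attack_through_purification(x0, y, purify, classifier,
                                     eps=8/255, step_size=2/255, n_iter=40):
    """Approximately solve Eq. (7) with an l_inf PGD loop.

    purify:     differentiable purification process P (diffusion + reverse)
    classifier: downstream classifier F
    """
    x_adv = x0.clone().detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        logits = classifier(purify(x_adv))          # F(P(x~_0))
        loss = F.cross_entropy(logits, y)           # surrogate classification loss
        grad = torch.autograd.grad(loss, x_adv)[0]  # backprop through the whole graph
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()
            x_adv = x0 + (x_adv - x0).clamp(-eps, eps)   # project to the l_inf ball
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```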
3.2 Deviated-reconstruction loss

In general, the surrogate loss $\mathcal{L}$ in Equation (7) is selected as a classification-guided loss, such as the CW loss, the cross-entropy loss, or the DLR loss. However, these losses can only be imposed at the classification layer and induce the problem of vanishing/exploding gradients [2] due to the long diffusion length. Specifically, the diffusion purification process induces an extremely deep graph. For example, DiffPure applies hundreds of sampling iterations and uses a deep UNet with tens of layers as the score estimator. Thus, the computational graph consists of thousands of layers, which can cause gradient vanishing/exploding. Similar gradient problems are also mentioned for generic score-based generative purification (Sections 4 and 5.1 in [60]).

The backward pass differentiable approximation (BPDA) attack [2] is usually adopted to overcome such problems, but a surrogate model of the complicated sampling process is hard to find, and a simple identity mapping is demonstrated to be ineffective in this case [34, 4, 60]. To overcome the problem of exploding/vanishing gradients, we attempt to impose intermediate guidance during the attack. It is possible to build a set of classifiers on the intermediate samples in the reverse process and use a weighted average of the classification-guided losses at multiple layers as the surrogate loss $\mathcal{L}$. However, we observe that the intermediate samples are noisy, and thus a classifier $F$ trained on clean data cannot provide effective gradients. One solution is to train a set of classifiers with different noise scales and apply them to the intermediate samples to impose classification-guided losses, but the training is too expensive considering the large diffusion length and the varying noise scales at different time steps.

Thus, we propose a deviated-reconstruction loss that addresses this challenge by adversarially imposing a discrepancy between samples of the diffusion and reverse processes, providing an effective loss at intermediate time steps. Concretely, since a sequence of samples is generated in the diffusion and reverse processes, an effective loss imposed on them relieves the problem of vanishing/exploding gradients and benefits the optimization. More formally, let $x_t$ and $\hat{x}_t$ be the samples at time step $t$ in the diffusion process and the reverse process, respectively. We maximize the deviated-reconstruction loss $\mathcal{L}_{dev}$ formulated as follows:

$$\max\ \mathcal{L}_{dev} = \mathbb{E}_t\big[\alpha(t)\,\mathbb{E}_{x_t, \hat{x}_t \mid \tilde{x}_0}\,d(x_t, \hat{x}_t)\big] \tag{8}$$

where $\alpha(\cdot)$ is a time-dependent weight coefficient and $d(x_t, \hat{x}_t)$ is the distance between the noisy image $x_t$ in the diffusion process and the corresponding sampled image $\hat{x}_t$ in the reverse process. The expectation over $t$ is approximated by averaging the results at uniformly sampled time steps in $[0, T]$, and the loss at shallow layers of the computational graph (i.e., large time steps $t$) helps relieve the problem of vanishing/exploding gradients. The conditional expectation over $x_t, \hat{x}_t$ given $\tilde{x}_0$ is approximated by purifying $\tilde{x}_0$ multiple times and averaging the loss.

Intuitively, the deviated-reconstruction loss in Equation (8) pushes the reconstructed sample $\hat{x}_t$ in the reverse process away from the sample $x_t$ at the corresponding time step in the diffusion process, and finally induces an inaccurate reconstruction of the clean image. Letting $q_t(x)$ and $\hat{q}_t(x)$ be the distributions of $x_t$ and $\hat{x}_t$ for the adversarial input, we can theoretically prove that the distribution distance between $q_t(x)$ and $\hat{q}_t(x)$ positively correlates with the score-matching loss of the score-based diffusion model or the density gradient loss of the DDPM. In other words, maximizing the deviated-reconstruction loss in Equation (8) induces inaccurate data density estimation, which results in a discrepancy between the sampled distribution and the clean training distribution.
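A minimal sketch of how the deviated-reconstruction loss of Equation (8) can be estimated in practice is shown below. It assumes the purification code exposes the diffused samples x_t and the reconstructed samples x̂_t at a set of time steps via an interface called `purify_with_trajectory`; this interface, the choice of a squared ℓ2 distance, and the constant weights α(t) = 1 are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch

def deviated_reconstruction_loss(x_adv, purify_with_trajectory, n_runs=2, n_steps=5):
    """Monte-Carlo estimate of L_dev in Eq. (8).

    purify_with_trajectory(x) is assumed to return (x_purified, traj), where
    traj is a list of (x_t, x_hat_t) pairs at increasing time steps t.
    """
    loss = 0.0
    for _ in range(n_runs):                               # approximate E over the randomness of P
        _, traj = purify_with_trajectory(x_adv)
        picked = traj[:: max(1, len(traj) // n_steps)]    # uniformly sampled time steps
        for x_t, x_hat_t in picked:
            # alpha(t) set to 1 here; d(., .) chosen as the squared l2 distance
            loss = loss + (x_t - x_hat_t).pow(2).flatten(1).sum(dim=1).mean()
    return loss / n_runs
```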
Theorem 1. Consider an adversarial sample $\tilde{x}_0 := x_0 + \delta$, where $x_0$ is the clean example and $\delta$ is the perturbation. Let $p_t(x)$, $\hat{p}_t(x)$, $q_t(x)$, $\hat{q}_t(x)$ be the distributions of $x_t$, $\hat{x}_t$, $\tilde{x}_t$, $\hat{\tilde{x}}_t$, respectively, where $\hat{x}_t$ ($\hat{\tilde{x}}_t$) represents the reconstruction of $x_t$ ($\tilde{x}_t$) in the reverse process, and let $D_{TV}(\cdot,\cdot)$ measure the total variation distance. Given a VP-SDE parameterized by $\beta(\cdot)$ and a score-based model $s_\theta$ with the mild assumptions that $\|\nabla_x \log p_t(x) - s_\theta(x, t)\|_2^2 \le L_u$, $D_{TV}(p_t, \hat{p}_t) \le \epsilon_{re}$, and a score function bounded by $M$ (specified in Appendix C.1), we have:

$$D_{TV}(q_t, \hat{q}_t) \le \sqrt{\frac{1}{4}\Big(\mathbb{E}_{t, x \mid \tilde{x}_0}\big\|s_\theta(x, t) - \nabla_x \log \hat{q}_t(x)\big\|_2^2 + C_1\Big)} + \sqrt{2 - 2\exp\{-C_2\|\delta\|_2^2\}} + \epsilon_{re} \tag{9}$$

where $C_1 = (L_u + 8M^2)\int_t^T \beta(s)\,ds$ and $C_2 = \big(8(1 - \Pi_{s=1}^{t}(1-\beta_s))\big)^{-1}$.

Proof sketch. We first use the triangle inequality to upper bound $D_{TV}(q_t, \hat{q}_t)$ with $D_{TV}(q_t, p_t) + D_{TV}(p_t, \hat{p}_t) + D_{TV}(\hat{p}_t, \hat{q}_t)$. $D_{TV}(q_t, p_t)$ can be upper bounded by a function of the Hellinger distance $H(q_t, p_t)$, which can be calculated explicitly. $D_{TV}(p_t, \hat{p}_t)$ is upper bounded by the reconstruction error $\epsilon_{re}$ by assumption. To upper bound $D_{TV}(\hat{p}_t, \hat{q}_t)$, we leverage Pinsker's inequality and instead upper bound the KL divergence between $\hat{p}_t$ and $\hat{q}_t$, which can be derived using the Fokker–Planck equation [44] in the reverse SDE.

Remark. A large deviated-reconstruction loss indicates a large total variation distance $D_{TV}(q_t, \hat{q}_t)$, which is a lower bound of a function of the score-matching loss $\mathbb{E}_{t,x}\|s_\theta(x, t) - \nabla_x \log \hat{q}_t(x)\|_2^2$ (on the RHS of Equation (9)). Therefore, maximizing the deviated-reconstruction loss implicitly maximizes the score-matching loss, and thus induces inaccurate data density estimation that enables an effective attack. The connection between the deviated-reconstruction loss and the density gradient loss for DDPM is provided in Thm. 3 in Appendix C.2.
3.3 Segment-wise forwarding-backwarding algorithm

Adaptive attacks against diffusion-based purification require gradient backpropagation through the forwarding path. For diffusion-based purification, the memory cost scales linearly with the diffusion length $T$, which is not feasible in realistic applications. Therefore, existing defenses either use a surrogate model for gradient approximation [55, 56, 60, 45] or consider adaptive attacks only for a small diffusion length [4], but the approximation can induce errors and significantly downgrade the attack performance. Recently, DiffPure [34] leverages the adjoint method [28] to backpropagate the gradient of the SDE within a reasonable memory cost and enables adaptive attacks against score-based purification. However, it cannot be applied to a discrete process, and memory-efficient gradient backpropagation remains unexplored for DDPM. Another line of research [9, 8, 19] proposes gradient checkpointing to perform gradient backpropagation with memory efficiency: fewer activations are stored during the forward pass, and local computational graphs are reconstructed via recomputation. We are the first to apply such memory-efficient backpropagation to attack diffusion purification defenses and resolve the memory cost problem during attacks, which was recognized as a challenging problem by prior attacks against purification defenses [34, 60].

Concretely, we propose a segment-wise forwarding-backwarding algorithm, which leads to memory-efficient computation of the gradient of the attack loss with respect to the adversarial examples. We first feed the input $x_0$ to the diffusion-based purification process and sequentially store the intermediate samples $x_1, x_2, \dots, x_T$ in the diffusion process and $\hat{x}_T, \hat{x}_{T-1}, \dots, \hat{x}_0$ in the reverse process. For ease of notation, we have $x_{t+1} = f_d(x_t)$ and $\hat{x}_t = f_r(\hat{x}_{t+1})$ for $t \in [0, T-1]$. Then we can backpropagate the gradient iteratively following:

$$\frac{\partial \mathcal{L}}{\partial \hat{x}_{t+1}} = \frac{\partial \mathcal{L}}{\partial \hat{x}_t}\,\frac{\partial \hat{x}_t}{\partial \hat{x}_{t+1}} = \frac{\partial \mathcal{L}}{\partial \hat{x}_t}\,\frac{\partial f_r(\hat{x}_{t+1})}{\partial \hat{x}_{t+1}} \tag{10}$$

At each time step $t$ in the reverse process, we only need to store the gradient $\partial\mathcal{L}/\partial\hat{x}_t$, the intermediate sample $\hat{x}_{t+1}$, and the model $f_r$ to construct the computational graph. When we backpropagate the gradients at the next time step $t+1$, the computational graph at time step $t$ is no longer reused, and thus we can release its memory. Therefore, only one segment of the computational graph is kept in memory for gradient backpropagation at each time step. We can similarly backpropagate the gradients in the diffusion process. Ignoring the memory cost of storing intermediate samples (usually small compared to the memory cost of the computational graphs), the memory cost of our segment-wise forwarding-backwarding algorithm is O(1) (validated in Figure 3). We summarize the detailed procedure in Algorithm 1 in Appendix B. It can be applied to gradient backpropagation through any discrete Markov process with a long path. Basically, we 1) perform the forward pass and store the intermediate samples, 2) allocate the memory for one segment of the computational graph and simulate the forward pass of the segment with the stored intermediate samples, 3) backpropagate the gradients through the segment and release its memory, and 4) go to step 2) and consider the next segment until termination.
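A minimal PyTorch sketch of steps 1)–4) for the reverse chain is shown below. It assumes one step per segment and that each reverse step $f_r$ reuses the noise drawn during the stored forward pass, so the recomputed step matches the stored trajectory; the variable names and segment granularity are illustrative assumptions, not the exact implementation of Algorithm 1.

```python
import torch

def segmentwise_backward(x_hats, f_r, grad_wrt_x_hat0):
    """Backpropagate dL/dx_hat_0 to dL/dx_hat_T through x_hat_t = f_r(x_hat_{t+1}, t).

    x_hats:          list [x_hat_0, x_hat_1, ..., x_hat_T] stored (detached) in the forward pass
    grad_wrt_x_hat0: dL/dx_hat_0 computed from the classifier / attack loss
    Only one single-step graph is alive at any time, so memory stays ~O(1) in T.
    """
    grad = grad_wrt_x_hat0
    T = len(x_hats) - 1
    for t in range(T):                                    # t = 0, ..., T-1
        x_in = x_hats[t + 1].detach().requires_grad_(True)
        with torch.enable_grad():
            # f_r is assumed to reuse the same noise as in the stored forward pass
            x_out = f_r(x_in, t)                          # rebuild the graph of one segment
        grad = torch.autograd.grad(x_out, x_in, grad_outputs=grad)[0]
        # the one-segment graph is released here before the next iteration
    return grad                                           # dL/dx_hat_T
```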
3.4 DiffAttack Technique

Currently, AutoAttack [12] is the state-of-the-art attack algorithm, but it fails to attack diffusion-based purification defenses due to the challenges of vanishing/exploding gradients, memory cost, and large randomness. To tackle these challenges specifically, we integrate the deviated-reconstruction loss (Section 3.2) and the segment-wise forwarding-backwarding algorithm (Section 3.3) into an attack technique, DiffAttack, against diffusion-based purification, including score-based and DDPM-based purification defenses. A pictorial illustration of DiffAttack is provided in Figure 1. Concretely, we maximize the surrogate loss $\mathcal{L}$ as the optimization objective in Equation (7):

$$\max\ \mathcal{L} = \mathcal{L}_{cls} + \lambda\,\mathcal{L}_{dev} \tag{11}$$

where $\mathcal{L}_{cls}$ is the CE loss or DLR loss, $\mathcal{L}_{dev}$ is the deviated-reconstruction loss formulated in Equation (8), and $\lambda$ is a weight coefficient. During the optimization, we use the segment-wise forwarding-backwarding algorithm for memory-efficient gradient backpropagation. Note that $\mathcal{L}_{dev}$ suffers less from the gradient problem than $\mathcal{L}_{cls}$, and thus the objective of $\mathcal{L}_{dev}$ can be optimized more precisely and stably, but it does not resolve the gradient problem of $\mathcal{L}_{cls}$. On the other hand, the optimization of $\mathcal{L}_{dev}$ benefits the optimization of $\mathcal{L}_{cls}$ in the sense that $\mathcal{L}_{dev}$ induces a deviated reconstruction of the image with a larger probability of misclassification. $\lambda$ controls the balance of the two objectives: a small $\lambda$ weakens the deviated-reconstruction objective and makes the attack suffer more from the vanishing/exploding gradient problem, while a large $\lambda$ downplays the guidance of the classification loss and confuses the direction towards the decision boundary of the classifier.

Attack against randomized diffusion-based purification. DiffAttack tackles the randomness problem from two perspectives: 1) sampling the diffused and reconstructed samples across different time steps multiple times as in Equation (8) (similar to EOT [3]), and 2) optimizing the perturbations of all samples, including misclassified ones, at every step. Perspective 1) provides a more accurate estimation of the gradients against the sample variance of the diffusion process. Perspective 2) ensures a more effective and stable attack optimization, since the correctness of classification has high variance over different steps in the diffusion purification setting. Formally, the classification result of a sample can be viewed as a Bernoulli random variable (i.e., correct or false). We aim to reduce the success rate of this Bernoulli variable by continuing to optimize each sample towards a larger attack loss, which leads to lower robust accuracy. In other words, one observed misclassification does not indicate that the sample has a statistically low success rate, and thus perspective 2) helps to continue optimizing the perturbations towards a lower success rate (i.e., away from the decision boundary). We provide the pseudo-code of DiffAttack in Algorithm 2 in Appendix D.1.
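The sketch below combines the surrogate objective of Equation (11) with the two randomness-handling strategies: averaging over several purification runs (perspective 1) and keeping every sample in the batch under optimization via a per-sample loss (perspective 2). The `purify_with_trajectory` interface is the same assumed interface as in the earlier sketch, and the use of plain cross-entropy for L_cls is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def diffattack_objective(x_adv, y, purify_with_trajectory, classifier,
                         lam=1.0, n_eot=4):
    """Surrogate loss L = L_cls + lambda * L_dev (Eq. (11)), averaged over n_eot runs."""
    total = 0.0
    for _ in range(n_eot):                       # EOT-style averaging over randomness
        x_pur, traj = purify_with_trajectory(x_adv)
        logits = classifier(x_pur)
        l_cls = F.cross_entropy(logits, y, reduction="none")        # per-sample loss
        l_dev = sum((xt - xh).pow(2).flatten(1).sum(1) for xt, xh in traj)
        # keep optimizing every sample, including currently misclassified ones
        total = total + (l_cls + lam * l_dev).mean()
    return total / n_eot
```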
4 Experimental Results

In this section, we evaluate DiffAttack empirically from various perspectives. In summary, we find that 1) DiffAttack significantly outperforms other SOTA attack methods against diffusion-based defenses on both score-based and DDPM-based purification models, especially under large perturbation radii (Sections 4.2 and 4.3); 2) DiffAttack outperforms other strong attack methods such as black-box attacks and adaptive attacks against other adversarial purification defenses (Section 4.4); 3) a moderate diffusion length T benefits the model robustness, since a too long/short diffusion length hurts the robustness (Section 4.5); 4) our proposed segment-wise forwarding-backwarding algorithm achieves O(1) memory cost and outperforms other baselines by a large margin (Section 4.6); and 5) attacks with the deviated-reconstruction loss added over uniformly sampled time steps outperform those added over only initial/final time steps (Section 4.7).

4.1 Experiment Setting

Dataset & model. We validate DiffAttack on CIFAR-10 [27] and ImageNet [13]. We consider different network architectures for classification. In particular, WideResNet-28-10 and WideResNet-70-16 [61] are used on CIFAR-10, and ResNet-50 [23], WideResNet-50-2 (WRN-50-2), and ViT (DeiT-S) [16] are used on ImageNet. We use a pretrained score-based diffusion model [49] and DDPM [25] to purify images following [34, 4].

Evaluation metric. The performance of attacks is evaluated using the robust accuracy (Rob-Acc), which measures the ratio of correctly classified instances over the total number of test samples under certain perturbation constraints. Following the literature [12], we consider both ℓ∞ and ℓ2 attacks under multiple perturbation constraints ϵ. We also report the clean accuracy (Cl-Acc) for different approaches.

Baselines. To demonstrate the effectiveness of DiffAttack, we compare it with 1) the SOTA attack against score-based diffusion, the adjoint attack (AdjAttack) [34]; 2) the SOTA attack against DDPM-based diffusion, the Diff-BPDA attack [4]; 3) SOTA black-box attacks, SPSA [53] and the square attack [1]; and 4) the joint attack [60], a specific attack against EBM-based purification. We defer more explanations of the baselines and experimental details to Appendix D.2. The code is publicly available at https://github.com/kangmintong/DiffAttack.

Table 1: Attack performance (Rob-Acc (%)) of DiffAttack and AdjAttack [34] against score-based purification on CIFAR-10.

| Model | T | Cl-Acc | ℓp attack | ϵ | Method | Rob-Acc | Diff. |
|---|---|---|---|---|---|---|---|
| WideResNet-28-10 | 0.1 | 89.02 | ℓ∞ | 8/255 | AdjAttack | 70.64 | |
| | | | | | DiffAttack | 46.88 | -23.76 |
| | | | ℓ∞ | 4/255 | AdjAttack | 82.81 | |
| | | | | | DiffAttack | 71.88 | -10.93 |
| | 0.075 | 91.03 | ℓ2 | 0.5 | AdjAttack | 78.58 | |
| | | | | | DiffAttack | 64.06 | -14.52 |
| WideResNet-70-16 | 0.1 | 90.07 | ℓ∞ | 8/255 | AdjAttack | 71.29 | |
| | | | | | DiffAttack | 45.31 | -25.98 |
| | | | ℓ∞ | 4/255 | AdjAttack | 81.25 | |
| | | | | | DiffAttack | 75.00 | -6.25 |
| | 0.075 | 92.68 | ℓ2 | 0.5 | AdjAttack | 80.60 | |
| | | | | | DiffAttack | 70.31 | -10.29 |

Table 3: Attack performance (Rob-Acc (%)) of DiffAttack and Diff-BPDA [4] against DDPM-based purification on CIFAR-10.

| Architecture | T | Cl-Acc | ℓp attack | ϵ | Method | Rob-Acc | Diff. |
|---|---|---|---|---|---|---|---|
| WideResNet-28-10 | 100 | 87.50 | ℓ∞ | 8/255 | Diff-BPDA | 75.00 | |
| | | | | | DiffAttack | 54.69 | -20.31 |
| | | | ℓ∞ | 4/255 | Diff-BPDA | 76.56 | |
| | | | | | DiffAttack | 63.28 | -13.28 |
| | 75 | 90.62 | ℓ2 | 0.5 | Diff-BPDA | 76.56 | |
| | | | | | DiffAttack | 67.97 | -8.59 |
| WideResNet-70-16 | 100 | 91.21 | ℓ∞ | 8/255 | Diff-BPDA | 74.22 | |
| | | | | | DiffAttack | 59.38 | -14.84 |
| | | | ℓ∞ | 4/255 | Diff-BPDA | 75.78 | |
| | | | | | DiffAttack | 67.19 | -8.59 |
| | 75 | 92.19 | ℓ2 | 0.5 | Diff-BPDA | 81.25 | |
| | | | | | DiffAttack | 71.88 | -9.37 |

4.2 Attack against score-based purification

Table 2: Attack performance of DiffAttack and AdjAttack [34] against score-based adversarial purification with diffusion length T = 0.015 on ImageNet under the ℓ∞ attack (ϵ = 4/255).

| Model | Cl-Acc | Method | Rob-Acc | Diff. |
|---|---|---|---|---|
| ResNet-50 | 67.79 | AdjAttack | 40.93 | |
| | | DiffAttack | 28.13 | -12.80 |
| WRN-50-2 | 71.16 | AdjAttack | 44.39 | |
| | | DiffAttack | 31.25 | -13.14 |
| DeiT-S | 73.63 | AdjAttack | 43.18 | |
| | | DiffAttack | 32.81 | -10.37 |

DiffPure [34] presents the state-of-the-art adversarial purification performance using score-based diffusion models [49]. It proposes a strong adaptive attack (AdjAttack) that uses the adjoint method [28] to efficiently backpropagate the gradients through the reverse SDE. Therefore, we select AdjAttack as the strong baseline and compare DiffAttack with it. The results on CIFAR-10 in Table 1 show that DiffAttack achieves much lower robust accuracy than AdjAttack under different types of attacks (ℓ∞ and ℓ2) with multiple perturbation constraints ϵ. Concretely, DiffAttack decreases the robust accuracy of models by over 20% under the ℓ∞ attack with ϵ = 8/255 (70.64% → 46.88% on WideResNet-28-10 and 71.29% → 45.31% on WideResNet-70-16). The effectiveness of DiffAttack also generalizes well to the large-scale ImageNet dataset, as shown in Table 2. Note that the state-of-the-art non-diffusion-based purification defenses [38, 21] achieve about 65% robust accuracy on CIFAR-10 with WideResNet-28-10 under the ℓ∞ attack (ϵ = 8/255), while the performance of score-based purification under AdjAttack in the same setting is 70.64%. However, under the stronger DiffAttack, the robust accuracy of score-based purification drops to 46.88%. This motivates more effective techniques to further improve the robustness of diffusion-based purification in future work.

4.3 Attack against DDPM-based purification

Another line of diffusion-based purification defenses [4, 55, 56] leverages DDPM [46] to purify images with intentionally crafted perturbations. Since backpropagating the gradients along the diffusion and sampling processes with a relatively large diffusion length is unrealistic due to the large memory cost, the BPDA attack [2] has been adopted as the strong attack against DDPM-based purification.
However, with our proposed segment-wise forwarding-backwarding algorithm, we can compute the gradients within a small memory budget, and to the best of our knowledge, this is the first work to achieve adaptive gradient-based adversarial attacks against DDPM-based purification. We compare DiffAttack with the Diff-BPDA attack [4] on CIFAR-10, and the results in Table 3 demonstrate that DiffAttack outperforms the baseline by a large margin under both ℓ∞ and ℓ2 attacks.

4.4 Comparison with other adaptive attack methods

Table 4: Robust accuracy (%) of DiffAttack compared with other attack methods on CIFAR-10 with WideResNet-28-10 under the ℓ∞ attack (ϵ = 8/255).

| Method | Score-based | DDPM-based |
|---|---|---|
| SPSA | 83.37 | 81.29 |
| Square Attack | 82.81 | 81.68 |
| Joint Attack (Score) | 72.74 | - |
| Joint Attack (Full) | 77.83 | 76.26 |
| Diff-BPDA | 78.13 | 75.00 |
| AdjAttack | 70.64 | - |
| DiffAttack | 46.88 | 54.69 |

Besides the AdjAttack and Diff-BPDA attacks against existing diffusion-based purification defenses, we also compare DiffAttack with other general types of adaptive attacks: 1) the black-box SPSA attack [53], 2) the square attack [1], and 3) the joint attack (Score / Full) [60], an adaptive attack against score-based generative models. The SPSA attack approximates the gradients by randomly sampling from a pre-defined distribution and using the finite-difference method. The square attack heuristically searches for adversarial examples in a low-dimensional space with constraints on the perturbation patterns. The joint attack (Score) updates the input by the average of the classifier gradient and the output of the score estimation network, while the joint attack (Full) leverages the classifier gradients and the difference between the input and the purified samples. The results in Table 4 show that DiffAttack outperforms SPSA, the square attack, and the joint attack by a large margin on both score-based and DDPM-based purification defenses. Note that the joint attack (Score) cannot be applied to the DDPM-based pipeline due to the lack of a score estimator, and AdjAttack fails on the DDPM-based pipeline since it can only calculate gradients through an SDE.

4.5 Robustness with different diffusion lengths

Figure 2: The clean/robust accuracy (%) of diffusion-based purification with different diffusion lengths T under DiffAttack on CIFAR-10 with WideResNet-28-10 under the ℓ∞ attack (ϵ = 8/255). (a) Score-based purification; (b) DDPM-based purification.

We observe that the diffusion length plays an extremely important role in the effectiveness of adversarial purification. Existing DDPM-based purification works [56, 55] prefer a small diffusion length, but we find it vulnerable under our DiffAttack. The influence of the diffusion length T on the performance (clean/robust accuracy) of the purification defenses is illustrated in Figure 2. We observe that 1) the clean accuracy of the purification defenses negatively correlates with the diffusion length, since a longer diffusion process adds more noise to the input and induces an inaccurate reconstruction of the input sample; and 2) a moderate diffusion length benefits the robust accuracy, since diffusion-based purification with a small length makes it easier to compute the gradients for attacks, while models with a large diffusion length have poor clean accuracy, which deteriorates the robust accuracy. We also validate this conclusion on ImageNet in Appendix D.3.
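The clean-accuracy trend in observation 1) can be made concrete with the forward-diffusion formula $x_T = \sqrt{\bar\alpha_T}\,x_0 + \sqrt{1-\bar\alpha_T}\,\epsilon$: the weight on the clean signal decays as the diffusion length grows. The short sketch below, with an assumed linear β schedule (a common DDPM default, not necessarily the schedule used by the evaluated defenses), simply prints this signal weight for a few illustrative lengths.

```python
import torch

# assumed linear beta schedule with 1000 total steps
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

for T in [20, 100, 180, 300]:                        # illustrative DDPM diffusion lengths
    signal = alpha_bars[T - 1].sqrt().item()         # weight of x_0 in x_T
    noise = (1.0 - alpha_bars[T - 1]).sqrt().item()  # weight of the Gaussian noise
    print(f"T={T:4d}  sqrt(alpha_bar_T)={signal:.3f}  sqrt(1-alpha_bar_T)={noise:.3f}")
```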
4.6 Comparison of memory cost

Figure 3: Comparison of the memory cost (MB) of gradient backpropagation between [4] and DiffAttack with batch size 16 on CIFAR-10 with WideResNet-28-10 under the ℓ∞ attack, for diffusion lengths T ∈ {3, 5, 10, 15, 20, 30, 1000}.

Recent work [4] computes the gradients of the diffusion and sampling processes to perform a gradient-based attack, but it only considers a small diffusion length (e.g., 14 on CIFAR-10). It constructs the computational graph once and for all, which is extremely memory-expensive for a large diffusion length. We use the segment-wise forwarding-backwarding algorithm in Section 3.3 to avoid allocating memory for the whole computational graph. Here, we validate the memory efficiency of our approach compared to [4]. The results in Figure 3 demonstrate that 1) the gradient backpropagation of [4] has a memory cost that scales linearly with the diffusion length and does not scale beyond a diffusion length of 30, while 2) DiffAttack has an almost constant memory cost and is able to scale up to an extremely large diffusion length (T = 1000). The evaluation is done on an RTX A6000 GPU. In Appendix D.3, we compare the runtime of DiffAttack and [4] and demonstrate that DiffAttack reduces the memory cost with comparable runtime.

4.7 Influence of applying the deviated-reconstruction loss at different time steps

Figure 4: The impact of applying Ldev at different time steps ((0, T/3), (2T/3, T), and Uni(0, T)) on the decreased robust accuracy (%). T is the diffusion length and Uni(0, T) represents uniform sampling.

We also show that the time steps at which we apply the deviated-reconstruction loss influence the effectiveness of DiffAttack. Intuitively, the loss added at small time steps does not suffer from vanishing/exploding gradients but lacks supervision at subsequent time steps, while the loss added at large time steps gains strong supervision but suffers from the gradient problem. The results in Figure 4 show that adding the deviated-reconstruction loss at uniformly sampled time steps (Uni(0, T)) achieves the best attack performance and tradeoff, compared with adding the loss to the same number of time steps only at the initial stage ((0, T/3)) or the final stage ((2T/3, T)). For a fair comparison, we uniformly sample T/3 time steps (identical in number to the partial-stage guidance (0, T/3) and (2T/3, T)) to impose Ldev.

5 Related Work

Adversarial purification methods purify the adversarial input with generative models before classification. Defense-GAN [43] trains a GAN to restore clean samples. PixelDefend [48] utilizes an autoregressive model to purify adversarial examples. Another line of research [50, 22, 17, 24, 60] leverages energy-based models (EBM) and Markov chain Monte Carlo (MCMC) to perform the purification. More recently, diffusion models have seen wide success in image generation [15, 40, 41, 42, 31, 39]. They have also been used for adversarial purification [34, 4, 62, 57, 51, 55, 56] and demonstrated to achieve state-of-the-art robustness. In this work, we propose DiffAttack specifically against diffusion-based purification and show its effectiveness in different settings, which motivates future work to improve the robustness of this pipeline.

Adversarial attacks search for visually imperceptible signals that can significantly perturb the prediction of models [52, 20].
Different kinds of defense methods have been progressively broken by advanced attack techniques, including white-box attacks [6, 2, 32] and black-box attacks [1, 53, 35]. [11, 12, 37, 59, 29] propose systematic and automatic frameworks to attack existing defense methods. Despite breaking most defense methods, these approaches are shown to be ineffective against the diffusion-based purification pipeline due to the problems of vanishing/exploding gradients, memory cost, and randomness. Therefore, we propose DiffAttack to specifically tackle these challenges and successfully attack diffusion-based purification defenses.

6 Conclusion

In this paper, we propose DiffAttack, which consists of a deviated-reconstruction loss added on intermediate samples and a segment-wise forwarding-backwarding algorithm. We empirically demonstrate that DiffAttack outperforms existing adaptive attacks against diffusion-based purification by a large margin. We conduct a series of ablation studies and show that a moderate diffusion length benefits the model robustness, and that attacks with the deviated-reconstruction loss added over uniformly sampled time steps outperform those added over only initial/final time steps, which will help to better understand the properties of the diffusion process and motivate future defenses.

Acknowledgement. This work is partially supported by the National Science Foundation under grant No. 1910100, No. 2046726, No. 2229876, DARPA GARD, the National Aeronautics and Space Administration (NASA) under grant No. 80NSSC20M0229, the Alfred P. Sloan Fellowship, and the Amazon research award.

References

[1] Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, and Matthias Hein. Square attack: a query-efficient black-box adversarial attack via random search. In European Conference on Computer Vision, pages 484–501. Springer, 2020.

[2] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, pages 274–283. PMLR, 2018.

[3] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In International Conference on Machine Learning, pages 284–293. PMLR, 2018.

[4] Tsachi Blau, Roy Ganz, Bahjat Kawar, Alex Bronstein, and Michael Elad. Threat model-agnostic adversarial defense using diffusion models. arXiv preprint arXiv:2207.08089, 2022.

[5] Yulong Cao, Ningfei Wang, Chaowei Xiao, Dawei Yang, Jin Fang, Ruigang Yang, Qi Alfred Chen, Mingyan Liu, and Bo Li. Invisible for both camera and lidar: Security of multi-sensor fusion based perception in autonomous driving under physical-world attacks. In 2021 IEEE Symposium on Security and Privacy (SP), pages 176–194. IEEE, 2021.

[6] Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3–14, 2017.

[7] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.

[8] Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible architectures for arbitrarily deep residual neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

[9] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
[10] Zhaoyu Chen, Bo Li, Shuang Wu, Kaixun Jiang, Shouhong Ding, and Wenqiang Zhang. Content-based unrestricted adversarial attack. arXiv preprint arXiv:2305.10665, 2023.

[11] Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. RobustBench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670, 2020.

[12] Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International Conference on Machine Learning, pages 2206–2216. PMLR, 2020.

[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[14] Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional gaussians. arXiv preprint arXiv:1810.08693, 2018.

[15] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.

[16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[17] Yilun Du and Igor Mordatch. Implicit generation and modeling with energy based models. Advances in Neural Information Processing Systems, 32, 2019.

[18] Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1625–1634, 2018.

[19] Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. Advances in Neural Information Processing Systems, 30, 2017.

[20] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[21] Sven Gowal, Sylvestre-Alvise Rebuffi, Olivia Wiles, Florian Stimberg, Dan Andrei Calian, and Timothy A Mann. Improving robustness using generated data. Advances in Neural Information Processing Systems, 34:4218–4233, 2021.

[22] Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263, 2019.

[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[24] Mitch Hill, Jonathan Mitchell, and Song-Chun Zhu. Stochastic security: Adversarial defense using long-run dynamics of energy-based models. arXiv preprint arXiv:2005.13525, 2020.

[25] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

[26] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.

[27] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[28] Xuechen Li, Ting-Kam Leonard Wong, Ricky T. Q. Chen, and David Duvenaud. Scalable gradients for stochastic differential equations. In International Conference on Artificial Intelligence and Statistics, pages 3870–3882. PMLR, 2020.

[29] Xiang Ling, Shouling Ji, Jiaxu Zou, Jiannan Wang, Chunming Wu, Bo Li, and Ting Wang. DeepSec: A uniform platform for security analysis of deep learning model. In 2019 IEEE Symposium on Security and Privacy (SP), pages 673–690. IEEE, 2019.

[30] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

[31] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.

[32] Marius Mosbach, Maksym Andriushchenko, Thomas Trost, Matthias Hein, and Dietrich Klakow. Logit pairing methods can fool gradient-based attacks. arXiv preprint arXiv:1810.12042, 2018.

[33] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.

[34] Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Anima Anandkumar. Diffusion models for adversarial purification. In International Conference on Machine Learning (ICML), 2022.

[35] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519, 2017.

[36] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

[37] Maura Pintor, Luca Demetrio, Angelo Sotgiu, Ambra Demontis, Nicholas Carlini, Battista Biggio, and Fabio Roli. Indicators of attack failure: Debugging and improving optimization of adversarial examples. arXiv preprint arXiv:2106.09947, 2021.

[38] Sylvestre-Alvise Rebuffi, Sven Gowal, Dan A Calian, Florian Stimberg, Olivia Wiles, and Timothy Mann. Fixing data augmentation to improve adversarial robustness. arXiv preprint arXiv:2103.01946, 2021.

[39] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

[40] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.

[41] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.

[42] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[43] Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605, 2018.

[44] Simo Särkkä and Arno Solin. Applied stochastic differential equations, volume 10. Cambridge University Press, 2019.

[45] Changhao Shi, Chester Holtz, and Gal Mishne. Online adversarial purification based on self-supervision. arXiv preprint arXiv:2101.09387, 2021.

[46] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.

[47] Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pages 574–584. PMLR, 2020.

[48] Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766, 2017.

[49] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.

[50] Vignesh Srinivasan, Csaba Rohrer, Arturo Marban, Klaus-Robert Müller, Wojciech Samek, and Shinichi Nakajima. Robustifying models against adversarial attacks by langevin dynamics. Neural Networks, 137:1–17, 2021.

[51] Jiachen Sun, Weili Nie, Zhiding Yu, Z Morley Mao, and Chaowei Xiao. PointDP: Diffusion-driven purification against adversarial attacks on 3D point cloud recognition. arXiv preprint arXiv:2208.09801, 2022.

[52] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[53] Jonathan Uesato, Brendan O'Donoghue, Pushmeet Kohli, and Aäron van den Oord. Adversarial risk and the dangers of evaluating against weak attacks. In International Conference on Machine Learning, 2018.

[54] Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. arXiv preprint arXiv:2306.11698, 2023.

[55] Jinyi Wang, Zhaoyang Lyu, Dahua Lin, Bo Dai, and Hongfei Fu. Guided diffusion model for adversarial purification. arXiv preprint arXiv:2205.14969, 2022.

[56] Quanlin Wu, Hang Ye, and Yuntian Gu. Guided diffusion model for adversarial purification from random noise. arXiv preprint arXiv:2206.10875, 2022.

[57] Chaowei Xiao, Zhongzhu Chen, Kun Jin, Jiongxiao Wang, Weili Nie, Mingyan Liu, Anima Anandkumar, Bo Li, and Dawn Song. DensePure: Understanding diffusion models towards adversarial robustness. arXiv preprint arXiv:2211.00322, 2022.

[58] Haotian Xue, Alexandre Araujo, Bin Hu, and Yongxin Chen. Diffusion-based adversarial sample generation for improved stealthiness and controllability. arXiv preprint arXiv:2305.16494, 2023.

[59] Chengyuan Yao, Pavol Bielik, Petar Tsankov, and Martin Vechev. Automated discovery of adaptive attacks on adversarial defenses. Advances in Neural Information Processing Systems, 34:26858–26870, 2021.

[60] Jongmin Yoon, Sung Ju Hwang, and Juho Lee. Adversarial purification with score-based generative models. In International Conference on Machine Learning, pages 12062–12072. PMLR, 2021.
[61] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

[62] Kui Zhang, Hang Zhou, Jie Zhang, Qidong Huang, Weiming Zhang, and Nenghai Yu. Ada3Diff: Defending against 3D adversarial point clouds via adaptive diffusion. arXiv preprint arXiv:2211.16247, 2022.

A Broader Impact and Limitations

Broader impact. As an effective and popular way to explore the vulnerabilities of ML models, adversarial attacks have been widely studied. However, recent diffusion-based purification has been shown to be hard to attack in several attempts, which raises the interesting question of whether it can be attacked at all. Our paper provides the first effective attack against such defenses, identifying the vulnerability of diffusion-based purification for the community and inspiring more effective defense approaches. In particular, we propose an effective evasion attack against diffusion-based purification defenses which consists of a deviated-reconstruction loss at intermediate diffusion steps to induce inaccurate density gradient estimation and a segment-wise forwarding-backwarding algorithm to achieve memory-efficient gradient backpropagation. The effectiveness of the deviated-reconstruction loss helps us to better understand the properties of diffusion purification. Concretely, there exist adversarial regions in the intermediate sample space where the score approximation model outputs inaccurate density gradients and finally misleads the prediction. This observation motivates us to design a more robust sampling process in the future, and one potential way is to train a more robust score-based model. Furthermore, the segment-wise forwarding-backwarding algorithm tackles the memory issue of gradient propagation through a long path; it can be applied to the gradient calculation of any discrete Markov process within an almost constant memory cost. To conclude, our attack motivates us to rethink the robustness of a line of SOTA diffusion-based purification defenses and inspires more effective defenses.

Limitations. In this paper, we propose an effective attack algorithm, DiffAttack, against diffusion-based purification defenses. A possible negative societal impact is the usage of DiffAttack in safety-critical scenarios such as autonomous driving and medical imaging analysis to mislead the predictions of machine learning models. However, the foundation of DiffAttack and the important findings about the properties of the diffusion process can benefit our understanding of the vulnerabilities of diffusion-based purification defenses and therefore motivate more effective defenses in the future. Concretely, the effectiveness of DiffAttack indicates that there exist adversarial regions in the intermediate sample space where the score approximation model outputs inaccurate density gradients and finally misleads the prediction. This observation motivates us to design a more robust sampling process in the future, and one potential way is to train a more robust score-based model. Furthermore, to control a robust sampling process, it is better to provide guidance across uniformly sampled time steps rather than only at the final stage, according to our findings.
Algorithm 1 Segment-wise forwarding-backwarding algorithm (PyTorch-like pseudo-code)

1: Input: $f_r$, $f_d$, $\partial\mathcal{L}/\partial\hat{x}_0$, $x_i$, $\hat{x}_i$ ($i \in [T]$)
2: Output: $\partial\mathcal{L}/\partial x_0$
3: for $t = 1$ to $T$ do
4:   Create_Graph($f_r(\hat{x}_t) \to \hat{x}_{t-1}$)
5:   $\mathcal{L}' \leftarrow (\partial\mathcal{L}/\partial\hat{x}_{t-1}) \cdot f_r(\hat{x}_t)$
6:   $\partial\mathcal{L}/\partial\hat{x}_t \leftarrow$ auto_grad($\mathcal{L}'$, $\hat{x}_t$)
7:   Release_Graph($f_r(\hat{x}_t) \to \hat{x}_{t-1}$)
8: end for
9: $\partial\mathcal{L}/\partial x_T \leftarrow \partial\mathcal{L}/\partial\hat{x}_T$
10: for $t = T-1$ down to $0$ do
11:   Create_Graph($f_d(x_t) \to x_{t+1}$)
12:   $\mathcal{L}' \leftarrow (\partial\mathcal{L}/\partial x_{t+1}) \cdot f_d(x_t)$
13:   $\partial\mathcal{L}/\partial x_t \leftarrow$ auto_grad($\mathcal{L}'$, $x_t$)
14:   Release_Graph($f_d(x_t) \to x_{t+1}$)
15: end for

B Efficient Gradient Backpropagation

In this section, we provide the PyTorch-like pseudo-code of the segment-wise forwarding-backwarding algorithm. At each time step $t$ in the reverse process, we only need to store the gradient $\partial\mathcal{L}/\partial\hat{x}_t$, the intermediate sample $\hat{x}_{t+1}$, and the model $f_r$ to construct the computational graph. When we backpropagate the gradients at the next time step $t+1$, the computational graph at time step $t$ is no longer reused, and thus we can release its memory. Therefore, only one segment of the computational graph is kept in memory for gradient backpropagation at each time step. We can similarly backpropagate the gradients in the diffusion process.
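For reference, a similar memory/compute trade-off can also be obtained with PyTorch's built-in gradient checkpointing. The sketch below wraps each reverse step with `torch.utils.checkpoint.checkpoint`; this is an alternative realization of the same idea, not the exact procedure of Algorithm 1, and the signature of `f_r` is an assumption.

```python
import torch
from torch.utils.checkpoint import checkpoint

def purify_reverse_checkpointed(x_hat_T, f_r, T):
    """Run the reverse chain x_hat_t = f_r(x_hat_{t+1}, t) with per-step checkpointing.

    Activations inside each step are recomputed during the backward pass instead of
    being stored, so peak memory does not grow with the diffusion length T.
    """
    x = x_hat_T
    for t in reversed(range(T)):            # t = T-1, ..., 0
        # use_reentrant=False is the recommended mode in recent PyTorch versions
        x = checkpoint(f_r, x, t, use_reentrant=False)
    return x
```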
Then, according to Theorem A.1 in [34], under the assumption that $\hat p_t(x)$ and $\hat q_t(x)$ are smooth and fast-decaying (i.e., $\lim_{x_i \to \infty} \hat p_t(x)\,\partial \log \hat p_t(x)/\partial x_i = 0$ and $\lim_{x_i \to \infty} \hat q_t(x)\,\partial \log \hat q_t(x)/\partial x_i = 0$), we have:
$$\frac{\partial D_{KL}(\hat p_t, \hat q_t)}{\partial t} = -\int \hat p_t(x)\,[h_p(x, t) - h_q(x, t)]^T\,[\nabla_x \log \hat p_t(x) - \nabla_x \log \hat q_t(x)]\,dx \quad (20)$$
Plugging in Equations (18) and (19), we have:
$$\frac{\partial D_{KL}(\hat p_t, \hat q_t)}{\partial t} = -\int \hat p_t(x)\Big(\tfrac{1}{2}g_{rev}^2(t)\big\|\nabla_x \log \hat p_t(x) - \nabla_x \log \hat q_t(x)\big\|_2^2 + \beta(t)\,[\nabla_x \log p_t(x) - \nabla_x \log q_t(x)]^T[\nabla_x \log \hat p_t(x) - \nabla_x \log \hat q_t(x)]\Big)dx \quad (21)$$
Finally, we can derive as follows:
$$D_{KL}(\hat p_t, \hat q_t) = \int_t^T \int \hat p_s(x)\Big(\tfrac{1}{2}g_{rev}^2(s)\big\|\nabla_x \log \hat p_s(x) - \nabla_x \log \hat q_s(x)\big\|_2^2 \quad (22)$$
$$+ \beta(s)\,[\nabla_x \log p_s(x) - \nabla_x \log q_s(x)]^T[\nabla_x \log \hat p_s(x) - \nabla_x \log \hat q_s(x)]\Big)dx\,ds \quad (23)$$
$$\le \int_t^T \Big(\tfrac{1}{2}g_{rev}^2(s)\,\mathbb{E}_{x|x_0}\big\|\nabla_x \log \hat p_s(x) - \nabla_x \log \hat q_s(x)\big\|_2^2 + 4\beta(s)M^2\Big)ds \quad (24)$$
$$= \frac{1}{2}\int_t^T \beta(s)\,\mathbb{E}_{x|x_0}\big\|\nabla_x \log \hat p_s(x) - \nabla_x \log \hat q_s(x)\big\|_2^2\,ds + 4M^2\int_t^T \beta(s)\,ds \quad (25)$$

Theorem 2 (Thm. 1 in the main text). Consider an adversarial sample $\tilde x_0 := x_0 + \delta$, where $x_0$ is the clean example and $\delta$ is the perturbation. $p_t(x)$, $\hat p_t(x)$, $q_t(x)$, $\hat q_t(x)$ are the distributions of $x_t$, $\hat x_t$, $\tilde x_t$, $\hat{\tilde x}_t$, where $\hat x_t$ represents the reconstruction of $x_t$ in the reverse process. $D_{TV}(\cdot, \cdot)$ measures the total variation distance. Given a VP-SDE parameterized by $\beta(\cdot)$ and the score-based model $s_\theta$ with the mild assumptions that $\|\nabla_x \log p_t(x) - s_\theta(x, t)\|_2^2 \le L_u$, $D_{TV}(p_t, \hat p_t) \le \epsilon_{re}$, and a score function bounded by $M$ (specified with details in Appendix C.1), we have:
$$D_{TV}(q_t, \hat q_t) \le \frac{1}{2}\sqrt{\mathbb{E}_{t, x|x_0}\big\|s_\theta(x, t) - \nabla_x \log \hat q_t(x)\big\|_2^2 + C_1} + \sqrt{2 - 2\exp\{-C_2\|\delta\|_2^2\}} + \epsilon_{re} \quad (26)$$
where $C_1 = (L_u + 8M^2)\int_t^T \beta(t)\,dt$ and $C_2 = \frac{1}{8(1 - \prod_{s=1}^t(1 - \beta_s))}$.

Proof. Since we consider the VP-SDE here, we have:
$$f(x, t) = -\tfrac{1}{2}\beta(t)x, \quad g(t) = \sqrt{\beta(t)} \quad (27)$$
$$f_{rev}(x, t) = -\tfrac{1}{2}\beta(t)x - \beta(t)\nabla_x \log p_t(x), \quad g_{rev}(t) = \sqrt{\beta(t)} \quad (28)$$
Using the triangle inequality, the total variation distance between $q_t$ and $\hat q_t$ can be decomposed as:
$$D_{TV}(q_t, \hat q_t) \le D_{TV}(q_t, p_t) + D_{TV}(p_t, \hat p_t) + D_{TV}(\hat q_t, \hat p_t) \quad (29)$$
According to Assumption C.1, we have $D_{TV}(p_t, \hat p_t) \le \epsilon_{re}$; thus, we only need to upper bound $D_{TV}(q_t, p_t)$ and $D_{TV}(\hat q_t, \hat p_t)$, respectively. Using the notation $\alpha_t := 1 - \beta(t)$ and $\bar\alpha_t := \prod_{s=1}^t \alpha_s$, we have:
$$x_t \sim p_t := \mathcal N\big(x_t;\, \sqrt{\bar\alpha_t}\,x_0,\, (1 - \bar\alpha_t)I\big), \quad \tilde x_t \sim q_t := \mathcal N\big(\tilde x_t;\, \sqrt{\bar\alpha_t}\,\tilde x_0,\, (1 - \bar\alpha_t)I\big) \quad (30)$$
Therefore, we can upper bound the total variation distance between $q_t$ and $p_t$ as follows:
$$D_{TV}(q_t, p_t) \overset{(a)}{\le} \sqrt{2}\,H(x_t, \tilde x_t) \quad (31)$$
$$\overset{(b)}{\le} \sqrt{2}\sqrt{1 - \exp\Big\{-\frac{1}{8(1 - \bar\alpha_t)}\delta^T\delta\Big\}} \quad (32)$$
$$= \sqrt{2 - 2\exp\Big\{-\frac{1}{8(1 - \bar\alpha_t)}\|\delta\|_2^2\Big\}} \quad (33)$$
where we leverage the inequality between the Hellinger distance $H(\cdot, \cdot)$ and the total variation distance in Equation (31)(a), and in Equation (32)(b) we plug in the closed form of the Hellinger distance between two Gaussian distributions [14] parameterized by $(\mu_1, \Sigma_1)$ and $(\mu_2, \Sigma_2)$ (together with $\bar\alpha_t \le 1$):
$$H^2\big(\mathcal N(\mu_1, \Sigma_1), \mathcal N(\mu_2, \Sigma_2)\big) = 1 - \frac{\det(\Sigma_1)^{1/4}\det(\Sigma_2)^{1/4}}{\det\big(\frac{\Sigma_1 + \Sigma_2}{2}\big)^{1/2}}\exp\Big\{-\frac{1}{8}(\mu_1 - \mu_2)^T\Big(\frac{\Sigma_1 + \Sigma_2}{2}\Big)^{-1}(\mu_1 - \mu_2)\Big\} \quad (34)$$
Then, we upper bound $D_{TV}(\hat p_t, \hat q_t)$. We first leverage Pinsker's inequality to upper bound the total variation distance with the KL divergence:
$$D_{TV}(\hat p_t, \hat q_t) \le \sqrt{\tfrac{1}{2}D_{KL}(\hat p_t, \hat q_t)} \quad (35)$$
Then we plug in the result of Lemma C.1 to upper bound $D_{KL}(\hat p_t, \hat q_t)$, and it follows that:
$$D_{TV}(\hat p_t, \hat q_t) \quad (36)$$
$$\le \sqrt{\tfrac{1}{2}D_{KL}(\hat p_t, \hat q_t)} \quad (37)$$
$$\le \sqrt{\tfrac{1}{4}\int_t^T \beta(s)\,\mathbb{E}_{x|x_0}\big\|\nabla_x \log \hat p_s(x) - \nabla_x \log \hat q_s(x)\big\|_2^2\,ds + 2M^2\int_t^T \beta(s)\,ds} \quad (38)$$
$$\le \sqrt{\tfrac{1}{4}\int_t^T \beta(s)\,\mathbb{E}_{x|x_0}\Big[\big\|\nabla_x \log \hat p_s(x) - s_\theta(x, s)\big\|_2^2 + \big\|s_\theta(x, s) - \nabla_x \log \hat q_s(x)\big\|_2^2\Big]ds + 2M^2\int_t^T \beta(s)\,ds} \quad (39)$$
$$\overset{(a)}{\le} \sqrt{\Big(\frac{L_u}{4} + 2M^2\Big)\int_t^T \beta(s)\,ds + \frac{1}{4}\mathbb{E}_{t, x|x_0}\big\|s_\theta(x, t) - \nabla_x \log \hat q_t(x)\big\|_2^2} \quad (40)$$
where in Equation (40)(a) we leverage the fact that $\beta(\cdot)$ is bounded in $[0, 1]$. Combining Equations (29), (33) and (40), we can finally get:
$$D_{TV}(q_t, \hat q_t) \le \sqrt{\frac{1}{4}\mathbb{E}_{t, x|x_0}\big\|s_\theta(x, t) - \nabla_x \log \hat q_t(x)\big\|_2^2 + C_1} + \sqrt{2 - 2\exp\{-C_2\|\delta\|_2^2\}} + \epsilon_{re}$$
where $C_1 = \big(\frac{L_u}{4} + 2M^2\big)\int_t^T \beta(s)\,ds$ and $C_2 = \frac{1}{8(1 - \prod_{s=1}^t(1 - \beta_s))}$.

C.2 Connection between the deviated-reconstruction loss and the density gradient loss for DDPM

Theorem 3.
Consider an adversarial sample $\tilde x_0 := x_0 + \delta$, where $x_0$ is the clean example and $\delta$ is the perturbation. $p_t(x)$, $\hat p_t(x)$, $q_t(x)$, $\hat q_t(x)$ are the distributions of $x_t$, $\hat x_t$, $\tilde x_t$, $\hat{\tilde x}_t$, where $\hat x_t$ represents the reconstruction of $x_t$ in the reverse process. Given a DDPM parameterized by $\beta(\cdot)$ and the function approximator $s_\theta$ with the mild assumptions that $\|s_\theta(x_t, t) - \epsilon(x_t, t)\|_2^2 \le L_u$, $D_{TV}(p_t, \hat p_t) \le \epsilon_{re}$, and a score function bounded by $M$ (i.e., $\|\epsilon(x, t)\|_2 \le M$), where $\epsilon(\cdot, \cdot)$ represents the mapping function of the true perturbation, we have:
$$D_{TV}(q_t, \hat q_t) \le \sqrt{2 - 2\exp\Big\{-C_2\Big(\sum_{k=t+1}^T \lambda(k, t)\big\|s_\theta(\hat{\tilde x}_k, k) - \epsilon(\hat{\tilde x}_k, k)\big\|_2 + C_1\|\delta\|_2 + (\sqrt{L_u} + 2M)\sum_{k=t+1}^T \lambda(k, t)\Big)^2\Big\}} + \sqrt{2 - 2\exp\Big\{-\frac{\|\delta\|_2^2}{8(1 - \prod_{s=1}^t(1 - \beta_s))}\Big\}} + \epsilon_{re}$$
where $C_1 = \prod_{s=t+1}^T\sqrt{1 - \beta_s}\cdot\sqrt{\prod_{s=1}^T(1 - \beta_s)}$, $C_2 = \frac{1 - \prod_{s=1}^t(1 - \beta_s)}{8(1 - \prod_{s=1}^{t-1}(1 - \beta_s))\beta_t}$, and $\lambda(k, t) = \frac{\beta_k\prod_{i=t+1}^{k-1}\sqrt{1 - \beta_i}}{\sqrt{1 - \prod_{s=1}^k(1 - \beta_s)}}$.

Proof. For ease of notation, we use $\alpha_t := 1 - \beta_t$ and $\bar\alpha_t := \prod_{s=1}^t \alpha_s$. From the DDPM sampling process [25], we know that:
$$\hat x_{t-1} \sim \hat p_{t-1} := \frac{1}{\sqrt{\alpha_t}}\Big(\hat x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\,s_\theta(\hat x_t, t)\Big) + \sigma_t z \quad (43)$$
$$\hat{\tilde x}_{t-1} \sim \hat q_{t-1} := \frac{1}{\sqrt{\alpha_t}}\Big(\hat{\tilde x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\,s_\theta(\hat{\tilde x}_t, t)\Big) + \sigma_t z \quad (44)$$
where $\sigma_t^2 = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\beta_t$, and $\mu_{t,q}$ and $\mu_{t,p}$ represent the means of the distributions $\hat q_t$ and $\hat p_t$, respectively. Then, from Equations (43) and (44), we have:
$$\mu_{t,q} - \mu_{t,p} = \frac{1}{\sqrt{\alpha_t}}(\mu_{t-1,q} - \mu_{t-1,p}) - \frac{1 - \alpha_t}{\sqrt{\alpha_t}\sqrt{1 - \bar\alpha_t}}\big(s_\theta(\hat{\tilde x}_t, t) - s_\theta(\hat x_t, t)\big) \quad (45)$$
Applying Equation (45) iteratively, we get:
$$\mu_{T,q} - \mu_{T,p} = \frac{1}{\prod_{s=t}^T\sqrt{\alpha_s}}(\mu_{t-1,q} - \mu_{t-1,p}) - \sum_{k=t}^{T}\frac{1 - \alpha_k}{\sqrt{1 - \bar\alpha_k}\,\prod_{i=k}^T\sqrt{\alpha_i}}\big(s_\theta(\hat{\tilde x}_k, k) - s_\theta(\hat x_k, k)\big) \quad (46)$$
On the other hand, $\mu_{T,q} - \mu_{T,p}$ can be formulated explicitly considering the Gaussian distribution at time step $T$ in the diffusion process:
$$\mu_{T,q} - \mu_{T,p} = \sqrt{\bar\alpha_T}(\tilde x_0 - x_0) = \sqrt{\bar\alpha_T}\,\delta \quad (47)$$
Combining Equations (46) and (47), we can derive:
$$\|\mu_{t,q} - \mu_{t,p}\|_2 \le \prod_{s=t+1}^T\sqrt{\alpha_s}\cdot\sqrt{\bar\alpha_T}\,\|\delta\|_2 + \sum_{k=t+1}^{T}\frac{(1 - \alpha_k)\prod_{i=t+1}^{k-1}\sqrt{\alpha_i}}{\sqrt{1 - \bar\alpha_k}}\big\|s_\theta(\hat{\tilde x}_k, k) - s_\theta(\hat x_k, k)\big\|_2 \quad (48)\text{--}(49)$$
$$\le \prod_{s=t+1}^T\sqrt{\alpha_s}\cdot\sqrt{\bar\alpha_T}\,\|\delta\|_2 + \sum_{k=t+1}^{T}\lambda(k, t)\Big[\big\|s_\theta(\hat{\tilde x}_k, k) - \epsilon(\hat x_k, k)\big\|_2 + \big\|\epsilon(\hat x_k, k) - s_\theta(\hat x_k, k)\big\|_2\Big] \quad (50)$$
$$\le \prod_{s=t+1}^T\sqrt{\alpha_s}\cdot\sqrt{\bar\alpha_T}\,\|\delta\|_2 + \sqrt{L_u}\sum_{k=t+1}^{T}\lambda(k, t) + \sum_{k=t+1}^{T}\lambda(k, t)\Big[\big\|s_\theta(\hat{\tilde x}_k, k) - \epsilon(\hat{\tilde x}_k, k)\big\|_2 + \big\|\epsilon(\hat{\tilde x}_k, k) - \epsilon(\hat x_k, k)\big\|_2\Big] \quad (51)$$
$$\le \prod_{s=t+1}^T\sqrt{\alpha_s}\cdot\sqrt{\bar\alpha_T}\,\|\delta\|_2 + (\sqrt{L_u} + 2M)\sum_{k=t+1}^{T}\lambda(k, t) + \sum_{k=t+1}^{T}\lambda(k, t)\big\|s_\theta(\hat{\tilde x}_k, k) - \epsilon(\hat{\tilde x}_k, k)\big\|_2 \quad (52)$$
where $\lambda(k, t) = \frac{(1 - \alpha_k)\prod_{i=t+1}^{k-1}\sqrt{\alpha_i}}{\sqrt{1 - \bar\alpha_k}}$. We then leverage the closed-form formulation of the Hellinger distance between two Gaussian distributions [14] parameterized by $(\mu_1, \Sigma_1)$ and $(\mu_2, \Sigma_2)$:
$$H^2\big(\mathcal N(\mu_1, \Sigma_1), \mathcal N(\mu_2, \Sigma_2)\big) = 1 - \frac{\det(\Sigma_1)^{1/4}\det(\Sigma_2)^{1/4}}{\det\big(\frac{\Sigma_1 + \Sigma_2}{2}\big)^{1/2}}\exp\Big\{-\frac{1}{8}(\mu_1 - \mu_2)^T\Big(\frac{\Sigma_1 + \Sigma_2}{2}\Big)^{-1}(\mu_1 - \mu_2)\Big\} \quad (53)$$
Applying it to the distributions $\hat p_t$ and $\hat q_t$, we have:
$$H^2(\hat p_t, \hat q_t) = 1 - \exp\Big\{-\frac{1 - \bar\alpha_t}{8(1 - \bar\alpha_{t-1})\beta_t}\,\|\mu_{t,q} - \mu_{t,p}\|_2^2\Big\} \le 1 - \exp\Big\{-C_2\Big(C_1\|\delta\|_2 + (\sqrt{L_u} + 2M)\sum_{k=t+1}^{T}\lambda(k, t) + \sum_{k=t+1}^{T}\lambda(k, t)\big\|s_\theta(\hat{\tilde x}_k, k) - \epsilon(\hat{\tilde x}_k, k)\big\|_2\Big)^2\Big\} \quad (54)\text{--}(55)$$
where $C_1 = \prod_{s=t+1}^T\sqrt{\alpha_s}\cdot\sqrt{\bar\alpha_T}$ and $C_2 = \frac{1 - \bar\alpha_t}{8(1 - \bar\alpha_{t-1})\beta_t}$. Finally, it follows that:
$$D_{TV}(q_t, \hat q_t) \le D_{TV}(q_t, p_t) + D_{TV}(p_t, \hat p_t) + D_{TV}(\hat q_t, \hat p_t) \quad (56)$$
$$\le \sqrt{2 - 2\exp\Big\{-\frac{\|\delta\|_2^2}{8(1 - \bar\alpha_t)}\Big\}} + \epsilon_{re} + \sqrt{2}\,H(\hat q_t, \hat p_t) \quad (57)$$
$$\le \sqrt{2 - 2\exp\Big\{-C_2\Big(\sum_{k=t+1}^{T}\lambda(k, t)\big\|s_\theta(\hat{\tilde x}_k, k) - \epsilon(\hat{\tilde x}_k, k)\big\|_2 + C_1\|\delta\|_2 + (\sqrt{L_u} + 2M)\sum_{k=t+1}^{T}\lambda(k, t)\Big)^2\Big\}} + \sqrt{2 - 2\exp\Big\{-\frac{\|\delta\|_2^2}{8(1 - \bar\alpha_t)}\Big\}} + \epsilon_{re} \quad (58)\text{--}(59)$$
$$= \sqrt{2 - 2\exp\Big\{-C_2\Big(\sum_{k=t+1}^{T}\lambda(k, t)\big\|s_\theta(\hat{\tilde x}_k, k) - \epsilon(\hat{\tilde x}_k, k)\big\|_2 + C_1\|\delta\|_2 + (\sqrt{L_u} + 2M)\sum_{k=t+1}^{T}\lambda(k, t)\Big)^2\Big\}} + \sqrt{2 - 2\exp\Big\{-\frac{\|\delta\|_2^2}{8(1 - \prod_{s=1}^t(1 - \beta_s))}\Big\}} + \epsilon_{re} \quad (60)\text{--}(61)$$
where $C_1 = \prod_{s=t+1}^T\sqrt{1 - \beta_s}\cdot\sqrt{\prod_{s=1}^T(1 - \beta_s)}$, $C_2 = \frac{1 - \prod_{s=1}^t(1 - \beta_s)}{8(1 - \prod_{s=1}^{t-1}(1 - \beta_s))\beta_t}$, and $\lambda(k, t) = \frac{\beta_k\prod_{i=t+1}^{k-1}\sqrt{1 - \beta_i}}{\sqrt{1 - \prod_{s=1}^k(1 - \beta_s)}}$.

D Experiment

D.1 Pseudo-code of DiffAttack

Given an input pair $(x, y)$ and the perturbation budget, we denote by $L := L_{cls} + \lambda L_{dev}$ (Equation (8)) the surrogate loss, $\Pi$ the projection operator given the perturbation budget and distance metric, $\eta$ the step size, $\alpha$ the momentum coefficient, $N_{iter}$ the number of iterations, and $W$ the set of checkpoint iterations. Concretely, we select $L_{cls}$ as the cross-entropy loss in the first round and the DLR loss in the second round, following [34].
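For concreteness, the optimization loop described above (and formalized in Algorithm 2 below) can be sketched in PyTorch roughly as follows for the ℓ∞ case. This is a simplified, hypothetical re-implementation rather than released code: the callable surrogate_loss is assumed to evaluate $L = L_{cls} + \lambda L_{dev}$ through the purification pipeline (with EOT averaging handled inside it), and the checkpoint set $W$ is assumed to follow the AutoAttack schedule [12].

```python
import torch

def diffattack_linf(x, y, surrogate_loss, eps, eta, alpha=0.75,
                    n_iter=100, checkpoints=()):
    """Momentum PGD-style loop with best-loss tracking and step-size halving at
    checkpoint iterations; a simplified sketch of Algorithm 2 for the l_inf case."""
    def project(z):
        # Projection Pi onto the l_inf ball of radius eps around x and the valid pixel range.
        return torch.clamp(torch.min(torch.max(z, x - eps), x + eps), 0.0, 1.0)

    def grad_and_loss(z):
        z = z.clone().requires_grad_(True)
        loss = surrogate_loss(z, y)   # L = L_cls + lambda * L_dev, EOT-averaged by the caller
        return torch.autograd.grad(loss, z)[0], loss.detach()

    x_prev = x.clone()
    g0, loss_prev = grad_and_loss(x_prev)
    # Sign step for the l_inf budget (Algorithm 2 writes the raw gradient).
    x_cur = project(x_prev + eta * g0.sign())
    _, loss_cur = grad_and_loss(x_cur)
    best_x, best_loss = (x_cur, loss_cur) if loss_cur > loss_prev else (x_prev, loss_prev)

    for k in range(1, n_iter):
        g, _ = grad_and_loss(x_cur)
        z = project(x_cur + eta * g.sign())                  # gradient step
        x_next = project(x_cur + alpha * (z - x_cur)         # momentum step
                         + (1 - alpha) * (x_cur - x_prev))
        _, loss_next = grad_and_loss(x_next)
        if loss_next > best_loss:                            # keep the best iterate
            best_x, best_loss = x_next, loss_next
        if k in checkpoints:                                 # halve the step size
            eta = eta / 2.0
        x_prev, x_cur = x_cur, x_next
    return best_x
```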
The gradient of the surrogate loss with respect to the samples is computed by forwarding the samples and backpropagating the gradients multiple times and taking the average, in order to tackle the problem of randomness. We also optimize all the samples, including the misclassified ones, to push them away from the decision boundary. The gradient can be computed with our segment-wise forwarding-backwarding algorithm in Section 3.3, which enables DiffAttack to be the first fully adaptive attack against the DDPM-based purification defense. The complete procedure is provided in Algorithm 2.

Algorithm 2 DiffAttack
1: Input: $L := L_{cls} + \lambda L_{dev}$, $\Pi$, $(x, y)$, $\eta$, $\alpha$, $N_{iter}$, $W = \{w_0, \ldots, w_n\}$
2: Output: $\tilde x$
3: $\tilde x^{(0)} \leftarrow x$
4: $\tilde x^{(1)} \leftarrow \Pi\big(\tilde x^{(0)} + \eta\,\nabla_{\tilde x^{(0)}} L(\tilde x^{(0)}, y)\big)$
5: $L_{max} \leftarrow \max\{L(\tilde x^{(0)}, y), L(\tilde x^{(1)}, y)\}$
6: $\tilde x \leftarrow \tilde x^{(0)}$ if $L_{max} = L(\tilde x^{(0)}, y)$ else $\tilde x \leftarrow \tilde x^{(1)}$
7: for $k = 1$ to $N_{iter} - 1$ do
8:   $z^{(k+1)} \leftarrow \Pi\big(\tilde x^{(k)} + \eta\,\nabla_{\tilde x^{(k)}} L(\tilde x^{(k)}, y)\big)$
9:   $\tilde x^{(k+1)} \leftarrow \Pi\big(\tilde x^{(k)} + \alpha(z^{(k+1)} - \tilde x^{(k)}) + (1 - \alpha)(\tilde x^{(k)} - \tilde x^{(k-1)})\big)$
10:   if $L(\tilde x^{(k+1)}, y) > L_{max}$ then
11:     $\tilde x \leftarrow \tilde x^{(k+1)}$ and $L_{max} \leftarrow L(\tilde x^{(k+1)}, y)$
12:   end if
13:   if $k \in W$ then
14:     $\eta \leftarrow \eta / 2$
15:   end if
16: end for

D.2 Experiment details

We use pretrained score-based diffusion models [49] on CIFAR-10, guided diffusion models [15] on ImageNet, and DDPM [25] on CIFAR-10 to purify the images, following the literature [34, 4, 55, 56]. Due to the high computational cost, we follow [34] and randomly select a fixed subset of 512 images sampled from the test set to evaluate the robust accuracy for fair comparisons. We implement DiffAttack in the framework of AutoAttack [12] with the same hyperparameters. Specifically, the number of attack iterations ($N_{iter}$) is 100, and the number of iterations used to approximate the gradients (EOT) is 20. The momentum coefficient $\alpha$ is 0.75, and the step size $\eta$ is initialized as $2\epsilon$, where $\epsilon$ is the maximum $\ell_p$-norm of the perturbations. The balance factor $\lambda$ between the classification-guided loss and the deviated-reconstruction loss in Equation (8) is fixed as 1.0, and $\alpha(\cdot)$ is set to the reciprocal of the number of sampled time steps in the evaluation. We consider $\epsilon = 8/255$ and $\epsilon = 4/255$ for $\ell_\infty$ attacks and $\epsilon = 0.5$ for $\ell_2$ attacks, following the literature [11, 12]. We use 3 randomly selected seeds and report the averaged results. CIFAR-10 is under the MIT license and ImageNet is under the BSD 3-clause license.

More details of baselines. In this part, we illustrate more details of the baselines: 1) the SOTA attack against score-based diffusion, the adjoint attack (AdjAttack) [34]; 2) the SOTA attack against DDPM-based diffusion, the Diff-BPDA attack [4]; 3) the SOTA black-box attacks, SPSA [53] and the square attack [1]; and 4) the specific attack against EBM-based purification, the joint attack (score/full) [60].

AdjAttack selects the surrogate loss $L$ as the cross-entropy loss and the DLR loss, and leverages the adjoint method [28] to efficiently backpropagate the gradients through the SDE. The basic idea is to obtain the gradients by solving an augmented SDE. For the SDE in Equation (4), the augmented SDE that computes the gradient $\partial L/\partial \hat x_T$ by backpropagating through it is given by:
$$\begin{bmatrix} \hat x_T \\ \partial L/\partial \hat x_T \end{bmatrix} = \mathrm{sdeint}\Big(\begin{bmatrix} \hat x_0 \\ \partial L/\partial \hat x_0 \end{bmatrix}, \tilde f, \tilde g, \tilde w, 0, T\Big)$$
where $\partial L/\partial \hat x_0$ is the gradient of the objective $L$ w.r.t. $\hat x_0$, and
$$\tilde f([x; z], t) = \begin{bmatrix} f_{rev}(x, t) \\ \frac{\partial f_{rev}(x, t)}{\partial x}\, z \end{bmatrix}, \quad \tilde g(t) = \begin{bmatrix} g_{rev}(t)\,\mathbf 1_d \\ \mathbf 0_d \end{bmatrix}, \quad \tilde w(t) = \begin{bmatrix} w(1 - t) \\ w(1 - t) \end{bmatrix}$$
with $\mathbf 1_d$ and $\mathbf 0_d$ representing the $d$-dimensional vectors of all ones and all zeros, respectively, and $f_{rev}(x, t) := -\frac{1}{2}\beta(t)x - \beta(t)\nabla_x\log p_t(x)$, $g_{rev}(t) := \sqrt{\beta(t)}$.

The SPSA attack approximates the gradients by randomly sampling from a pre-defined distribution and using the finite-difference method.
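For reference, a minimal sketch of such an SPSA-style estimate with Rademacher perturbations is given below; it is a generic illustration under our description of the baseline, not the exact configuration of [53]. The callable loss_fn is assumed to evaluate the full defended pipeline (purifier plus classifier), so no backpropagation through the diffusion model is needed.

```python
import torch

def spsa_gradient(loss_fn, x, y, sigma=0.01, n_samples=128):
    """Estimate grad_x loss_fn(x, y) by simultaneous perturbation (SPSA):
    sample random +/-1 directions and take two-sided finite differences."""
    grad = torch.zeros_like(x)
    for _ in range(n_samples):
        v = (torch.rand_like(x) < 0.5).to(x.dtype) * 2 - 1  # Rademacher direction
        with torch.no_grad():
            delta = loss_fn(x + sigma * v, y) - loss_fn(x - sigma * v, y)
        # For +/-1 entries, 1/v == v, so the SPSA estimator reduces to delta/(2*sigma) * v.
        grad += delta / (2 * sigma) * v
    return grad / n_samples
```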
The square attack heuristically searches for adversarial examples in a low-dimensional space with constraints on the perturbation pattern (i.e., constraining the perturbation to square-shaped patches). The joint attack (score) updates the input with a weighted average of the classifier gradient and the output of the score estimation network (i.e., the gradient of the log-likelihood with respect to the input), while the joint attack (full) leverages the classifier gradient and the difference between the input and the purified sample. The update of the joint attack (score) is formulated as follows:
$$\tilde x \leftarrow \tilde x + \eta\big(\lambda'\,\mathrm{sign}(s_\theta(\tilde x)) + (1 - \lambda')\,\mathrm{sign}(\nabla_{\tilde x} L(F(P(\tilde x)), y))\big) \quad (64)$$
The update of the joint attack (full) is formulated as follows:
$$\tilde x \leftarrow \tilde x + \eta\big(\lambda'\,\mathrm{sign}(P(\tilde x) - \tilde x) + (1 - \lambda')\,\mathrm{sign}(\nabla_{\tilde x} L(F(P(\tilde x)), y))\big) \quad (65)$$
where $\eta$ is the step size and $\lambda'$ is the balance factor, fixed as 0.5 in the evaluation.

Table 5: Comparison of gradient backpropagation time per batch (seconds) / memory cost (MB) between [4] and DiffAttack. We evaluate on CIFAR-10 with WideResNet-28-10 and batch size 16.
Method      | T = 5       | T = 10      | T = 15      | T = 20      | T = 30     | T = 1000
[4]         | 0.45/14,491 | 0.83/23,735 | 1.25/32,905 | 1.80/38,771 | -          | -
DiffAttack  | 0.44/2,773  | 0.85/2,731  | 1.26/2,805  | 1.82/2,819  | 2.67/2,884 | 85.81/3,941

Table 6: The clean / robust accuracy (%) of diffusion-based purification with different diffusion lengths T under DiffAttack. The evaluation is done on ImageNet with ResNet-50 under ℓ∞ attack (ϵ = 4/255).
T = 50        | T = 100       | T = 150       | T = 200
71.88 / 12.46 | 68.75 / 24.62 | 67.79 / 28.13 | 65.62 / 26.83

D.3 Additional Experiment Results

Efficiency evaluation. We evaluate the wall-clock time per gradient backpropagation of the segment-wise forwarding-backwarding algorithm for different diffusion lengths and compare the time efficiency as well as the memory cost with the standard gradient backpropagation used in previous attacks [4]. The results in Table 5 indicate that the segment-wise forwarding-backwarding algorithm consumes comparable wall-clock time per gradient backpropagation compared with [4] while achieving a much better tradeoff between time efficiency and memory efficiency. The evaluation is done on an RTX A6000 GPU with 49,140 MB of memory. The segment-wise forwarding-backwarding algorithm requires one forward pass and one backpropagation pass in total for the gradient computation, while the standard gradient backpropagation in [4] requires one backpropagation pass. However, since the backpropagation pass is much more expensive than the forward pass [36], our segment-wise forwarding-backwarding algorithm achieves comparable time efficiency while significantly reducing the memory cost.

More ablation studies on ImageNet. We conduct more evaluations on ImageNet to consolidate the findings on CIFAR-10. We evaluate the clean/robust accuracy (%) of diffusion-based purification with different diffusion lengths T under DiffAttack. The results in Table 6 indicate that 1) the clean accuracy of the purification defenses negatively correlates with the diffusion length, and 2) a moderate diffusion length benefits the robust accuracy under DiffAttack.

Transferability of DiffAttack. ACA [10] and the Diff-PGD attack [58] explore the transferability of unrestricted adversarial attacks, which generate realistic adversarial examples that fool the classifier while maintaining photorealism. They demonstrate that this kind of semantic attack transfers well to other models.
To explore the transferability of the adversarial examples generated by the ℓp-norm-based DiffAttack, we evaluate the adversarial examples generated on score-based purification with ResNet-50 against defenses with pretrained WRN-50-2 and DeiT-S. The results in Table 7 indicate that DiffAttack also transfers better than AdjAttack and achieves much lower robust accuracy on the other models.

Table 7: Robust accuracy (%) with ℓ∞ attack (ϵ = 8/255) against score-based diffusion purification on CIFAR-10. The adversarial examples are optimized on the diffusion purification defense with pretrained ResNet-50 and evaluated on defenses with other types of models, including WRN-50-2 and DeiT-S.
           | ResNet-50 | WRN-50-2 | DeiT-S
AdjAttack  | 40.93     | 52.37    | 54.53
DiffAttack | 28.13     | 37.28    | 39.62

Ablation study of the balance factor λ. As shown in Equation (11), λ controls the balance between the two objectives. A small λ weakens the deviated-reconstruction objective and makes the attack suffer more from the vanishing/exploding gradient problem, while a large λ downplays the guidance of the classification loss and confuses the direction towards the decision boundary of the classifier. The results in Table 8 show that λ = 1.0 empirically achieves a better tradeoff, so we fix it as 1.0 in the experiments.

Table 8: The impact of different loss weights λ on the robust accuracy (%). We perform an ℓ∞ attack (ϵ = 8/255) against score-based diffusion purification on CIFAR-10 with WideResNet-28-10 and diffusion length T = 0.1.
λ = 0.1 | λ = 1.0 | λ = 10.0
54.69   | 46.88   | 53.12

D.4 Visualization

In this section, we provide visualizations of adversarial examples generated by DiffAttack. Based on the visualizations on CIFAR-10 and ImageNet with different network architectures, we conclude that the perturbations generated by DiffAttack are stealthy, imperceptible to human eyes, and hard for defenses to exploit.

Figure 5: Visualization of the clean images and adversarial samples generated by DiffAttack on CIFAR-10 with ℓ∞ attack (ϵ = 8/255) against score-based purification with WideResNet-28-10.
Figure 6: Visualization of the clean images and adversarial samples generated by DiffAttack on CIFAR-10 with ℓ∞ attack (ϵ = 8/255) against score-based purification with WideResNet-70-16.
Figure 7: Visualization of the clean images and adversarial samples generated by DiffAttack on ImageNet with ℓ∞ attack (ϵ = 4/255) against score-based purification with WideResNet-50-2.
Figure 8: Visualization of the clean images and adversarial samples generated by DiffAttack on ImageNet with ℓ∞ attack (ϵ = 4/255) against score-based purification with DeiT-S.
Figure 9: Visualization of the clean images and adversarial samples generated by DiffAttack on ImageNet with a larger perturbation radius: ℓ∞ attack (ϵ = 8/255) against score-based purification with ResNet-50.