# GreedyFool: Distortion-Aware Sparse Adversarial Attack

Xiaoyi Dong1, Dongdong Chen2, Jianmin Bao2, Chuan Qin1, Lu Yuan2, Weiming Zhang1, Nenghai Yu1, Dong Chen2
1University of Science and Technology of China   2Microsoft Research
{dlight@mail., qc94@mail., zhangwm@, ynh@}.ustc.edu.cn, cddlyf@gmail.com, {jianbao, luyuan, doch}@microsoft.com

Work done during an internship at Microsoft Research Asia. Dongdong Chen is the corresponding author.
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

Modern deep neural networks (DNNs) are vulnerable to adversarial samples. Sparse adversarial samples are a special branch of adversarial samples that can fool the target model by perturbing only a few pixels. The existence of the sparse adversarial attack shows that DNNs are much more vulnerable than people believed, and it also offers a new angle for analyzing DNNs. However, current sparse adversarial attack methods still have shortcomings in both sparsity and invisibility. In this paper, we propose a novel two-stage distortion-aware greedy-based method dubbed "GreedyFool". Specifically, it first selects the most effective candidate positions to modify by considering both the gradient (for adversary) and the distortion map (for invisibility), then drops some less important points in the reduce stage. Experiments demonstrate that, compared with the state-of-the-art method, we only need to modify 3× fewer pixels under the same sparse perturbation setting. For target attack, the success rate of our method is 9.96% higher than that of the state-of-the-art method under the same pixel budget. Code can be found at https://github.com/LightDXY/GreedyFool.

1 Introduction

Despite the success of current deep neural networks (DNNs), they have been shown to be vulnerable to adversarial samples [41, 17, 6, 36, 24, 31, 4, 18, 42, 13, 14, 47]. An adversarial sample is a carefully crafted image that fools the target network by adding small perturbations to the original clean image. On one hand, the existence of adversarial samples raises a security threat for DNNs, especially those deployed in daily life [23]. On the other hand, adversarial samples can help us disclose the weaknesses of DNNs and further understand their mechanisms.

To understand the weaknesses and mechanisms of DNNs comprehensively, it is necessary to study various kinds of adversarial attacks. In most cases, adversarial perturbations are constrained by an lp-norm distance and can be roughly categorized into two types: dense attacks and sparse attacks. Dense attack methods perturb all the pixels of the image under an l∞ or l2 constraint [10, 6, 41], while sparse attack methods only perturb a few pixels under an l0 constraint [34, 30, 11].

Since the pioneering work [41], many dense attack methods have been proposed and have achieved impressive results [15, 35, 5, 10, 19, 43, 33, 44, 45, 8, 9]. However, different from the l∞ or l2 constraint, generating adversarial noise under an l0 constraint is an NP-hard problem. To address this problem, many previous works have been proposed under both the white-box and the black-box setting.

Figure 1: Adversarial samples and their corresponding perturbations generated by different sparse adversarial attack methods (ours, SparseFool, σ-PGD0). Our method achieves the best performance in terms of sparsity and invisibility.
For white-box attacks, JSMA [34] proposed to select the most effective pixels that influence the model decision based on a saliency map, SparseFool [30] converted the problem into an l1-constrained problem, and PGD0 [11] proposed to project the adversarial noise generated by PGD [27] onto the l0-ball to obtain an l0 version of PGD. For black-box attacks, One Pixel Attack [39] and Pointwise Attack [37] proposed to apply evolutionary algorithms to achieve extremely sparse perturbations, LocSearchAdv [32] uses local search to realize sparse attacks, and more recently CornerSearch [11] proposed to traverse all the pixels and select the most effective subset of pixels to realize the attack. In this paper, we focus on the white-box attack.

However, these existing methods still suffer from several issues. JSMA [34] suffers from high complexity and can hardly attack high-resolution images. SparseFool [30] is not sparse enough and cannot achieve a target attack. For PGD0 [11], the number of perturbed pixels is defined before the attack and all of these pixels must be used, so it perturbs many redundant pixels and is probably not flexible for real scenarios.

To address these issues, we propose a novel greedy-based method dubbed "GreedyFool". It mainly consists of two stages. In the first stage, we iteratively choose the k most suitable pixels to modify based on the gradient information until the attack succeeds. In the second stage, we apply a greedy strategy again to drop as many less important pixels as possible to further improve the sparsity. On the other hand, we notice that sparse attacks are usually clearly visible, as previous works have shown [30, 11]. Therefore, we introduce a distortion map, an invisibility constraint that helps sparse attacks achieve better invisibility. Instead of using a hand-crafted distortion [11], which may ignore the semantic and content information of the image, we propose a framework based on Generative Adversarial Nets (GANs) [16] to learn a precise distortion map of the image. This distortion map is added into the first stage of GreedyFool as positional guidance.

Experiments on the CIFAR10 [22] and ImageNet [12] datasets show that the sparsity of our method is much better than that of state-of-the-art methods. For ImageNet images, when the perturbation threshold is set to 10, our method only needs to perturb 222.5 pixels while SparseFool needs to perturb 615 pixels, about 3× more than our method. When relaxing the threshold to 255, our method only needs to modify 27 pixels to achieve an adversarial attack, while SparseFool needs 80.50 pixels. Meanwhile, our method demonstrates a higher target attack success rate than state-of-the-art methods. With a 200-pixel (0.22% of pixels) budget on ImageNet, the target attack success rate of our GreedyFool is 15.52%, while that of PGD0 is only 5.56%. When the budget increases to 1000 pixels (1.1% of pixels), the success rate of our GreedyFool increases to 84.86%, while it is only 64.20% for PGD0.

To summarize, the main contributions of this paper are threefold: 1) We propose a novel two-stage greedy-based adversarial attack method to further increase the sparsity. 2) To generate adversarial samples with better invisibility, we leverage a GAN to generate a distortion cost map as guidance for finding proper pixels to modify. 3) Extensive experiments demonstrate the superb performance of our method. For both non-target and target attacks, our method outperforms previous methods by large margins.
2 Distortion-Aware Sparse Adversarial Attack

Problem Analysis. We denote x as the source image and y as its corresponding ground-truth label. Let H be the target model, and H(x)_c the output logit for class c. For a clean input x, argmax_c H(x)_c = y. An adversarial sample x_adv = x + r is generated by adding noise r to the original image x such that argmax_c H(x_adv)_c ≠ y. Meanwhile, the adversarial noise r should be small enough to guarantee that the adversarial sample is similar to the original one; in most cases this is measured by an lp-norm. In this paper, under the l0 constraint, the problem is

$$ \min_r \|r\|_0 \quad \text{s.t.} \quad \arg\max_c H(x_{adv})_c \neq y. \tag{1} $$

However, this problem is NP-hard, so we resort to a greedy algorithm to find a local optimum quickly. More specifically, we select the k currently optimal pixels in each iteration based on the gradient information until the attack succeeds. But a local optimum is not always the global optimum, so to obtain better sparsity we apply a greedy search again to find the unnecessary points in the selected point set. Meanwhile, for better invisibility, we use a distortion value to indicate whether a pixel is suitable for modification or not. Rather than hand-crafting each pixel's distortion, we propose using a generative adversarial network to learn a proper probability definition.

2.1 GreedyFool Adversarial Attack

In the first stage, we iteratively increase the number of perturbed pixels until we find an adversarial sample. For clarity, we use a binary mask m to denote whether a pixel is selected and initialize m with all zeros. In each iteration, we run a forward-backward pass with the latest modified image x_t^adv to calculate the gradient of the loss function with respect to x_t^adv; pixels with larger gradient magnitudes are regarded as contributing more to the adversary:

$$ g_t = \nabla_{x_t^{adv}} L(x_t^{adv}, y, H), \qquad L(x, y, H) = \max\big(H(x)_y - \max_{i \neq y} H(x)_i,\ -\kappa\big), \tag{2} $$

where L(x_t^adv, y, H) is the loss function and ∇_{x_t^adv} L(x_t^adv, y, H) is its gradient with respect to x_t^adv. Here we use the loss function from C&W [6] for the non-target attack. κ is a confidence factor that controls the attack strength; we set κ = 0 by default and enlarge it for better black-box transferability.

To achieve invisibility of the adversarial samples, we introduce a distortion map of an image, where the distortion of a pixel represents the visibility of a modification at that pixel: a higher distortion means the modification can be more easily observed. We will introduce how to obtain the distortion map in the next section. Suppose the distortion map is ϱ ∈ (0, 1)^{H×W}. To balance invisibility and sparsity, we calculate a perturbation weight map p from ϱ and select pixels using both p and g_t. Formally,

$$ p_{i,j} = \begin{cases} 0 & \varrho_{i,j} > \tau_1 \\ (\tau_1 - \varrho_{i,j})/(\tau_1 - \tau_2) & \tau_2 < \varrho_{i,j} \le \tau_1 \\ 1 & \varrho_{i,j} \le \tau_2 \end{cases} \tag{3} $$

$$ g_t' = p \odot g_t \odot (1 - m), \tag{4} $$

where τ1 and τ2 are predefined thresholds, set to the 70th and 25th percentiles of ϱ by default. We then add the top-k unselected pixels with the largest |g_t'| values to the set of perturbed pixels and change the corresponding entries of m to 1. Finally, we update x_t^adv based on g_t and m with the perturbation step size α:

$$ x_{t+1}^{adv} = \mathrm{Clip}_x^{\epsilon}\Big(x_t^{adv} - \alpha \cdot \frac{g_t \odot m}{\|g_t \odot m\|_\infty}\Big). \tag{5} $$
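As a concrete reference for Eqs. (2)-(5), the following is a minimal PyTorch-style sketch of one increasing-stage iteration for a single image. It is not the official implementation: the function and variable names (`cw_loss`, `weight_map`, `increasing_step`) are ours, the channel-wise aggregation of the gradient into a per-pixel score and the exact normalization and clipping details are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def cw_loss(logits, y, kappa=0.0):
    # Eq. (2): margin between the ground-truth logit and the best other logit.
    true_logit = logits.gather(1, y.view(-1, 1)).squeeze(1)
    other_logit = logits.masked_fill(
        F.one_hot(y, logits.size(1)).bool(), float('-inf')).max(dim=1).values
    return torch.clamp(true_logit - other_logit, min=-kappa).sum()

def weight_map(distortion, tau1, tau2):
    # Eq. (3): 1 where distortion is low, 0 where it is high, linear in between
    # (tau1 > tau2, e.g. the 70th and 25th percentiles of the distortion map).
    return ((tau1 - distortion) / (tau1 - tau2)).clamp(0.0, 1.0)

def increasing_step(model, x, x_adv, y, mask, p, k=1, alpha=0.02, eps=10 / 255):
    """One iteration of the increasing stage for a single image (batch size 1).
    x, x_adv: (1,C,H,W) tensors in [0,1]; mask, p: (1,1,H,W) tensors.
    The paper initializes alpha = eps / 2."""
    x_adv = x_adv.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(cw_loss(model(x_adv), y), x_adv)[0]   # g_t, Eq. (2)

    # Eq. (4): distortion-weighted gradient magnitude on unselected pixels only
    # (channel gradients are aggregated into one score per pixel here).
    score = p * grad.abs().sum(dim=1, keepdim=True) * (1.0 - mask)

    # Select the k highest-scoring new pixel positions and mark them in the mask.
    idx = score.flatten(1).topk(k, dim=1).indices
    new_mask = mask.flatten(1).clone()
    new_mask.scatter_(1, idx, 1.0)
    new_mask = new_mask.view_as(mask)

    # Eq. (5): descend along the masked gradient, keeping its direction
    # (normalized by the largest selected component), then clip to the
    # eps-ball around x and to the valid image range.
    g_sel = grad * new_mask
    step = alpha * g_sel / (g_sel.abs().max() + 1e-12)
    x_next = torch.min(torch.max(x_adv.detach() - step, x - eps), x + eps).clamp(0, 1)
    return x_next, new_mask
```

In the full attack, this step is repeated with the mask carried over until the prediction changes or the iteration budget T is exhausted.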
Besides the distortion map guidance, there are two other key implementation differences between our method and existing sparse adversarial attack methods [30].

First, we search for the sparse adversarial noise greedily in every iteration using the latest updated gradient, rather than an approximate direction estimated by another adversarial method as in SparseFool [30]. This ensures that every choice at the current iteration is the best available and lets the attack succeed in earlier iterations. Second, SparseFool [30] assumes that the adversarial contribution of each pixel is linearly additive, so it updates each selected pixel only once; we instead treat it as a non-linear process and update all the selected pixels in each iteration, which also helps us achieve the attack faster. Meanwhile, we notice that a sparse attack is sensitive to the perturbation direction, so we keep both the magnitude and the sign of the gradient rather than only its sign at each pixel. This makes the perturbation more precise and helps us achieve the attack with fewer pixels than previous methods. In the experiments, we will show that the adversary of a perturbation is sensitive to its direction when its dimension is small.

After the first stage, although the selected pixels already ensure a successful sparse attack, some of them can be redundant because of the greedy property. For a sparser perturbation, it is necessary to reduce the number of perturbed pixels as much as possible, so we adopt a second, reducing stage. Formally, we maintain a subset R containing the coordinates of the noise components that are necessary for the adversary, initialized as an empty set. We assume that pixels with smaller perturbation magnitudes contribute less to the adversary than other pixels. So in each iteration, we choose the pixel with the minimum perturbation magnitude from the set {|r_d| : r_d ∈ r, r_d ≠ 0, d ∉ R} and drop it, then test the adversary of the images generated from the reduced noise with a series of step sizes α from 1 to the threshold ϵ. If at least one of them is still adversarial, we take the adversarial sample with the smallest α as our new adversarial sample. If none of the generated images is adversarial, we regard the dropped pixel as a necessary adversarial component and add its coordinate to R. Note that the forward passes for different step sizes α can be combined into one batch, so this process is very fast.

Meanwhile, our method can be viewed as an l0 derivative of gradient-based methods, so for a given target label tar, we can achieve a target attack by simply replacing the loss function in Eq. (2) with

$$ L_{tar}(x, tar, H) = \max\big(\max_{i \neq tar} H(x)_i - H(x)_{tar},\ -\kappa\big). \tag{6} $$

The whole attack process is shown in Algorithm 1.

Algorithm 1: GreedyFool
Input: source image x, target model H, distortion map ϱ.
Parameters: max iterations T, threshold ϵ, select number k.
Output: adversarial sample x_adv.

  Stage 1: Increasing
    Initialize m ← 0, t ← 0, α ← ϵ/2, x_t^adv ← x.
    Generate the perturbation weight map p from ϱ with Eq. (3).
    while t < T and x_t^adv is not adversarial do
        g_t ← ∇_{x_t^adv} L(x_t^adv, y, H)
        g_t' ← p ⊙ g_t ⊙ (1 − m)
        d_1, d_2, ..., d_k ← argtopk(|g_t'|)
        m_{d_1, d_2, ..., d_k} ← 1
        x_{t+1}^adv ← Clip_x^ϵ(x_t^adv − α · (g_t ⊙ m)/‖g_t ⊙ m‖_∞)
        t ← t + 1
    end while

  Stage 2: Reducing
    R ← ∅, x_adv ← x_t^adv, r ← x_adv − x
    while t < T and {d : r_d ≠ 0} \ R ≠ ∅ do
        found ← False
        d ← argmin_{d ∉ R, r_d ≠ 0} |r_d|
        r' ← r, r'_d ← 0
        for α = 1 to ϵ and not found do
            x̂_adv ← Clip_x^ϵ(x + α · r'/‖r'‖_∞)
            if x̂_adv is adversarial then
                r ← r', x_adv ← x̂_adv, found ← True
            end if
        end for
        if not found then R ← R ∪ {d} end if
        t ← t + 1
    end while
    return x_adv
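Continuing the sketch above, the reducing stage (Stage 2 of Algorithm 1) can be written as the following greedy pruning loop. Again this is an illustrative approximation rather than the released code: the helper names are ours, the iteration cap T is omitted, and the integer step sweep α = 1..ϵ is replaced by `n_alpha` evenly spaced step sizes in (0, ϵ] to fit inputs scaled to [0, 1].

```python
import torch

@torch.no_grad()
def reduce_stage(model, x, x_adv, y, eps=10 / 255, n_alpha=10):
    """Stage 2 of Algorithm 1 for a single image: greedily try to drop the
    perturbed pixel with the smallest noise magnitude; a drop is accepted if
    some rescaling of the remaining noise keeps the sample adversarial."""
    _, C, H, W = x.shape
    r = (x_adv - x).reshape(C, H * W)                                # current sparse noise
    necessary = torch.zeros(H * W, dtype=torch.bool, device=x.device)  # the set R
    best = x_adv.clone()

    def is_adv(img):
        return model(img).argmax(1).item() != int(y)

    while True:
        mag = r.abs().sum(0)                          # per-pixel noise magnitude
        cand = (mag > 0) & ~necessary
        if not cand.any():
            break
        d = torch.where(cand, mag, torch.full_like(mag, float('inf'))).argmin()

        r_try = r.clone()
        r_try[:, d] = 0.0                             # tentatively drop pixel d
        direction = r_try / (r_try.abs().max() + 1e-12)
        dropped = False
        for alpha in torch.linspace(eps / n_alpha, eps, n_alpha):
            cand_img = (x + (alpha * direction).reshape(1, C, H, W)).clamp(0, 1)
            if is_adv(cand_img):                      # keep the smallest working alpha
                r, best, dropped = r_try, cand_img, True
                break
        if not dropped:
            necessary[d] = True                       # pixel d is needed for the attack
    return best
```

As the paper notes, the forward passes over the different α values can be stacked into one batch, which keeps this stage fast.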
2.2 Distortion Map Generation via Generative Adversarial Networks

The purpose of the distortion map is to evaluate the visibility of modifying each pixel of an input image: a pixel with high distortion means that modifying it is highly visible. A previous method [11] proposed a hand-crafted σ-map for this purpose. However, the σ-map only takes a 3×3 local patch into consideration and is highly frequency-related, which may ignore the content and semantic information of the image. To address this issue, we propose a new GAN-based framework for distortion map generation. The framework contains a generator G and a discriminator D, as shown in Fig. 2; G and D play a minimax game to obtain the distortion map.

Figure 2: Illustration of our proposed GAN-based framework for distortion map generation. We build a partially perturbed image x' and a globally perturbed image x'' for training. A BCE loss and an extra regularization loss are used during the training process.

For a given input x of size H × W × C, the distortion map is generated by the generator G: ϱ = G(x) with ϱ ∈ (0, 1)^{H×W}. With the distortion map ϱ, we calculate the partially perturbed image x' as

$$ x' = \mathrm{Clip}\big(x + (1 - \varrho) \odot n\big), \tag{7} $$

where n is random noise sampled from the uniform distribution U(−δ, δ); δ is a predefined threshold and we set δ = 8/255 in our experiments. The discriminator D tries to distinguish the real image x from the perturbed image x'. To make the GAN training more stable, we also build a globally perturbed image x'' = x + n and use it when training D. The loss for the discriminator is therefore

$$ L_D(x) = -\big(2\log D(x) + \log(1 - D(x')) + \log(1 - D(x''))\big). \tag{8} $$

The goal of the generator G is to produce a distortion map ϱ such that the perturbed image x' is indistinguishable from the original x for D; on its own, this would encourage G to output ϱ = 1^{H×W}, in which case no modification would be made to x. We therefore add a regularization loss that requires the distortion map to be as small as possible. This regularization forces the generator to learn where perturbations can be added while keeping the perturbed image indistinguishable to the discriminator; locations where perturbations can be added with low visibility correspond to small values of ϱ. The overall generator loss is

$$ L_G(x) = -\log D(x') + \lambda \|\varrho\|_1, \tag{9} $$

where λ is the weight of the regularization loss; we choose λ = 1e−5 by default. Please refer to the supplementary material for the model architecture and training details.
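The adversarial training of the distortion-map generator can be summarized by the following single-step sketch of Eqs. (7)-(9). The generator and discriminator architectures are described only in the supplementary material, so `G` and `D` are placeholders here, and details such as the numerical-stability epsilon and averaging over the batch are our assumptions.

```python
import torch

def distortion_gan_step(G, D, x, opt_G, opt_D, delta=8 / 255, lam=1e-5, eps=1e-8):
    """One training step for the distortion-map GAN.
    G(x) is assumed to return a map rho in (0,1) of shape (B,1,H,W);
    D(.) is assumed to return a real/fake probability in (0,1)."""
    n = (torch.rand_like(x) * 2 - 1) * delta              # noise ~ U(-delta, delta)
    rho = G(x)
    x_part = (x + (1 - rho) * n).clamp(0, 1)              # Eq. (7): partially perturbed x'
    x_glob = (x + n).clamp(0, 1)                          # globally perturbed x''

    # Discriminator: real image vs. the two perturbed images, Eq. (8).
    opt_D.zero_grad()
    d_loss = -(2 * torch.log(D(x) + eps)
               + torch.log(1 - D(x_part.detach()) + eps)
               + torch.log(1 - D(x_glob) + eps)).mean()
    d_loss.backward()
    opt_D.step()

    # Generator: make x' look real to D while keeping the distortion map small, Eq. (9).
    opt_G.zero_grad()
    g_loss = -torch.log(D(x_part) + eps).mean() + lam * rho.abs().sum()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```

At attack time only the trained generator is needed: ϱ = G(x) is computed once per image and converted into the weight map p of Eq. (3).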
3 Experiments

3.1 Sparsity Evaluation

We first evaluate the sparsity under different perturbation thresholds. On the CIFAR10 [22] dataset, we compare our method with recent works: JSMA [34], SparseFool [30], PGD0 and σ-PGD0 [11] (only σ-PGD0 for settings with ϵ < 255). For the ImageNet [12] dataset, since JSMA is too slow to attack large images, we only compare with SparseFool, PGD0, and σ-PGD0. In the following experiments, we use the official implementations of SparseFool [1] and PGD0 [2] with their default settings; for JSMA, we use the implementation from Foolbox 2.4 [3]. For our GreedyFool, we set the select number k = 1 when ϵ ≥ 128. When ϵ < 128, we initialize k to 1 and increase it by 1 after each iteration for faster speed.

As introduced in Sec. 1, PGD0 needs a pre-defined number of perturbed pixels and can only report the fooling rate under such a pre-defined number; we call this static evaluation and report it as the m-pixel fooling rate in the following. The other methods perturb an image with a dynamic number of pixels and run until the attack succeeds or an iteration upper bound is reached; we call this dynamic evaluation and report the mean and median numbers of perturbed pixels together with the total fooling rate. For a fair comparison, we report results under both metrics.

In the following experiments, we generate adversarial samples by attacking an Inception-v3 [40] model pretrained on the ImageNet [12] dataset. For the CIFAR10 dataset, we use a pretrained Network-in-Network (NIN) [25] model, as in [11]. For an accurate comparison, we generate adversarial samples for 5000 images randomly selected from the ImageNet validation set and 10000 images from the CIFAR10 test set.

Table 1: Non-target attack sparsity comparison on the ImageNet dataset. The m-pixel fooling rate is the fooling rate when at most m pixels may be perturbed.

Threshold ϵ = 255:
| Method | Mean | Median | Fooling Rate (%) | m=10 | m=20 | m=50 | m=100 | m=200 |
|---|---|---|---|---|---|---|---|---|
| PGD0 | - | - | - | 28.26 | 38.77 | 61.11 | 78.79 | 92.61 |
| σ-PGD0 | - | - | - | 17.52 | 22.56 | 37.78 | 48.13 | 61.88 |
| SparseFool | 147.61 | 80.50 | 100.00 | 16.10 | 25.03 | 38.08 | 56.04 | 75.22 |
| GF (ours) | 62.13 | 27.00 | 100.00 | 29.60 | 42.65 | 67.24 | 82.87 | 94.60 |

Threshold ϵ = 100:
| Method | Mean | Median | Fooling Rate (%) | m=10 | m=20 | m=50 | m=100 | m=200 |
|---|---|---|---|---|---|---|---|---|
| σ-PGD0 | - | - | - | 12.76 | 17.97 | 25.78 | 36.29 | 45.12 |
| SparseFool | 172.24 | 94.00 | 100.00 | 14.22 | 20.60 | 35.82 | 52.00 | 77.21 |
| GF (ours) | 89.27 | 41.00 | 100.00 | 22.76 | 34.36 | 56.61 | 73.70 | 88.28 |

Threshold ϵ = 10:
| Method | Mean | Median | Fooling Rate (%) | m=50 | m=100 | m=200 | m=500 | m=1000 |
|---|---|---|---|---|---|---|---|---|
| σ-PGD0 | - | - | - | 5.71 | 9.29 | 13.75 | 21.19 | 28.43 |
| SparseFool | 1061.74 | 615.00 | 100.00 | 10.80 | 17.82 | 25.02 | 43.20 | 65.25 |
| GF (ours) | 552.11 | 222.50 | 100.00 | 21.13 | 32.46 | 47.20 | 70.59 | 85.57 |

Table 2: Non-target attack sparsity comparison on the CIFAR10 dataset. The m-pixel fooling rate is the fooling rate when at most m pixels may be perturbed.

Threshold ϵ = 255:
| Method | Mean | Median | Fooling Rate (%) | m=1 | m=2 | m=5 | m=10 | m=20 |
|---|---|---|---|---|---|---|---|---|
| JSMA | 18.51 | 13 | 100.00 | 4.81 | 9.67 | 23.59 | 42.06 | 67.84 |
| PGD0 | - | - | - | 15.22 | 25.82 | 50.93 | 75.46 | 93.73 |
| σ-PGD0 | - | - | - | 7.32 | 10.00 | 22.24 | 40.93 | 64.78 |
| SparseFool | 17.29 | 13 | 100.00 | 6.52 | 11.31 | 23.88 | 43.10 | 69.32 |
| GF (ours) | 5.98 | 4 | 100.00 | 12.41 | 29.06 | 56.26 | 79.30 | 95.77 |

Threshold ϵ = 100:
| Method | Mean | Median | Fooling Rate (%) | m=5 | m=10 | m=15 | m=20 | m=30 |
|---|---|---|---|---|---|---|---|---|
| σ-PGD0 | - | - | - | 11.71 | 22.17 | 29.17 | 33.84 | 43.28 |
| SparseFool | 18.99 | 14 | 100.00 | 20.30 | 38.25 | 52.81 | 65.74 | 80.47 |
| GF (ours) | 10.53 | 8 | 100.00 | 42.01 | 67.13 | 81.86 | 89.99 | 97.13 |

Threshold ϵ = 10:
| Method | Mean | Median | Fooling Rate (%) | m=30 | m=40 | m=50 | m=60 | m=100 |
|---|---|---|---|---|---|---|---|---|
| σ-PGD0 | - | - | - | 5.58 | 7.20 | 7.87 | 8.49 | 10.84 |
| SparseFool | 163.54 | 128 | 100.00 | 15.78 | 19.94 | 25.53 | 27.04 | 41.07 |
| GF (ours) | 47.82 | 34 | 100.00 | 42.26 | 52.17 | 60.27 | 67.70 | 85.39 |

Non-target Attack Results. Non-target attack results on ImageNet [12] and CIFAR10 are shown in Table 1 and Table 2. For ImageNet with ϵ = 255, under dynamic evaluation the median perturbation number of our GreedyFool is only 27 (0.03% of pixels), while SparseFool needs to perturb 80.5 pixels, nearly 3× more than ours. As the threshold ϵ decreases, the number of perturbed pixels needed by both methods increases, while our method remains better than SparseFool by large margins. Under static evaluation, our GreedyFool still performs best. We also notice that the fooling rate of σ-PGD0 is much lower than that of PGD0: when m = 100 and ϵ = 255, the fooling rate of PGD0 is 78.79%, while σ-PGD0 only reaches 48.13%.
In the meantime, the fooling rate of our GreedyFool is 82.87%. On the CIFAR10 dataset, our method still outperforms the other methods in most settings. The only exception is the static evaluation with m = 1 and ϵ = 255, which is due to the restart strategy of PGD0. But as m increases, the fooling rate of PGD0 grows slowly, while that of our GreedyFool grows rapidly and soon surpasses PGD0.

Target Attack Results. Since SparseFool can only perform non-target attacks, we compare target attack results with PGD0 and σ-PGD0 on both the CIFAR10 and ImageNet datasets in Table 4.

Table 3: Speed comparison (seconds per image) on the CIFAR10 and ImageNet datasets.

| Method | JSMA | PGD0 | σ-PGD0 | SparseFool | GreedyFool (ours) |
|---|---|---|---|---|---|
| CIFAR10 | 1.192 | 0.169 | 0.179 | 0.291 | 0.114 |
| ImageNet | - | 3.232 | 3.731 | 8.380 | 5.253 |

Table 4: Target attack success rate on the CIFAR10 and ImageNet datasets. The m-pixel fooling rate is the fooling rate when at most m pixels may be perturbed.

Threshold ϵ = 255:
| Method | CIFAR10 m=1 | m=10 | m=20 | m=50 | m=100 | ImageNet m=200 | m=400 | m=1000 | m=2000 | m=4000 |
|---|---|---|---|---|---|---|---|---|---|---|
| PGD0 | 1.38 | 30.94 | 63.88 | 96.50 | 99.44 | 5.56 | 22.50 | 64.20 | 83.89 | 93.10 |
| σ-PGD0 | 0.67 | 9.10 | 22.13 | 55.18 | 74.33 | 0.65 | 1.20 | 5.45 | 15.55 | 25.52 |
| GF (ours) | 2.21 | 32.91 | 67.49 | 98.21 | 99.98 | 15.52 | 43.63 | 84.86 | 93.82 | 98.00 |

Threshold ϵ = 100:
| Method | CIFAR10 m=1 | m=10 | m=20 | m=50 | m=100 | ImageNet m=200 | m=400 | m=1000 | m=2000 | m=4000 |
|---|---|---|---|---|---|---|---|---|---|---|
| σ-PGD0 | 0.33 | 3.05 | 5.59 | 19.17 | 35.77 | 0.00 | 0.18 | 0.37 | 2.22 | 6.82 |
| GF (ours) | 1.53 | 21.36 | 48.78 | 93.09 | 99.88 | 9.21 | 30.80 | 84.29 | 95.88 | 99.20 |

Threshold ϵ = 10:
| Method | CIFAR10 m=1 | m=10 | m=20 | m=50 | m=100 | ImageNet m=200 | m=400 | m=1000 | m=2000 | m=4000 |
|---|---|---|---|---|---|---|---|---|---|---|
| σ-PGD0 | 0.00 | 0.16 | 0.39 | 0.50 | 0.94 | 0.10 | 0.25 | 0.15 | 0.15 | 0.25 |
| GF (ours) | 0.14 | 2.49 | 5.62 | 19.27 | 45.23 | 0.20 | 1.60 | 8.20 | 31.24 | 70.63 |

Although a target attack is much harder than a non-target attack, our GreedyFool outperforms PGD0 and σ-PGD0 by a large margin. Besides, we find it is hard for σ-PGD0 to find target-attack adversarial samples: for example, even with a pixel budget of m = 4000 and ϵ = 255, its fooling rate is only 25.52%.

Speed Comparison. We compare the speed of our GreedyFool with state-of-the-art methods in Table 3. When the input image is small, all methods attack quickly, and our GreedyFool only needs 0.114 seconds on average. On the ImageNet dataset, our GreedyFool still achieves speed comparable to the other methods while guaranteeing better sparsity.

Black-box Transfer Attack Results. As introduced in Sec. 2.1, κ controls the attack strength: when κ = 0, we stop the attack once the generated sample becomes adversarial; when κ > 0, we keep adding pixels to increase the attack confidence until the logit difference max_{i≠y} H(x)_i − H(x)_y > κ. Since a large κ conflicts with the reducing stage of GreedyFool, we only use the increasing stage to evaluate the influence of κ. Here we use black-box transferability to measure the adversary: a higher fooling rate indicates a stronger adversary. In this section, we compare with SparseFool on the ImageNet dataset. Adversarial samples are generated by attacking a DenseNet161 [21] model with ϵ = 255, and the fooling rate is tested on VGG16 [38] and ResNet50 [20] models; all models are pretrained on ImageNet. Results are reported in Table 5. When κ = 0, the black-box fooling rates of SparseFool on the two models are 15.67% and 26.00%, similar to those of our GreedyFool (17.00% and 23.33%), while the median number of perturbed pixels of our GreedyFool is only 32.00, nearly 3.5× smaller than that of SparseFool. As κ increases, both the number of perturbed pixels and the transferability increase.
When κ = 6, the transferability of our method is nearly 2× better than that of SparseFool, while the number of perturbed pixels is still smaller. These results show that our GreedyFool is efficient and flexible. For more results, please refer to the supplementary material.

Table 5: Black-box transferability on the ImageNet dataset. κ is the confidence factor used in the loss function of Eq. (2). FR denotes the fooling rate and * marks white-box attack results.

| Method | Mean | Median | FR DenseNet161 (%) | FR ResNet50 (%) | FR VGG16 (%) |
|---|---|---|---|---|---|
| SparseFool | 187.39 | 111.00 | 100.00* | 15.67 | 26.00 |
| GreedyFool (κ = 0) | 47.79 | 32.00 | 100.00* | 17.00 | 23.33 |
| GreedyFool (κ = 1) | 59.06 | 41.00 | 100.00* | 18.00 | 29.00 |
| GreedyFool (κ = 2) | 72.37 | 53.50 | 100.00* | 23.33 | 29.67 |
| GreedyFool (κ = 3) | 85.37 | 65.50 | 100.00* | 24.67 | 35.67 |
| GreedyFool (κ = 4) | 99.57 | 75.00 | 100.00* | 27.67 | 40.00 |
| GreedyFool (κ = 5) | 113.43 | 85.50 | 100.00* | 30.33 | 45.00 |
| GreedyFool (κ = 6) | 127.48 | 97.00 | 100.00* | 32.33 | 48.00 |

3.2 Invisibility Evaluation

Invisibility is an important property of adversarial samples. It can be analyzed from two aspects: invisibility to human eyes and invisibility to machines. Since the adversarial perturbations generated by existing methods are already very small, human invisibility is easy to achieve, but machine invisibility is much more challenging. Many recent works [5, 7, 46, 28, 29] use powerful classification models to detect adversarial samples. In this paper, we consider two statistics-based methods as metrics of machine invisibility. The first is the state-of-the-art steganalysis-based adversarial detection method SRM [26]. The second is to train a powerful binary CNN classifier to separate the generated adversarial samples from clean images; the underlying motivation is that, if the adversarial samples are invisible enough, the classifier will struggle to converge and end up at random guessing (50% accuracy). More details are given in the supplementary material.

In Table 6, we compare our method not only with sparse attack methods but also with a strong l2 baseline, C&W [6], and an l∞ baseline, I-FGSM [23]. For C&W, we use 4 search steps with 25 or 100 iterations per step, denoted as C&W 4×25 and C&W 4×100 respectively. When ϵ = 255, we set l0 = 200 for PGD0 and l0 = 1000 for σ-PGD0 to keep their success rates above 90%. When ϵ = 10, even with l0 = 80000 the success rate is still only 73.56%, so we do not report it here.

Table 6: Machine invisibility comparison using the SRM detection rate and the binary CNN classifier accuracy. All adversarial samples are generated by non-target attacks against a pretrained Inception-v3 model with image size 299 × 299.

| Attack Method | L0 | L2 | L∞ | Fooling Rate (%) | SRM Det. Rate (%) | Classifier Acc. (%) |
|---|---|---|---|---|---|---|
| I-FGSM (ϵ = 1) | 80372.00 | 0.78 | 1.00 | 94.60 | 99.42 | 86.15 |
| I-FGSM (ϵ = 2) | 83420.00 | 1.29 | 2.00 | 98.60 | 99.64 | 96.43 |
| I-FGSM (ϵ = 4) | 87870.05 | 2.46 | 4.00 | 100.00 | 99.66 | 98.74 |
| C&W 4×25 | 78018.50 | 0.97 | 2.00 | 94.81 | 99.52 | 87.59 |
| C&W 4×100 | 26943.00 | 0.46 | 4.99 | 100.00 | 94.80 | 78.84 |
| PGD0 (ϵ = 255) | 200.00 | 7.13 | 255.00 | 92.61 | 87.80 | 99.90 |
| σ-PGD0 (ϵ = 255) | 1000.00 | 3.13 | 113.92 | 90.98 | 79.00 | 94.90 |
| SparseFool (ϵ = 255) | 81.00 | 2.87 | 255.00 | 100.00 | 74.36 | 97.66 |
| GF (ours) (ϵ = 255) | 27.00 | 2.66 | 255.00 | 100.00 | 61.54 | 94.02 |
| SparseFool (ϵ = 10) | 642.00 | 0.59 | 10.00 | 100.00 | 73.58 | 86.79 |
| GF (ours) (ϵ = 10) | 222.50 | 0.51 | 10.00 | 100.00 | 54.00 | 67.90 |
| GF (ours) (ϵ = 5) | 472.00 | 0.36 | 5.00 | 100.00 | 54.32 | 59.78 |
| GF (ours) (ϵ = 2.5) | 1121.00 | 0.28 | 2.50 | 100.00 | 57.24 | 50.00 |
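For the classifier-based invisibility metric described above, the exact detector architecture and training schedule are given only in the supplementary material. The following is a minimal illustrative substitute (a torchvision ResNet-18 trained from scratch as a clean-vs-adversarial binary classifier); all hyperparameters here are our assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet18

def detector_accuracy(clean, adv, epochs=5, device="cuda"):
    """Train a binary CNN to separate clean from adversarial images and return
    its held-out accuracy; accuracy near 50% means the perturbations are
    statistically invisible to this detector."""
    x = torch.cat([clean, adv])
    y = torch.cat([torch.zeros(len(clean)), torch.ones(len(adv))]).long()
    perm = torch.randperm(len(x))
    split = int(0.8 * len(x))                      # 80/20 train/test split
    tr, te = perm[:split], perm[split:]

    model = resnet18(num_classes=2).to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loader = DataLoader(TensorDataset(x[tr], y[tr]), batch_size=32, shuffle=True)
    model.train()
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            F.cross_entropy(model(xb.to(device)), yb.to(device)).backward()
            opt.step()

    model.eval()
    correct = 0
    with torch.no_grad():
        for xb, yb in DataLoader(TensorDataset(x[te], y[te]), batch_size=32):
            correct += (model(xb.to(device)).argmax(1).cpu() == yb).sum().item()
    return correct / len(te)
```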
We find that I-FGSM with ϵ = 4 is the easiest to detect, as it uses a relatively large perturbation threshold and perturbs nearly all pixels. Consistent with the observation in [26], SRM is very sensitive to small perturbations: even for I-FGSM (ϵ = 1), its detection rate is above 99%. For C&W, since the perturbations on some pixels are smaller than 1 and are removed by the rounding operation when the image is saved, it is relatively sparse and has a slightly lower detection rate. When it comes to l0-based methods, since SRM is specially designed for detecting small but dense perturbations, it is not sensitive to large but sparse noise. For example, for SparseFool with ϵ = 255, even though the perturbation is obvious to a human, its detection rate is only 74.36%. PGD0 and σ-PGD0, which suffer from redundant perturbed pixels, have detection rates of 87.80% and 79.00% respectively. For our GreedyFool, with the distortion map guidance, the detection rate is much lower at 61.54%.

The classifier-based metric works well for most l2 and l∞ methods. For sparse attack methods under the l0 constraint, when the perturbation is large (ϵ = 255), a binary CNN classifier reaches 97.66% accuracy for SparseFool and 94.02% for GreedyFool, and PGD0 and σ-PGD0 also perform poorly for the same reason. As ϵ decreases, the detection accuracy drops and our GreedyFool performs better: when ϵ = 10, the classification accuracy is 86.79% for SparseFool but only 67.90% for our GreedyFool. When ϵ = 2.5, the accuracy drops to 50.00%, i.e., random guessing.

3.3 Ablation Study

Component Analysis of GreedyFool. We further analyze the contribution of each part of our method toward sparsity and invisibility. In the following, we denote the increasing stage as "Incr", and evaluate the effect of the reducing stage ("Reduce") and the introduced distortion map ("Dis"). Here we set ϵ = 10 and use the binary CNN classifier for invisibility evaluation.

Figure 3: The influence of gradient direction, evaluated by changing different percentiles q of the gradient into its sign (x-axis: percentile q from 0 to 100; y-axes: mean number of perturbed pixels and cosine similarity to the original gradient direction).

Table 7: The contribution of each part of GreedyFool. FR and Acc denote the fooling rate and the classifier accuracy.

| Method | Mean | Median | FR (%) | Acc (%) |
|---|---|---|---|---|
| Incr | 537.10 | 188.00 | 100 | 75.65 |
| Incr + Reduce | 426.77 | 161.00 | 100 | 75.35 |
| Incr + Dis | 607.22 | 243.50 | 100 | 68.50 |
| Incr + Reduce + Dis | 552.11 | 222.50 | 100 | 67.90 |

From the results in Table 7, we observe that with only the Incr part, our method already reaches state-of-the-art sparsity with a median modification number of 188. Meanwhile, the classifier accuracy is relatively low (similar to C&W 4×100 in Table 6); we believe this low accuracy benefits from the extremely low number of perturbed pixels. When we add the Reduce stage, the mean modification number decreases from 537.10 to 426.77, confirming our hypothesis that the result of the greedy strategy in the Incr stage is not optimal. As the modification number decreases, the classifier accuracy also decreases slightly. When we add the Dis part as the modification position guidance, the modification number increases considerably; this is reasonable because some pixels that strongly influence the adversary but have high distortion are abandoned. However, the classifier accuracy decreases significantly from 75.65% to 68.50%, which confirms that a proper modification position is crucial to invisibility. Finally, by combining the Reduce and Dis parts, our method achieves a satisfying result for both sparsity and invisibility.

Sign or Direction. As mentioned in Sec. 2.1, we find that when the dimension of the adversarial perturbation is small, its adversary is sensitive to its direction. Here we evaluate the influence of the direction change quantitatively. Formally, we sort the non-zero noise components by magnitude from large to small, then use the q-th percentile value to scale the noise and clamp the scaled magnitudes to [0, 1]; this sets the largest q percent of the noise components to their sign and scales the remaining smaller components proportionally. The direction of the new noise therefore equals the original gradient direction when q = 0 and degrades to the sign direction when q = 100. We set ϵ = 10 and measure the direction change by the cosine similarity.
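The percentile-based interpolation between the raw gradient direction (q = 0) and its sign (q = 100) used in this ablation can be written as a short sketch; the function name and the exact percentile indexing are ours.

```python
import torch
import torch.nn.functional as F

def percentile_sign_scale(noise, q):
    """Divide the noise by its q-th largest non-zero magnitude and clamp, so the
    top q percent of components become +/-1 (their sign) and the smaller ones are
    scaled proportionally. Returns the rescaled noise and its cosine similarity
    to the original direction (q = 0 keeps the direction, q = 100 gives sign)."""
    mags = noise.flatten().abs()
    mags = mags[mags > 0].sort(descending=True).values
    idx = int(round(q / 100 * (len(mags) - 1)))           # q-th percentile index
    scale = mags[idx]
    rescaled = torch.sign(noise) * (noise.abs() / scale).clamp(max=1.0)
    cos = F.cosine_similarity(rescaled.flatten(), noise.flatten(), dim=0)
    return rescaled, cos.item()
```

Sweeping q from 0 to 100 and attacking with the rescaled noise gives the curve summarized in Figure 3.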
Figure 3 shows the influence of the gradient direction change. When q = 0, the mean modification number is only 537.10. As q increases, the modification direction deviates from the original direction and the modification number increases. When q = 100, the cosine similarity drops to about 0.75 and the modification number increases to 1149.39, nearly 2× larger than the best case at q = 0. This phenomenon differs from the results of previous dense attack methods such as I-FGSM, for which the performance difference between using the sign or the full gradient is negligible. We propose an explanation based on the accumulation-of-infinitesimals idea in [17], which argues that adversarial samples exist because DNNs are not non-linear enough in high dimensions, so the infinitesimal changes caused by small perturbations accumulate almost linearly and eventually flip the prediction. For global l∞ attacks, even though the sign operation changes the direction, a global perturbation (268,203 dimensions for a 299 × 299 image with 3 channels) still accumulates a large number of infinitesimals to change the prediction. For our sparse attack, the dimension is much smaller (1,611.3 dimensions on average for 537.1 perturbed pixels with 3 channels), so the infinitesimal contributed by each dimension matters more and the attack is more sensitive to direction changes.

4 Conclusion

In this paper, we propose GreedyFool, a novel two-stage distortion-aware greedy-based sparse adversarial attack method. It achieves both much better sparsity and much better invisibility than existing state-of-the-art sparse adversarial attack methods. It first selects the most effective candidate positions to modify using the gradient, then utilizes a reducing stage to drop less important points. For better invisibility, we propose using a GAN-based distortion map as guidance in the first stage. In the future, we will further investigate how to incorporate the proposed idea into l2- and l∞-based adversarial attacks to achieve better invisibility.

Acknowledgement. This work was supported in part by the Natural Science Foundation of China under Grants U1636201, 62002334 and 62072421, by the Exploration Fund Project of the University of Science and Technology of China under Grant YD3480002001, and by the Fundamental Research Funds for the Central Universities under Grant WK2100000011.

Broader Impact. Technically, the newly proposed method boosts existing sparse adversarial attacks with much better sparsity and invisibility, which is substantial progress.
For the broader deep learning community, it will help researchers understand the fragility of CNNs more thoroughly and design more robust network structures.

References
[1] https://github.com/LTS4/SparseFool.
[2] https://github.com/fra31/sparse-imperceivable-attacks.
[3] https://github.com/bethgelab/foolbox/tree/v2.
[4] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397, 2017.
[5] Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3–14, 2017.
[6] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
[7] Anirban Chakraborty, Manaar Alam, Vishal Dey, Anupam Chattopadhyay, and Debdeep Mukhopadhyay. Adversarial attacks and defences: A survey. arXiv preprint arXiv:1810.00069, 2018.
[8] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26, 2017.
[9] Shuyu Cheng, Yinpeng Dong, Tianyu Pang, Hang Su, and Jun Zhu. Improving black-box adversarial attacks with a transfer-based prior. In Advances in Neural Information Processing Systems, pages 10932–10942, 2019.
[10] Francesco Croce and Matthias Hein. A randomized gradient-free attack on ReLU networks. In German Conference on Pattern Recognition, pages 215–227. Springer, 2018.
[11] Francesco Croce and Matthias Hein. Sparse and imperceivable adversarial attacks, 2019.
[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
[13] Xiaoyi Dong, Dongdong Chen, Hang Zhou, Gang Hua, Weiming Zhang, and Nenghai Yu. Self-robust 3D point recognition via gather-vector guidance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[14] Xiaoyi Dong, Jiangfan Han, Dongdong Chen, Jiayang Liu, Huanyu Bian, Zehua Ma, Hongsheng Li, Xiaogang Wang, Weiming Zhang, and Nenghai Yu. Robust superpixel-guided attentional adversarial attack. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[15] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Xiaolin Hu, Jianguo Li, and Jun Zhu. Boosting adversarial attacks with momentum. arXiv, 2017.
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[17] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
[18] Jiangfan Han, Xiaoyi Dong, Ruimao Zhang, Dongdong Chen, Weiming Zhang, Nenghai Yu, Ping Luo, and Xiaogang Wang. Once a man: Towards multi-target attack via learning multi-target adversarial network once. arXiv preprint arXiv:1908.05185, 2019.
[19] Jamie Hayes and George Danezis. Learning universal adversarial perturbations with generative models. arXiv, 2017.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
[21] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks, 2018.
[22] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[23] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv, 2016.
[24] Cassidy Laidlaw and Soheil Feizi. Functional adversarial attacks. In Advances in Neural Information Processing Systems, pages 10408–10418, 2019.
[25] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[26] Jiayang Liu, Weiming Zhang, Yiwei Zhang, Dongdong Hou, Yujia Liu, Hongyue Zha, and Nenghai Yu. Detection based defense against adversarial examples from the steganalysis point of view. In CVPR, pages 4825–4834, 2019.
[27] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[28] Dongyu Meng and Hao Chen. MagNet: A two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 135–147, 2017.
[29] Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267, 2017.
[30] Apostolos Modas, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. SparseFool: A few pixels make a big difference, 2018.
[31] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1765–1773, 2017.
[32] Nina Narodytska and Shiva Prasad Kasiviswanathan. Simple black-box adversarial perturbations for deep networks, 2016.
[33] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In ASIACCS, pages 506–519. ACM, 2017.
[34] Nicolas Papernot, Patrick D. McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. CoRR, abs/1511.07528, 2015.
[35] Omid Poursaeed, Isay Katsman, Bicheng Gao, and Serge Belongie. Generative adversarial perturbations. arXiv, 2017.
[36] Jérôme Rony, Luiz G. Hafemann, Luiz S. Oliveira, Ismail Ben Ayed, Robert Sabourin, and Eric Granger. Decoupling direction and norm for efficient gradient-based L2 adversarial attacks and defenses. CoRR, abs/1811.09600, 2018.
[37] Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. Towards the first adversarially robust neural network model on MNIST. arXiv preprint arXiv:1805.09190, 2018.
[38] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2015.
[39] Jiawei Su, Danilo Vasconcellos Vargas, and Kouichi Sakurai. One pixel attack for fooling deep neural networks. CoRR, abs/1710.08864, 2017.
[40] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826, 2016.
[41] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv, 2013.
[42] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv, 2017.
[43] Florian Tramèr, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. The space of transferable adversarial examples. arXiv, 2017.
[44] Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. Generating adversarial examples with adversarial networks. arXiv, 2018.
[45] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Yuyin Zhou, Lingxi Xie, and Alan Yuille. Adversarial examples for semantic segmentation and object detection. In ICCV, pages 1369–1378, 2017.
[46] Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv, 2017.
[47] Hang Zhou, Dongdong Chen, Jing Liao, Kejiang Chen, Xiaoyi Dong, Kunlin Liu, Weiming Zhang, Gang Hua, and Nenghai Yu. LG-GAN: Label guided adversarial network for flexible targeted attack of point cloud based deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.