# Generating Adversarial Examples with Adversarial Networks

Chaowei Xiao¹, Bo Li², Jun-Yan Zhu²,³, Warren He², Mingyan Liu¹ and Dawn Song²

¹University of Michigan, Ann Arbor
²University of California, Berkeley
³Massachusetts Institute of Technology

(This work was performed when Chaowei Xiao was at JD.COM and the University of California, Berkeley.)

Deep neural networks (DNNs) have been found to be vulnerable to adversarial examples resulting from adding small-magnitude perturbations to inputs. Such adversarial examples can mislead DNNs to produce adversary-selected results. Different attack strategies have been proposed to generate adversarial examples, but how to produce them with high perceptual quality and more efficiently requires more research effort. In this paper, we propose AdvGAN to generate adversarial examples with generative adversarial networks (GANs), which can learn and approximate the distribution of original instances. For AdvGAN, once the generator is trained, it can generate perturbations efficiently for any instance, so as to potentially accelerate adversarial training as a defense. We apply AdvGAN in both semi-whitebox and black-box attack settings. In semi-whitebox attacks, there is no need to access the original target model after the generator is trained, in contrast to traditional white-box attacks. In black-box attacks, we dynamically train a distilled model for the black-box model and optimize the generator accordingly. Adversarial examples generated by AdvGAN on different target models achieve high attack success rates under state-of-the-art defenses compared to other attacks. Our attack placed first on a public MNIST black-box attack challenge (https://github.com/MadryLab/mnist_challenge), reducing the robust model's accuracy to 92.76%.

## 1 Introduction

Deep Neural Networks (DNNs) have achieved great success in a variety of applications. However, recent work has demonstrated that DNNs are vulnerable to adversarial perturbations [Szegedy et al., 2014; Goodfellow et al., 2015]. An adversary can add small-magnitude perturbations to inputs and generate adversarial examples to mislead DNNs. Such maliciously perturbed instances can cause the learning system to misclassify them into either a maliciously chosen target class (in a targeted attack) or classes that are different from the ground truth (in an untargeted attack). Different algorithms have been proposed for generating such adversarial examples, such as the fast gradient sign method (FGSM) [Goodfellow et al., 2015] and optimization-based methods (Opt.) [Carlini and Wagner, 2017b; Liu et al., 2017; Xiao et al., 2018; Evtimov et al., 2017]. Most of the current attack algorithms [Carlini and Wagner, 2017b; Liu et al., 2017] rely on optimization schemes with simple pixel-space metrics, such as an $L_p$ distance from a benign image, to encourage visual realism. To generate more perceptually realistic adversarial examples efficiently, in this paper, we propose to train (i) a feed-forward network that generates perturbations to create diverse adversarial examples and (ii) a discriminator network to ensure that the generated examples are realistic. We apply generative adversarial networks (GANs) [Goodfellow et al., 2014] to produce adversarial examples in both the semi-whitebox and black-box settings. As conditional GANs are capable of producing high-quality images [Isola et al., 2017], we apply a similar paradigm to produce perceptually realistic adversarial instances.
We name our method AdvGAN. Note that in previous white-box attacks, such as FGSM and optimization-based methods, the adversary needs white-box access to the architecture and parameters of the model at all times. In contrast, once the feed-forward network of AdvGAN is trained, it can instantly produce adversarial perturbations for any input instance without requiring further access to the model itself. We name this attack setting semi-whitebox.

To evaluate the effectiveness of our attack strategy AdvGAN, we first generate adversarial instances based on AdvGAN and other attack strategies on different target models. We then apply state-of-the-art defenses against these generated adversarial examples [Goodfellow et al., 2015; Madry et al., 2017]. We evaluate these attack strategies in both semi-whitebox and black-box settings. We show that adversarial examples generated by AdvGAN can achieve a high attack success rate, potentially because these adversarial instances appear closer to real instances than those produced by other recent attack strategies.

Our contributions are listed as follows.

- Different from previous optimization-based methods, we train a conditional adversarial network to directly produce adversarial examples, which not only yields perceptually realistic examples that achieve a state-of-the-art attack success rate against different target models, but also makes the generation process more efficient.
- We show that AdvGAN can attack black-box models by training a distilled model. We propose to dynamically train the distilled model with query information and achieve a high black-box attack success rate, including for targeted attacks, which are difficult to achieve with transferability-based black-box attacks.
- We use state-of-the-art defense methods to defend against adversarial examples and show that AdvGAN achieves a much higher attack success rate under current defenses.
- We apply AdvGAN to Madry et al.'s MNIST challenge (2017) and achieve 88.93% accuracy on the published robust model in the semi-whitebox setting and 92.76% in the black-box setting, which wins the top position in the challenge.

## 2 Related Work

Here we review recent work on adversarial examples and generative adversarial networks.

**Adversarial Examples.** A number of attack strategies for generating adversarial examples have been proposed in the white-box setting, where the adversary has full access to the classifier [Szegedy et al., 2014; Goodfellow et al., 2015; Carlini and Wagner, 2017b; Xiao et al., 2018]. Goodfellow et al. propose the fast gradient sign method (FGSM), which applies a first-order approximation of the loss function to construct adversarial samples. Formally, given an instance $x$, an adversary generates an adversarial example $x_A = x + \eta$ under an $L_\infty$ constraint in the untargeted attack setting as $\eta = \epsilon \cdot \mathrm{sign}(\nabla_x \ell_f(x, y))$, where $\ell_f(\cdot)$ is the cross-entropy loss used to train the neural network $f$, and $y$ represents the ground truth of $x$. Optimization-based methods (Opt.) have also been proposed to optimize adversarial perturbations for targeted attacks while satisfying certain constraints [Carlini and Wagner, 2017b; Liu et al., 2017]. Their goal is to minimize an objective of the form $\|\eta\| + \lambda \ell_f(x_A, y)$, where $\|\cdot\|$ is an appropriately chosen norm. However, the optimization process is slow and can only optimize the perturbation for one specific instance at a time.
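For concreteness, the FGSM update above can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the authors' implementation: the function name, the assumed [0, 1] pixel range, and the use of a generic `torch.nn.Module` classifier are assumptions.

```python
import torch
import torch.nn.functional as F

def fgsm_untargeted(model, x, y, epsilon):
    """Minimal FGSM sketch: x_A = x + epsilon * sign(grad_x loss(f(x), y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)        # ell_f(x, y), the training loss
    loss.backward()
    eta = epsilon * x.grad.sign()              # L_inf-bounded perturbation
    return (x + eta).detach().clamp(0.0, 1.0)  # assume pixels live in [0, 1]
```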
In contrast, our method uses a feed-forward network to generate an adversarial image rather than an optimization procedure. Our method achieves a higher attack success rate against different defenses and runs much faster than the current attack algorithms. Independently from our work, feed-forward networks have been applied to generate adversarial perturbations [Baluja and Fischer, 2017]. However, Baluja and Fischer combine a re-ranking loss and an L2 norm loss, aiming to constrain the generated adversarial instance to be close to the original one in terms of L2, while we apply a deep neural network as a discriminator that distinguishes the generated instance from real images, encouraging the perceptual quality of the generated adversarial examples.

Figure 1: Overview of AdvGAN.

**Black-box Attacks.** Current learning systems usually do not allow white-box access to the model for security reasons. Therefore, there is a great need for black-box attack analysis. Most black-box attack strategies are based on the transferability phenomenon [Papernot et al., 2016], where an adversary first trains a local model and generates adversarial examples against it, hoping the same adversarial examples will also be able to attack the other models. Many learning systems allow query access to the model. However, there is little work that leverages query-based access to target models to construct adversarial samples and move beyond transferability. Papernot et al. proposed to train a local substitute model with queries to the target model to generate adversarial samples, but this strategy still relies on transferability. In contrast, we show that the proposed AdvGAN can perform black-box attacks without depending on transferability.

**Generative Adversarial Networks (GANs).** GANs [Goodfellow et al., 2014] have achieved visually appealing results in both image generation and manipulation [Zhu et al., 2016] settings. Recently, image-to-image conditional GANs have further improved the quality of synthesis results [Isola et al., 2017]. We adopt a similar adversarial loss and image-to-image network architecture to learn the mapping from an original image to a perturbed output such that the perturbed image cannot be distinguished from real images in the original class. Different from prior work, we aim to produce output results that are not only visually realistic but also able to mislead target learning models.

## 3 Generating Adversarial Examples with Adversarial Networks

### 3.1 Problem Definition

Let $\mathcal{X} \subseteq \mathbb{R}^n$ be the feature space, with $n$ the number of features. Suppose that $(x_i, y_i)$ is the $i$-th instance within the training set, comprised of feature vectors $x_i \in \mathcal{X}$ generated according to some unknown distribution $x_i \sim P_{\mathrm{data}}$, and $y_i \in \mathcal{Y}$ the corresponding true class labels. The learning system aims to learn a classifier $f: \mathcal{X} \rightarrow \mathcal{Y}$ from the domain $\mathcal{X}$ to the set of classification outputs $\mathcal{Y}$, where $|\mathcal{Y}|$ denotes the number of possible classification outputs. Given an instance $x$, the goal of an adversary is to generate an adversarial example $x_A$ which is classified as $f(x_A) \neq y$ (untargeted attack), where $y$ denotes the true label, or as $f(x_A) = t$ (targeted attack), where $t$ is the target class. $x_A$ should also be close to the original instance $x$ in terms of L2 or another distance metric.
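To make the attack goal concrete, a tiny PyTorch-style helper that checks the untargeted condition $f(x_A) \neq y$ and the targeted condition $f(x_A) = t$ might look like the following; the function and argument names are hypothetical and the snippet only restates the definition above.

```python
import torch

def attack_succeeded(model, x_adv, y_true, target=None):
    """Check the attack goal from Section 3.1 (illustrative helper).

    Untargeted attack: f(x_A) != y.  Targeted attack: f(x_A) == t.
    Returns a boolean tensor with one entry per example in the batch.
    """
    pred = model(x_adv).argmax(dim=1)
    if target is None:            # untargeted attack
        return pred != y_true
    return pred == target         # targeted attack
```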
### 3.2 AdvGAN Framework

Figure 1 illustrates the overall architecture of AdvGAN, which mainly consists of three parts: a generator $G$, a discriminator $D$, and the target neural network $f$. The generator $G$ takes the original instance $x$ as its input and generates a perturbation $G(x)$. Then $x + G(x)$ is sent to the discriminator $D$, which is used to distinguish the generated data from the original instance $x$. The goal of $D$ is to encourage that the generated instance is indistinguishable from the data of its original class. To fulfill the goal of fooling a learning model, we first perform the white-box attack, where the target model is $f$. $f$ takes $x + G(x)$ as its input and outputs its loss $\mathcal{L}_{\mathrm{adv}}$, which represents the distance between the prediction and the target class $t$ (targeted attack), or the opposite of the distance between the prediction and the ground-truth class (untargeted attack).

The adversarial loss [Goodfellow et al., 2014] can be written as

$$\mathcal{L}_{\mathrm{GAN}} = \mathbb{E}_x \log D(x) + \mathbb{E}_x \log\big(1 - D(x + G(x))\big). \tag{1}$$

(For simplicity, we write $\mathbb{E}_x$ for $\mathbb{E}_{x \sim P_{\mathrm{data}}}$.) Here, the discriminator $D$ aims to distinguish the perturbed data $x + G(x)$ from the original data $x$. Note that the real data is sampled from the true class, so as to encourage that the generated instances are close to data from the original class.

The loss for fooling the target model $f$ in a targeted attack is

$$\mathcal{L}^f_{\mathrm{adv}} = \mathbb{E}_x\, \ell_f\big(x + G(x), t\big), \tag{2}$$

where $t$ is the target class and $\ell_f$ denotes the loss function (e.g., cross-entropy loss) used to train the original model $f$. The $\mathcal{L}^f_{\mathrm{adv}}$ loss encourages the perturbed image to be misclassified as target class $t$. We could also perform the untargeted attack by maximizing the distance between the prediction and the ground truth, but we focus on the targeted attack in the rest of the paper.

To bound the magnitude of the perturbation, which is a common practice in prior work [Carlini and Wagner, 2017b; Liu et al., 2017], we add a soft hinge loss on the L2 norm,

$$\mathcal{L}_{\mathrm{hinge}} = \mathbb{E}_x \max\big(0, \|G(x)\|_2 - c\big), \tag{3}$$

where $c$ denotes a user-specified bound. This can also stabilize the GAN's training, as shown in Isola et al. (2017). Finally, our full objective can be expressed as

$$\mathcal{L} = \mathcal{L}^f_{\mathrm{adv}} + \alpha \mathcal{L}_{\mathrm{GAN}} + \beta \mathcal{L}_{\mathrm{hinge}}, \tag{4}$$

where $\alpha$ and $\beta$ control the relative importance of each objective. Note that $\mathcal{L}_{\mathrm{GAN}}$ here is used to encourage the perturbed data to appear similar to the original data $x$, while $\mathcal{L}^f_{\mathrm{adv}}$ is leveraged to generate adversarial examples, optimizing for a high attack success rate. We obtain our $G$ and $D$ by solving the min-max game $\arg\min_G \max_D \mathcal{L}$. Once $G$ is trained on the training data and the target model, it can produce perturbations for any input instance to perform a semi-whitebox attack.
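To make the objective concrete, the following is a minimal PyTorch-style sketch of how the generator's loss in Eq. (4) could be assembled for one batch. It is an illustration under several assumptions: `G`, `D`, and `f` are assumed to be `torch.nn.Module`s, `D` is assumed to output a probability in (0, 1), and the default hyperparameter values are placeholders (the experiments in Section 4 actually use an LSGAN least-squares variant of the GAN loss, which is not shown here).

```python
import torch
import torch.nn.functional as F

def generator_loss(G, D, f, x, target, alpha=1.0, beta=1.0, c=0.3):
    """Sketch of the full AdvGAN objective of Eq. (4) from the generator's side.

    G: generator, D: discriminator (assumed to output a probability),
    f: target (or distilled) classifier, target: desired class t.
    alpha, beta, c are illustrative hyperparameter values.
    """
    perturbation = G(x)
    x_adv = x + perturbation

    # L_adv (Eq. 2): push f to classify x + G(x) as the target class t.
    l_adv = F.cross_entropy(f(x_adv), target)

    # Generator part of L_GAN (Eq. 1): make D score x + G(x) as "real".
    l_gan = torch.log(1.0 - D(x_adv) + 1e-8).mean()

    # L_hinge (Eq. 3): soft bound c on the L2 norm of the perturbation.
    l2 = perturbation.flatten(1).norm(dim=1)
    l_hinge = torch.clamp(l2 - c, min=0.0).mean()

    return l_adv + alpha * l_gan + beta * l_hinge
```

A full training loop would alternate this generator update with a discriminator update that maximizes Eq. (1) (or its least-squares counterpart); that step is omitted here for brevity.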
### 3.3 Black-box Attacks with Adversarial Networks

**Static Distillation.** For black-box attacks, we assume adversaries have no prior knowledge of the training data or the model itself. In our experiments in Section 4, we randomly draw data that is disjoint from the training data of the black-box model to distill it, since we assume the adversaries have no prior knowledge about the training data or the model. To achieve black-box attacks, we first build a distilled network $f$ based on the output of the black-box model $b$ [Hinton et al., 2015]. Once we obtain the distilled network $f$, we carry out the same attack strategy as described in the white-box setting (see Equation (4)). Specifically, we minimize the following network distillation objective:

$$\arg\min_f \mathbb{E}_x\, H\big(f(x), b(x)\big), \tag{5}$$

where $f(x)$ and $b(x)$ denote the outputs of the distilled model and the black-box model, respectively, for a given training image $x$, and $H$ denotes the commonly used cross-entropy loss. By optimizing this objective over all the training images, we obtain a model $f$ that behaves very similarly to the black-box model $b$. We then carry out the attack on the distilled network. Note that unlike training the discriminator $D$, where we only use the real data from the original class to encourage that the generated instance is close to its original class, here we train the distilled model with data from all classes.

**Dynamic Distillation.** Training the distilled model only with the pristine training data is not enough, since it is unclear how closely the black-box and distilled models agree on the generated adversarial examples, which have not appeared in the training set before. We therefore propose an alternating minimization approach that dynamically makes queries and trains the distilled model $f$ and our generator $G$ jointly. We perform the following two steps in each iteration $i$:

1. Update $G_i$ given a fixed network $f_{i-1}$: we follow the white-box setting (see Equation (4)) and train the generator and discriminator based on the previously distilled model $f_{i-1}$, initializing the weights of $G_i$ with $G_{i-1}$: $G_i, D_i = \arg\min_G \max_D \mathcal{L}^{f_{i-1}}_{\mathrm{adv}} + \alpha \mathcal{L}_{\mathrm{GAN}} + \beta \mathcal{L}_{\mathrm{hinge}}$.
2. Update $f_i$ given a fixed generator $G_i$: first, we use $f_{i-1}$ to initialize $f_i$. Then, given the generated adversarial examples $x + G_i(x)$ from $G_i$, the distilled model $f_i$ is updated based on the new query results for the generated adversarial examples against the black-box model, as well as the original training images: $f_i = \arg\min_f \mathbb{E}_x\, H\big(f(x), b(x)\big) + \mathbb{E}_x\, H\big(f(x + G_i(x)), b(x + G_i(x))\big)$, where we use both the original images $x$ and the newly generated adversarial examples $x + G_i(x)$ to update $f$.

In the experiment section, we compare the performance of the static and dynamic distillation approaches and observe that simultaneously updating $G$ and $f$ yields higher attack performance. See Table 2 for more details.
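The alternating scheme above can be sketched in PyTorch-style code as follows. This is a minimal illustration, not the authors' implementation: it reuses the hypothetical `generator_loss` sketch from Section 3.2, treats the black-box model as a callable returning logits, and omits the discriminator update and mini-batching details for brevity.

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy H(f(x), b(x)) with soft targets from the black-box model."""
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

def dynamic_distillation_step(G, D, f, blackbox, x, target, g_opt, f_opt):
    """One iteration of the alternating updates in Section 3.3 (illustrative)."""
    # Step 1: update G with the distilled model f held fixed (Eq. 4 objective).
    g_opt.zero_grad()
    generator_loss(G, D, f, x, target).backward()
    g_opt.step()

    # Step 2: update f with G held fixed, matching black-box query results
    # on both the original images and the new adversarial examples.
    with torch.no_grad():
        x_adv = x + G(x)
        b_clean = blackbox(x).softmax(dim=1)    # queried outputs b(x)
        b_adv = blackbox(x_adv).softmax(dim=1)  # queried outputs b(x + G(x))
    f_opt.zero_grad()
    f_loss = soft_cross_entropy(f(x), b_clean) + soft_cross_entropy(f(x_adv), b_adv)
    f_loss.backward()
    f_opt.step()
```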
## 4 Experimental Results

In this section, we first evaluate AdvGAN in both the semi-whitebox and black-box settings on MNIST [LeCun and Cortes, 1998] and CIFAR-10 [Krizhevsky and Hinton, 2009]. We also perform a semi-whitebox attack on the ImageNet dataset [Deng et al., 2009].

| | FGSM | Opt. | Trans. | AdvGAN |
|---|---|---|---|---|
| Run time | 0.06s | >3h | - | <0.01s |
| Targeted attack | ✓ | ✓ | Ens. | ✓ |
| Black-box attack | ✗ | ✗ | ✓ | ✓ |

Table 1: Comparison with the state-of-the-art attack methods. Run time is measured for generating 1,000 adversarial instances during test time. Opt. represents the optimization-based method, and Trans. denotes black-box attacks based on transferability.

| Metric | MNIST: A | MNIST: B | MNIST: C | CIFAR-10: ResNet | CIFAR-10: Wide ResNet |
|---|---|---|---|---|---|
| Accuracy (p) | 99.0 | 99.2 | 99.1 | 92.4 | 95.0 |
| Attack Success Rate (w) | 97.9 | 97.1 | 98.3 | 94.7 | 99.3 |
| Attack Success Rate (b-D) | 93.4 | 90.1 | 94.0 | 78.5 | 81.8 |
| Attack Success Rate (b-S) | 30.7 | 66.6 | 87.3 | 10.3 | 13.3 |

Table 2: Accuracy of different models on pristine data, and the attack success rate (%) of adversarial examples generated against different models by AdvGAN on MNIST and CIFAR-10. p: pristine test data; w: semi-whitebox attack; b-D: black-box attack with dynamic distillation strategy; b-S: black-box attack with static distillation strategy.

We then apply AdvGAN to generate adversarial examples on different target models and test the attack success rate under state-of-the-art defenses, showing that our method achieves higher attack success rates than other existing attack strategies. For a fair comparison, we generate all adversarial examples for the different attack methods under an $L_\infty$ bound of 0.3 on MNIST and 8 on CIFAR-10.

In general, as shown in Table 1, AdvGAN has several advantages over other white-box and black-box attacks. For instance, regarding computational efficiency, AdvGAN runs much faster than the others, including the efficient FGSM, although AdvGAN needs extra time to train the generator. All of these strategies can perform targeted attacks except the transferability-based attack, although the ensemble strategy can help improve it. Moreover, FGSM and the optimization methods can only perform white-box attacks, while AdvGAN is able to attack in the semi-whitebox setting.

**Implementation Details.** We adopt generator and discriminator architectures similar to those in the image-to-image translation literature [Isola et al., 2017; Zhu et al., 2017]. We apply the loss from Carlini and Wagner (2017b) as our loss $\mathcal{L}^f_{\mathrm{adv}} = \max(\max_{i \neq t} f(x_A)_i - f(x_A)_t, \kappa)$, where $t$ is the target class, and $f$ represents the target network in the semi-whitebox setting and the distilled model in the black-box setting. We set the confidence $\kappa = 0$ for both Opt. and AdvGAN. We use Adam as our solver [Kingma and Ba, 2015], with a batch size of 128 and a learning rate of 0.001. For GAN training, we use the least-squares objective proposed by LSGAN [Mao et al., 2017], as it has been shown to produce better results with more stable training.

**Models Used in the Experiments.** For MNIST we generate adversarial examples for three models, where models A and B are used in Tramèr et al. (2017), and model C is the target network architecture used in Carlini and Wagner (2017b). For CIFAR-10, we select ResNet-32 and Wide ResNet-34 [He et al., 2016; Zagoruyko and Komodakis, 2016]. Specifically, we use a 32-layer ResNet implemented in TensorFlow (github.com/tensorflow/models/blob/master/research/resnet) and a Wide ResNet derived from the w32-10 variant (github.com/MadryLab/cifar10_challenge/blob/master/model.py). We show the classification accuracy on pristine MNIST and CIFAR-10 test data (p) and the attack success rate of adversarial examples generated by AdvGAN against different models in Table 2.

Figure 2: Adversarial examples generated from the same original image to different targets by AdvGAN on MNIST. Row 1: semi-whitebox attack; Row 2: black-box attack. Left to right: models A, B, and C. On the diagonal, the original images are shown, and the numbers on top denote the targets.

### 4.1 AdvGAN in Semi-whitebox Setting

We evaluate AdvGAN on $f$ with different architectures for MNIST and CIFAR-10. We first apply AdvGAN to perform a semi-whitebox attack against different models on the MNIST dataset. From the semi-whitebox attack performance (Attack Success Rate (w)) in Table 2, we can see that AdvGAN is able to generate adversarial instances that attack all models with a high attack success rate. We also generate adversarial examples from the same original instance $x$, targeting different classes, as shown in Figure 2. In the semi-whitebox setting on MNIST (a)-(c), the generated adversarial examples for different models appear close to the ground-truth/pristine images (lying on the diagonal of the matrix).
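As a concrete reference for the $\mathcal{L}^f_{\mathrm{adv}}$ term used in the implementation details above (and in the loss-function comparison that follows), here is a minimal PyTorch-style sketch of the Carlini-Wagner style targeted loss; the function name and the batch-mean reduction are assumptions.

```python
import torch

def cw_targeted_loss(logits, target, kappa=0.0):
    """Sketch of max(max_{i != t} f(x_A)_i - f(x_A)_t, kappa), averaged over a batch."""
    target_logit = logits.gather(1, target.unsqueeze(1)).squeeze(1)
    # Mask the target class out before taking the max over the other logits.
    others = logits.clone()
    others.scatter_(1, target.unsqueeze(1), float('-inf'))
    max_other = others.max(dim=1).values
    return torch.clamp(max_other - target_logit, min=kappa).mean()
```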
In addition, we analyze the attack success rate under different loss functions on MNIST. Under the same bounded perturbation (0.3), if we replace the full loss function in (4) with $\mathcal{L} = \|G(x)\|_2 + \mathcal{L}^f_{\mathrm{adv}}$, which is similar to the objective used in Baluja and Fischer, the attack success rate becomes 86.2%. If we replace the loss function with $\mathcal{L} = \mathcal{L}_{\mathrm{hinge}} + \mathcal{L}^f_{\mathrm{adv}}$, the attack success rate is 91.1%, compared to 98.3% for AdvGAN.

Similarly, on CIFAR-10, we apply the same semi-whitebox attack against ResNet and Wide ResNet with AdvGAN, and Figure 3 (a) shows some adversarial examples, which are perceptually realistic. We show adversarial examples for the same original instance targeting different other classes. It is clear that with different targets, the adversarial examples keep similar visual quality compared to the pristine instances on the diagonal.

Figure 3: Adversarial examples generated by AdvGAN on CIFAR-10 for (a) semi-whitebox attack and (b) black-box attack. An image from each class is perturbed toward the other classes. On the diagonal, the original images are shown.

### 4.2 AdvGAN in Black-box Setting

Our black-box attack here is based on the dynamic distillation strategy. We construct a local model to distill the black-box model, and we select the architecture of model C as our local model. Note that we randomly select a subset of instances disjoint from the training data of the black-box model to train the local model; that is, we assume the adversaries do not have any prior knowledge of the training data or the model itself. With the dynamic distillation strategy, the adversarial examples generated by AdvGAN achieve attack success rates above 90% on MNIST and around 80% on CIFAR-10, compared to roughly 30% and 10% with the static distillation approach, as shown in Table 2. We apply AdvGAN to generate adversarial examples for the same instance targeting different classes on MNIST and randomly select some instances to show in Figure 2 (d)-(f). Comparing with the pristine instances on the diagonal, we can see that these adversarial instances achieve perceptual quality comparable to the original digits. Specifically, the original digit is somewhat highlighted by adversarial perturbations, which implies a type of perceptually realistic manipulation. Figure 3 (b) shows similar results for adversarial examples generated on CIFAR-10. These adversarial instances appear photo-realistic compared with the original ones on the diagonal.

### 4.3 Attack Effectiveness Under Defenses

Various defenses have been proposed against different types of attack strategies. Among them, different types of adversarial training methods are the most effective. Other categories of defenses, such as those that pre-process an input, have mostly been defeated by adaptive attacks [He et al., 2017; Carlini and Wagner, 2017a]. Goodfellow et al. first propose adversarial training as an effective way to improve the robustness of DNNs, and Tramèr et al. extend it to ensemble adversarial training. Madry et al. have also proposed robust networks against adversarial examples based on well-defined adversaries. Given that AdvGAN strives to generate adversarial instances from the underlying true data distribution, it can essentially produce more photo-realistic adversarial perturbations than other attack strategies.
Thus, AdvGAN could have a higher chance of producing adversarial examples that remain effective under different defense methods. In this section, we quantitatively evaluate this property of AdvGAN compared with other attack strategies.

**Threat Model.** As shown in the literature, most of the current defense strategies are not robust against attacks that target them [Carlini and Wagner, 2017b; He et al., 2017]. Here we consider a weaker threat model, where the adversary is not aware of the defenses and directly tries to attack the original learning model, which is also the first threat model analyzed in Carlini and Wagner. In this case, if an adversary can still successfully attack the model, it indicates that the attack strategy is robust. Under this setting, we first apply the different attack methods to generate adversarial examples against the original model, without awareness of any defense. We then apply the different defenses to directly defend against these adversarial instances.

**Semi-whitebox Attack.** First, we consider the semi-whitebox attack setting, where the adversary has white-box access to the model architecture as well as the parameters. Here, we replace $f$ in Figure 1 with our models A, B, and C, respectively, so adversarial examples are generated against each model. We use three adversarial training defenses to train defended models for each architecture: standard FGSM adversarial training (Adv.) [Goodfellow et al., 2015], ensemble adversarial training (Ens.) [Tramèr et al., 2017], and iterative adversarial training (Iter. Adv.) [Madry et al., 2017]. (Each ensemble adversarially trained model is trained using (i) pristine training data, (ii) FGSM adversarial examples generated for the current model under training, and (iii) FGSM adversarial examples generated for naturally trained models of two architectures different from the model under training.) We evaluate the effectiveness of these attacks against the defended models. In Table 3, we show that the attack success rate of adversarial examples generated by AdvGAN on different models is higher than that of FGSM and Opt. [Carlini and Wagner, 2017b].

| Data | Model | Defense | FGSM | Opt. | AdvGAN |
|---|---|---|---|---|---|
| MNIST | A | Adv. | 4.3% | 4.6% | 8.0% |
| | | Ens. | 1.6% | 4.2% | 6.3% |
| | | Iter. Adv. | 4.4% | 2.96% | 5.6% |
| | B | Adv. | 6.0% | 4.5% | 7.2% |
| | | Ens. | 2.7% | 3.18% | 5.8% |
| | | Iter. Adv. | 9.0% | 3.0% | 6.6% |
| | C | Adv. | 2.7% | 2.95% | 18.7% |
| | | Ens. | 1.6% | 2.2% | 13.5% |
| | | Iter. Adv. | 1.6% | 1.9% | 12.6% |
| CIFAR-10 | ResNet | Adv. | 13.10% | 11.9% | 16.03% |
| | | Ens. | 10.00% | 10.3% | 14.32% |
| | | Iter. Adv. | 22.8% | 21.4% | 29.47% |
| | Wide ResNet | Adv. | 5.04% | 7.61% | 14.26% |
| | | Ens. | 4.65% | 8.43% | 13.94% |
| | | Iter. Adv. | 14.9% | 13.90% | 20.75% |

Table 3: Attack success rate of adversarial examples generated by AdvGAN in the semi-whitebox setting, and by other white-box attacks, under defenses on MNIST and CIFAR-10.
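For reference, the standard FGSM adversarial training defense (Adv.) evaluated above can be sketched as follows, reusing the hypothetical `fgsm_untargeted` helper from Section 2; the 50/50 mixing of clean and adversarial losses and the epsilon value are illustrative assumptions rather than the exact training recipe used here.

```python
import torch
import torch.nn.functional as F

def fgsm_adv_training_step(model, optimizer, x, y, epsilon=0.3, adv_weight=0.5):
    """One step of FGSM adversarial training (the "Adv." defense), sketched:
    train on a mixture of clean and FGSM-perturbed examples."""
    x_adv = fgsm_untargeted(model, x, y, epsilon)  # craft adversarial batch on the fly
    optimizer.zero_grad()
    loss = (1.0 - adv_weight) * F.cross_entropy(model(x), y) \
         + adv_weight * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```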
| Defense | FGSM (MNIST) | Opt. (MNIST) | AdvGAN (MNIST) | FGSM (CIFAR-10) | Opt. (CIFAR-10) | AdvGAN (CIFAR-10) |
|---|---|---|---|---|---|---|
| Adv. | 3.1% | 3.5% | 11.5% | 13.58% | 10.8% | 15.96% |
| Ens. | 2.5% | 3.4% | 10.3% | 10.49% | 9.6% | 12.47% |
| Iter. Adv. | 2.4% | 2.5% | 12.2% | 22.96% | 21.70% | 24.28% |

Table 4: Attack success rate of adversarial examples generated by different black-box adversarial strategies under defenses on MNIST and CIFAR-10.

| Method | Accuracy (xent loss) | Accuracy (cw loss) |
|---|---|---|
| FGSM | 95.23% | 96.29% |
| PGD | 93.66% | 93.79% |
| Opt. | - | 91.69% |
| AdvGAN | - | 88.93% |

Table 5: Accuracy of the MadryLab public model under different attacks in the white-box setting (lower accuracy means a stronger attack). AdvGAN achieved the best performance.

**Black-box Attack.** For AdvGAN, we use model B as the black-box model, train a distilled model, and perform a black-box attack against model B; the attack success rate is reported in Table 4. For comparison in the black-box setting, transferability-based attacks are applied for FGSM and Opt.: we use FGSM and Opt. to attack model A on MNIST and then test these adversarial examples on model B, reporting the corresponding results. We can see that the adversarial examples generated by the black-box AdvGAN consistently achieve a much higher attack success rate than the other attack methods. For CIFAR-10, we use a ResNet as the black-box model and train a distilled model to perform a black-box attack against the ResNet. To evaluate black-box attacks for the optimization method and FGSM, we use adversarial examples generated by attacking Wide ResNet and test them on the ResNet to report black-box results for these two methods.

In addition, we apply AdvGAN to the MNIST challenge. Among all the standard attacks shown in Table 5, AdvGAN achieves the lowest remaining model accuracy, 88.93%, in the white-box setting. Among the reported black-box attacks, AdvGAN reduced the model's accuracy to 92.76%, outperforming all other state-of-the-art attack strategies submitted to the challenge.

### 4.4 High-Resolution Adversarial Examples

To evaluate AdvGAN's ability to generate high-resolution adversarial examples, we attack Inception_v3 and quantify the attack success rate and perceptual realism of the generated adversarial examples.

**Experiment settings.** In the following experiments, we select 100 benign images from the DEV dataset of the NIPS 2017 adversarial attack competition [Kurakin et al., 2018]. This competition provided a dataset compatible with ImageNet. We generate adversarial examples (299 × 299 pixels), each targeting a random incorrect class, with the $L_\infty$ norm of the perturbation bounded by 0.01 for Inception_v3. The attack success rate is 100%. In Figure 4, we show some randomly selected pairs of original and adversarial examples generated by AdvGAN.

Figure 4: Examples from an ImageNet-compatible set; the labels ((a) Strawberry, (b) Toy poodle, (c) Buckeye, (d) Toy poodle) denote the corresponding classification results. Left: original benign images; right: adversarial images generated by AdvGAN against Inception_v3.

**Human Perceptual Study.** We validate the realism of AdvGAN's adversarial examples with a user study on Amazon Mechanical Turk (AMT). We use 100 pairs of original images and adversarial examples (generated as described above) and ask workers to choose which image of a pair is more visually realistic. Our study follows a protocol from Isola et al., where a worker is shown a pair of images for 2 seconds and then has unlimited time to decide. We limit each worker to at most 20 of these tasks. We collected 500 choices, about 5 per pair of images, from 50 workers on AMT. The AdvGAN examples were chosen as more realistic than the original image in 49.4% ± 1.96% of the tasks (random guessing would result in about 50%). This result shows that these high-resolution AdvGAN adversarial examples are about as realistic as benign images.

## 5 Conclusion

In this paper, we propose AdvGAN to generate adversarial examples using generative adversarial networks (GANs). In our AdvGAN framework, once trained, the feed-forward generator can produce adversarial perturbations efficiently. It can also perform both semi-whitebox and black-box attacks with a high attack success rate.
In addition, when we apply AdvGAN to generate adversarial instances against different models without knowledge of the defenses in place, the generated adversarial examples preserve high perceptual quality and attack the state-of-the-art defenses with a higher attack success rate than examples generated by the competing methods. This property makes AdvGAN a promising candidate for improving adversarial training defense methods.

## Acknowledgments

This work was supported in part by Berkeley Deep Drive, JD.COM, the Center for Long-Term Cybersecurity, and FORCES (Foundations Of Resilient CybEr-Physical Systems), which receives support from the National Science Foundation (NSF award numbers CNS-1238959, CNS-1238962, CNS-1239054, CNS-1239166, CNS-1422211 and CNS-1616575).

## References

[Baluja and Fischer, 2017] Shumeet Baluja and Ian Fischer. Adversarial transformation networks: Learning to generate adversarial examples. arXiv preprint arXiv:1703.09387, 2017.

[Carlini and Wagner, 2017a] Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3-14. ACM, 2017.

[Carlini and Wagner, 2017b] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39-57. IEEE, 2017.

[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248-255. IEEE, 2009.

[Evtimov et al., 2017] Ivan Evtimov, Kevin Eykholt, Earlence Fernandes, Tadayoshi Kohno, Bo Li, Atul Prakash, Amir Rahmati, and Dawn Song. Robust physical-world attacks on machine learning models. arXiv preprint arXiv:1707.08945, 2017.

[Goodfellow et al., 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672-2680, 2014.

[Goodfellow et al., 2015] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.

[He et al., 2017] Warren He, James Wei, Xinyun Chen, Nicholas Carlini, and Dawn Song. Adversarial example defenses: Ensembles of weak defenses are not strong. arXiv preprint arXiv:1706.04701, 2017.

[Hinton et al., 2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[Isola et al., 2017] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.

[Kingma and Ba, 2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[Krizhevsky and Hinton, 2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[Kurakin et al., 2018] Alexey Kurakin, Ian Goodfellow, Samy Bengio, Yinpeng Dong, Fangzhou Liao, Ming Liang, Tianyu Pang, Jun Zhu, Xiaolin Hu, Cihang Xie, et al. Adversarial attacks and defences competition. arXiv preprint arXiv:1804.00097, 2018.
[LeCun and Cortes, 1998] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits. 1998.

[Liu et al., 2017] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. In ICLR, 2017.

[Mao et al., 2017] Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2813-2821. IEEE, 2017.

[Madry et al., 2017] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv:1706.06083 [cs, stat], June 2017.

[Papernot et al., 2016] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against deep learning systems using adversarial examples. arXiv preprint, 2016.

[Szegedy et al., 2014] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.

[Tramèr et al., 2017] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.

[Xiao et al., 2018] Chaowei Xiao, Jun-Yan Zhu, Bo Li, Warren He, Mingyan Liu, and Dawn Song. Spatially transformed adversarial examples. arXiv preprint arXiv:1801.02612, 2018.

[Zagoruyko and Komodakis, 2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

[Zhu et al., 2016] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, pages 597-613. Springer, 2016.

[Zhu et al., 2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pages 2242-2251, 2017.