# Distributionally Adversarial Attack

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Tianhang Zheng,¹ Changyou Chen,¹ Kui Ren¹,²
¹State University of New York at Buffalo, ²Zhejiang University
{tzheng4, changyou, kuiren}@buffalo.edu

## Abstract

Recent work on adversarial attack has shown that the Projected Gradient Descent (PGD) adversary is a universal first-order adversary, and that a classifier adversarially trained by PGD is robust against a wide range of first-order attacks. It is worth noting that the original objective of an attack/defense model relies on a data distribution $p(\mathbf{x})$, typically in the form of risk maximization/minimization, e.g., $\max / \min \; \mathbb{E}_{p(\mathbf{x})} L(\mathbf{x})$, with $p(\mathbf{x})$ some unknown data distribution and $L(\cdot)$ a loss function. However, since PGD generates attack samples independently for each data sample based on $L(\cdot)$, the procedure does not necessarily lead to good generalization in terms of risk optimization. In this paper, we achieve this goal by proposing distributionally adversarial attack (DAA), a framework that solves for an optimal adversarial-data distribution: a perturbed distribution that satisfies the $\ell_\infty$ constraint but deviates from the original data distribution to increase the generalization risk maximally. Algorithmically, DAA performs optimization on the space of potential data distributions, which introduces direct dependency between all data points when generating adversarial samples. DAA is evaluated by attacking state-of-the-art defense models, including the adversarially-trained models provided by MIT Madry Lab. Notably, DAA ranks first on Madry Lab's white-box leaderboards, reducing the accuracy of their secret MNIST model to 88.56% (with $\ell_\infty$ perturbations of $\epsilon = 0.3$) and the accuracy of their secret CIFAR model to 44.71% (with $\ell_\infty$ perturbations of $\epsilon = 8.0$). Code for the experiments is released at https://github.com/tianzheng4/Distributionally-Adversarial-Attack.

## Introduction

Recent years have witnessed widespread use of deep neural networks (DNNs), achieving remarkable performance on different machine-learning tasks, such as object detection and recognition (Krizhevsky, Sutskever, and Hinton 2012), strategy optimization (Silver et al. 2016), and natural language processing (Cho et al. 2014). At the same time, DNNs have also been proved vulnerable to adversarial samples: data that are indistinguishable from natural samples by humans but carry maliciously-embedded perturbations. Such maliciously perturbed samples can cause DNNs to make predictions different from the ground truth with high confidence. Various first-order algorithms have been proposed to generate adversarial samples, such as the Fast Gradient Sign Method (FGSM) (Szegedy et al. 2013), Projected Gradient Descent (PGD) (Kurakin, Goodfellow, and Bengio 2016b), and the Carlini & Wagner (CW) attack (Carlini and Wagner 2017). Among all these first-order attacks, Madry et al. (2017) suggest that PGD is a universal attack algorithm, and that the classifier adversarially trained by PGD is robust against a wide range of first-order attacks. Carlini et al. (2017) strengthen this hypothesis by demonstrating that PGD-adversarial training provably succeeds at increasing the distortion required to construct adversarial examples by a factor of 4.2.
Moreover, among all the white-box defenses that appeared in ICLR 2018 and CVPR 2018, PGD-adversarial training is the only empirical defense that has not been further attacked (Athalye, Carlini, and Wagner 2018; Athalye and Carlini 2018). Despite the success of PGD, one notable limitation is that its adversarial samples are not globally optimal, in the sense that adversarial samples are generated independently for each data sample. From a machine-learning perspective, this lacks a statistical interpretation in terms of risk maximization, i.e., PGD is not a training procedure, and thus the underlying optimization problem is not mathematically clear.

In this paper, we provide a distribution-optimization view of PGD and propose distributionally adversarial attack (DAA), a new concept of adversarial attack that is performed on the space of probability measures (e.g., unknown data distributions). In DAA, the problem is formulated as optimizing an adversarial data distribution (from which adversarial samples are drawn) such that the generalization risk increases maximally. This generalizes PGD by lifting the optimization onto the space of probability measures, and can be interpreted as a Wasserstein gradient flow (WGF), a framework for distribution optimization that always decreases an energy functional over time. The energy functional reflects the data manifold in adversarial attack, and is designed in correspondence with the original objective of a DNN. When using testing data to approximate the unknown data manifold, DAA leads to a variant of the standard PGD in which all adversarial samples are explicitly dependent.

DAA is extensively evaluated on four datasets, including MNIST, Fashion-MNIST (FMNIST), CIFAR10, and ImageNet, by attacking their state-of-the-art defense models. We show that a single run of DAA with 0.3/1.0 $\ell_\infty$ perturbations can reduce the accuracy of Madry Lab's MNIST model (Madry et al. 2017) to approximately 90.5%, outperforming a single run of PGD, which reduces the accuracy to approximately 92.5%. Furthermore, a single run of DAA with 0.2/1.0 $\ell_\infty$ perturbations reduces the accuracy of our adversarially trained FMNIST model to 67.6%, outperforming a single run of PGD, which reduces it to 71.5%. Similarly, a single run of DAA reduces the accuracy of Madry Lab's CIFAR10 model to 44.98% (8/255 $\ell_\infty$ perturbations) and the accuracy of the ensemble adversarially trained ImageNet model (Kurakin, Goodfellow, and Bengio 2016b) to 16.43% (with only 2/255 $\ell_\infty$ perturbations). With 50 random restarts, DAA reduces the accuracy of Madry Lab's public and secret MNIST models to 88.7% and 88.56%, respectively, whereas DAA with 10 random restarts reduces the accuracy of Madry Lab's secret CIFAR10 model to 44.71%. Both settings outperform the other attack algorithms listed on MIT Madry Lab's white-box leaderboards.¹

¹ https://github.com/MadryLab/mnist_challenge; https://github.com/MadryLab/cifar10_challenge

## Preliminaries

We introduce necessary background in this section, including Wasserstein gradient flows and adversarial attack/defense methods.

### Wasserstein Gradient Flows

**Wasserstein Metric Space** The Wasserstein metric is a distance defined between probability measures (distributions) on the Wasserstein metric space. Formally, let $\mathcal{P}_2(\Omega)$ denote the collection of all probability measures on $\Omega \subset \mathbb{R}^r$ with finite second moment. The 2nd-order Wasserstein distance between two probability measures in $\mathcal{P}_2(\Omega)$ is defined as:

$$W_2^2(\mu, \nu) \triangleq \inf_{\gamma} \Big\{ \int_{\Omega \times \Omega} \|\mathbf{x} - \mathbf{x}'\|_2^2 \, d\gamma(\mathbf{x}, \mathbf{x}') : \gamma \in \Gamma(\mu, \nu) \Big\}, \qquad (1)$$

where $\Gamma(\mu, \nu)$ denotes the collection of all joint probability measures on $\Omega \times \Omega$ whose two marginals equal $\mu$ and $\nu$. One way to understand the motivation of the above definition is to consider the optimal transport problem, where one wants to transform elements in the domain of $\mu$ to $\nu$ with minimum cost. The cost to transport $\mathbf{x}$ in $\mu$ to $\mathbf{x}'$ in $\nu$ is quantified by $\|\mathbf{x} - \mathbf{x}'\|_2^2$. If $\mu$ is absolutely continuous w.r.t. the Lebesgue measure, there exists a unique optimal transport plan, i.e., a mapping $T: \mathbb{R}^r \rightarrow \mathbb{R}^r$, that transforms elements in $\mu$ to elements in $\nu$. The Wasserstein distance can then be equivalently reformulated as $W_2^2(\mu, \nu) = \inf_T \big\{ \int_\Omega \|\mathbf{x} - T(\mathbf{x})\|_2^2 \, d\mu(\mathbf{x}) \big\}$.
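To make definition (1) concrete, here is a small worked special case (our own illustration, not taken from the paper): for two Dirac measures the coupling is forced, so $W_2$ reduces to the Euclidean distance between the atoms, and for equal-weight empirical measures the optimal coupling is a minimum-cost matching.

```latex
% Illustration of (1) on discrete measures (added example; not from the paper).
% Dirac measures: the only coupling of \delta_a and \delta_b is \delta_{(a,b)}, hence
W_2^2(\delta_{\mathbf{a}}, \delta_{\mathbf{b}})
  = \int \|\mathbf{x}-\mathbf{x}'\|_2^2 \, d\delta_{(\mathbf{a},\mathbf{b})}(\mathbf{x},\mathbf{x}')
  = \|\mathbf{a}-\mathbf{b}\|_2^2 .
% Equal-weight empirical measures \mu = \frac{1}{M}\sum_i \delta_{\mathbf{x}^{(i)}} and
% \nu = \frac{1}{M}\sum_i \delta_{\mathbf{x}'^{(i)}}: the infimum is attained at a permutation \sigma,
W_2^2(\mu, \nu)
  = \min_{\sigma \in S_M} \frac{1}{M} \sum_{i=1}^{M} \|\mathbf{x}^{(i)} - \mathbf{x}'^{(\sigma(i))}\|_2^2 .
```

The second identity is the sense in which DAA later treats a minibatch of test points as particles of the data distribution: moving the particles moves the empirical measure, and the Wasserstein cost of that move is a matching cost between old and new particle positions.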
**Wasserstein Gradient Flows** The 2nd-order Wasserstein metric endows $\mathcal{P}_2(\Omega)$ with a Riemannian geometry. Let $\{\mu_t\}_{t \in [0,1]}$ be an absolutely continuous curve on this geometry, with the change between $\mu_t$ and $\mu_{t+h}$ measured by $W_2(\mu_t, \mu_{t+h})$. The change can be reflected by a vector field $v_t(\mathbf{x}) \triangleq \lim_{h \to 0} \frac{T(\mathbf{x}) - \mathbf{x}}{h}$, where $T$ is the optimal transport map from $\mu_t$ to $\mu_{t+h}$. This vector field is regarded as the velocity field of the elements $\mathbf{x}$. Based on the above description, a gradient flow can be defined on $\mathcal{P}_2(\Omega)$ as in Lemma 1.

**Lemma 1** Let $\{\mu_t\}_{t \in [0,1]}$ be an absolutely continuous curve in $\mathcal{P}_2(\Omega)$. Then for a.e. $t \in [0, 1]$, the above vector field $v_t$ defines a gradient flow on $\mathcal{P}_2(\Omega)$ as: $\partial_t \mu_t + \nabla \cdot (v_t \mu_t) = 0$.

Actually, the velocity field $v$ can be derived for optimization of an energy functional $E: \mathcal{P}_2(\Omega) \rightarrow \mathbb{R}$. In this case, it can be shown that $v_t$ has the form $v_t = -\nabla \frac{\delta E}{\delta \mu_t}$ (Ambrosio, Gigli, and Savaré 2008), where $\frac{\delta E}{\delta \mu_t}$ is called the first variation of $E$ at $\mu_t$. Thus the gradient flow on $\mathcal{P}_2(\Omega)$ can be rewritten as:

$$\partial_t \mu_t = -\nabla \cdot (v_t \mu_t) = \nabla \cdot \Big( \mu_t \nabla \frac{\delta E}{\delta \mu_t} \Big). \qquad (2)$$

### Adversarial Sample

**Definition and Notation** In this paper, we only study adversarial samples on neural networks used for classification, with softmax activations in the final layer. We represent such a network as a vector function $\{F_i(\mathbf{x})\}_i$. Given an input $\mathbf{x}$, the network predicts its label as $y = \arg\max_i F_i(\mathbf{x})$. A sample $\mathbf{x}'$ is called an adversarial sample if $\arg\max_i F_i(\mathbf{x}') \neq y$, where $y$ is the true label and $\mathbf{x}'$ is close to the original $\mathbf{x}$ under a certain distance metric.

**Fast Gradient Sign Method (FGSM)** The Fast Gradient Sign Method (FGSM) is a single-step adversarial attack proposed by Szegedy et al. (2013). FGSM performs a single-step update on the original sample $\mathbf{x}$ along the direction of the gradient of a loss function $L(\mathbf{x}, y; \theta)$, usually defined as the cross-entropy between the output of the network and the true label $y$. Formally, FGSM adversarial samples are generated as

$$\mathbf{x}' = \mathrm{clip}_{[0,1]}\big\{\mathbf{x} + \epsilon \cdot \mathrm{sign}(\nabla_{\mathbf{x}} L(\mathbf{x}, y; \theta))\big\}, \qquad (3)$$

where $\epsilon$ controls the maximum $\ell_\infty$ perturbation of the adversarial samples, and the $\mathrm{clip}_{[a,b]}(\mathbf{x})$ function forces $\mathbf{x}$ to reside in the range $[a, b]$.

**Projected Gradient Descent (PGD)** Projected Gradient Descent (PGD) is an iterative variant of FGSM. In each iteration, PGD follows the update rule

$$\mathbf{x}'_{l+1} = \Pi_{\mathrm{clip}}\big\{\mathrm{FGSM}(\mathbf{x}'_l)\big\}, \qquad (4)$$

where $\mathrm{FGSM}(\mathbf{x}'_l)$ represents an FGSM update of $\mathbf{x}'_l$ as in (3), and the outer clip function $\Pi_{\mathrm{clip}}$ keeps $\mathbf{x}'_{l+1}$ within a predefined perturbation range. PGD can also be interpreted as an iterative algorithm to solve the following problem:

$$\max_{\mathbf{x}': \|\mathbf{x}' - \mathbf{x}\|_\infty < \alpha} L(\mathbf{x}', y; \theta). \qquad (5)$$

Madry et al. (2017) observe that the local maxima of the cross-entropy loss found by PGD with $10^5$ random starts are distinct but all have similar loss values, for both normally- and adversarially-trained networks. Inspired by this concentration phenomenon, they propose that PGD is a universal adversary among all first-order adversaries, i.e., attacks that only rely on first-order information.
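To make the update rules (3)-(5) concrete, the following is a minimal PyTorch-style sketch of FGSM and $\ell_\infty$-bounded PGD with a random start. It is an illustrative sketch, not the reference implementation of the cited papers; the `model` handle, the cross-entropy loss, and the step sizes are assumptions.

```python
# Minimal sketch of FGSM (Eq. 3) and l_inf-bounded PGD (Eq. 4-5).
# Illustrative only: `model` is any classifier returning logits; hyperparameters are placeholders.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.3):
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    # single signed-gradient step, then clip to the valid pixel range (Eq. 3)
    return (x + eps * grad.sign()).clamp(0.0, 1.0).detach()

def pgd(model, x, y, eps=0.3, step=0.01, iters=40, random_start=True):
    x_adv = x.clone()
    if random_start:
        x_adv = (x_adv + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(iters):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()                     # FGSM-style ascent step
            x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)  # project onto the l_inf ball (Pi_clip)
            x_adv = x_adv.clamp(0.0, 1.0)                          # keep pixels in [0, 1]
    return x_adv.detach()
```

Note that each test point is attacked independently here; DAA below changes exactly this step by coupling the points through a kernel term.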
**Momentum-based Iterative Fast Gradient Sign Method (MI-FGSM)** MI-FGSM is derived from the Iterative FGSM (Kurakin, Goodfellow, and Bengio 2016a), integrating a momentum term into the iterative process of generating adversarial samples (Dong et al. 2018). Given $g_0 = 0$ and

$$g_{l+1} = \mu \cdot g_l + \frac{\nabla_{\mathbf{x}} L(\mathbf{x}'_l, y; \theta)}{\|\nabla_{\mathbf{x}} L(\mathbf{x}'_l, y; \theta)\|_1},$$

the iterative update of MI-FGSM can be expressed as

$$\mathbf{x}'_{l+1} = \mathbf{x}'_l + \epsilon \cdot \mathrm{sign}(g_{l+1}). \qquad (6)$$

Based on MI-FGSM, we further derive its PGD variant, called Momentum PGD, with iterative updates

$$\mathbf{x}'_{l+1} = \Pi_{\mathrm{clip}}\big\{\mathbf{x}'_l + \epsilon \cdot \mathrm{sign}(g_{l+1})\big\}. \qquad (7)$$

**Remark 2** Momentum PGD is a stronger attack than MI-FGSM, since it can proceed for more steps with an appropriate step size $\epsilon$ ($\epsilon$ cannot be too small, otherwise the adversarial samples are very likely to get trapped in bad local maxima). The clip function $\Pi_{\mathrm{clip}}$ ensures that the adversarial samples keep the predefined perturbation size after the extra iterations.

### Adversarial Training

**Definition and Notation** Adversarial training is a defense method against adversarial samples first proposed by Goodfellow, Shlens, and Szegedy (2014). The approach attempts to improve the robustness of a network by training it together with adversarial samples. Formally, adversarial training solves the following min-max problem:

$$\min_\theta \; \max_{\mathbf{x}': D(\mathbf{x}, \mathbf{x}') < \alpha} L(\mathbf{x}', y; \theta), \qquad (8)$$

where $D(\mathbf{x}, \mathbf{x}')$ represents a certain distance metric between $\mathbf{x}$ and $\mathbf{x}'$. The inner maximization problem is equivalent to constructing the strongest adversarial samples. If the $\ell_\infty$ distance is employed as the metric $D(\mathbf{x}, \mathbf{x}')$, the inner maximization is equivalent to the adversarial problem solved by PGD, i.e., (5). The outer minimization is the standard training procedure to minimize the loss of a DNN. Recent work shows that this straightforward method is one of the most effective defenses against adversarial samples (Madry et al. 2017; Tramèr et al. 2017; Cai et al. 2018).

**PGD Adversarial Training** The fact that the PGD adversary is a universal first-order adversary implies that robustness against PGD should yield robustness against all first-order adversaries (Madry et al. 2017). Hence, Madry et al. (2017) propose to adversarially train a robust classifier using the PGD attack. Specifically, in each training iteration, PGD is applied to generate a minibatch of adversarial samples to update the current network. In the training process, a steady decrease of the training loss is usually observed, indicating the effectiveness of this training paradigm. Experimental results show that PGD-adversarially-trained models are robust against the PGD attack as well as other strong attacks such as the CW attack (Carlini and Wagner 2017). Empirically, we also found that their MNIST and CIFAR10 models are indeed robust to a wide range of existing first-order attacks, including DeepFool (Moosavi-Dezfooli, Fawzi, and Frossard 2016) and the Jacobian-based Saliency Map Attack (JSMA) (Papernot et al. 2016), as long as the adversarial perturbations are $\ell_\infty$-bounded.
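For readers who want the min-max objective (8) in executable form, here is a hedged sketch of one epoch of PGD adversarial training; it reuses the `pgd` helper from the sketch above, and `model`, `loader`, `optimizer`, and the 0.3 budget are illustrative placeholders rather than the exact configuration of Madry et al. (2017).

```python
# Hedged sketch of PGD adversarial training (Eq. 8): the inner maximization is a PGD attack
# on each minibatch, the outer minimization is an ordinary optimizer step on the attacked batch.
# Reuses the `pgd` helper sketched above; names and hyperparameters are illustrative placeholders.
import torch.nn.functional as F

def pgd_adversarial_train_epoch(model, loader, optimizer, eps=0.3):
    model.train()
    for x, y in loader:
        x_adv = pgd(model, x, y, eps=eps)        # inner max: craft adversarial minibatch
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)  # outer min: train on the adversarial minibatch
        loss.backward()
        optimizer.step()
```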
**Ensemble Adversarial Training** To scale up adversarial training to ImageNet-scale datasets, Kurakin, Goodfellow, and Bengio (2016b) adversarially train a model using a fast single-step attack method. However, their adversarially-trained model is vulnerable to multi-step white-box attacks (Kurakin, Goodfellow, and Bengio 2016b). Tramèr et al. (2017) further demonstrate that the model of Kurakin, Goodfellow, and Bengio (2016b) is vulnerable to black-box adversaries. To tackle this problem, Tramèr et al. (2017) propose a training methodology that incorporates adversarial samples transferred from other pre-trained models, called Ensemble Adversarial Training (EAT). Intuitively, this approach increases the diversity of adversarial samples used for adversarial training. In their experiments, models trained by EAT exhibit robustness against adversarial samples transferred from other hold-out models under various single-step and multi-step attacks.

## Distributionally Adversarial Attack

We first interpret distributionally adversarial attack as a WGF, then propose a specific energy functional to construct a WGF for better adversarial-sample generation, and finally propose particle-approximation methods to solve the DAA problem, leading to a variant of the standard PGD.

### Adversarial Attack as WGFs

For a given DNN, the landscape of a loss function $L(\mathbf{x}, y; \theta)$ constitutes a geometric structure indexed by the input images $\mathbf{x}$. From a probabilistic perspective, under regularity conditions, it is natural to define a probability distribution for each input $\mathbf{x}$ based on the loss-function landscape, i.e.,

$$p(\mathbf{x}, y; \theta) \propto \exp\{-L(\mathbf{x}, y; \theta)\}. \qquad (9)$$

Since $y$ is deterministic given $\mathbf{x}$, we omit $y$ and write $p(\mathbf{x}, y; \theta)$ as $p(\mathbf{x}; \theta)$ in the following for simplicity. Based on this, we explain our intuition for generalizing adversarial attack to the space of data distributions. First note that the objective of an adversarial attack (5) can be equivalently rewritten as

$$\mathbf{x}^* = \arg\max_{\mathbf{x}': D(\mathbf{x}, \mathbf{x}') < \alpha}\big\{L(\mathbf{x}', y; \theta) - L(\mathbf{x}, y; \theta)\big\}, \qquad (10)$$

which describes the increase of the loss induced by an adversarial sample. On the space of probability measures, the loss is instead described by an energy functional $E(\mu)$, with its minimum assumed to be attained at $p(\mathbf{x}; \theta)$. Consequently, instead of finding an optimal adversarial sample $\mathbf{x}'$ for each $\mathbf{x}$, DAA tries to find an optimal adversarial-data distribution $\mu^*$ such that $\mu^*$ is close to $p(\mathbf{x}; \theta)$ but increases $E(\cdot)$ maximally, i.e.,

$$\mu^* = \arg\max_{\mu: W_2(\mu, p(\mathbf{x};\theta)) < \alpha}\big\{E(\mu) - E(p)\big\} = \arg\max_{\mu: W_2(\mu, p(\mathbf{x};\theta)) < \alpha} E(\mu). \qquad (11)$$

**Theorem 3** The solution $\mu^*$ of (11) is equivalently described by the following PDE:

$$\partial_t \mu_t = -\nabla_{\mathbf{x}} \cdot \Big( \mu_t \nabla_{\mathbf{x}} \frac{\delta E}{\delta \mu_t}(\mu_t) \Big), \qquad (12)$$

with $\mu^* = \mu_{t^*}$, where $t^*$ is the terminal time of the flow initialized at $\mu_0(\mathbf{x}) = p(\mathbf{x}; \theta)$ and constrained by $W_2(\mu_{t^*}(\mathbf{x}), p(\mathbf{x}; \theta)) < \alpha$.

### Energy Functional

It is crucial to define an appropriate energy functional, as it directly affects the behavior of the adversarial samples. Recent studies (Song et al. 2017; Ma et al. 2018) show that adversarial samples/subspaces mainly lie in low-probability regions of the original data distribution, so we expect the adversarial distribution to deviate from the original distribution through optimization over the energy functional. Besides, the energy functional should also be simple enough to possess a unique solution, i.e., it should be convex w.r.t. $\mu$ on the space of probability measures. We therefore define a new energy functional as

$$E(\mu) \triangleq \int L(\mathbf{x}, y; \theta)\, d\mu + c \cdot \mathrm{KL}(\mu \,\|\, p), \qquad (13)$$

where $L(\mathbf{x}, y; \theta)$ is the loss of the system, $\mathrm{KL}(\mu \| p)$ is the KL-divergence between the adversarial distribution $\mu$ and the optimal data distribution $p$, and $c$ is a hyperparameter balancing the two terms. Intuitively, maximizing (13) increases the individual losses in addition to the deviation between the adversarial and original distributions. Note that the energy functional (13) is still convex w.r.t. $\mu$, maintaining the optimality condition and making the problem easier to solve.
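Before turning to particle approximations, it may help to write out the first variation that drives the flow in (12). The short derivation below is our own bookkeeping under the definition $p(\mathbf{x};\theta) \propto \exp\{-L(\mathbf{x}, y;\theta)\}$ from (9), not a passage reproduced from the paper, but it is consistent with the $(1+c)$ factor that appears later in (19).

```latex
% Sketch (added for reference): first variation of the energy functional (13),
% using p(x;\theta) \propto \exp\{-L(x,y;\theta)\} from (9); additive constants are dropped.
\frac{\delta E}{\delta \mu}(\mathbf{x})
  = L(\mathbf{x}, y; \theta) + c \bigl( \log \mu(\mathbf{x}) - \log p(\mathbf{x}; \theta) + 1 \bigr)
  = (1 + c)\, L(\mathbf{x}, y; \theta) + c \log \mu(\mathbf{x}) + \mathrm{const},
\qquad
\nabla_{\mathbf{x}} \frac{\delta E}{\delta \mu}(\mathbf{x})
  = (1 + c)\, \nabla_{\mathbf{x}} L(\mathbf{x}, y; \theta) + c\, \nabla_{\mathbf{x}} \log \mu(\mathbf{x}).
% The (1+c)\nabla L part is tractable; the \nabla_x \log\mu term depends on the unknown
% density \mu itself, which is what the particle approximations below are designed to handle.
```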
### Adversarial-Distribution Optimization and Adversarial-Sample Generation

Note that a closed-form solution of (12) is infeasible given the energy functional defined in (13). Following standard methods such as those in (Chen et al. 2018), we adopt particle approximation to solve (12). The idea is to approximate $\mu$ with a set of $M$ particles $\{\mathbf{x}^{(i)}\}_{i=1}^{M}$ as $\mu \approx \frac{1}{M}\sum_{i=1}^{M}\delta_{\mathbf{x}^{(i)}}$, where $\delta_{\mathbf{x}}$ is a delta function with a spike at $\mathbf{x}$. Consequently, solving for the optimal $\mu$ corresponds to optimizing the particles as adversarial samples from the adversarial distribution. In the following, based on (Chen et al. 2018), we investigate two methods for particle approximation: the Lagrangian blob method and the discrete-gradient-flow method.

**Lagrangian Blob Method** The idea is to use particle approximation directly in the original problem (12). Specifically, define $v_t \triangleq \nabla_{\mathbf{x}} \frac{\delta E}{\delta \mu_t}(\mu_t)$. According to (Carrillo, Craig, and Patacchini 2017), $v_t$ is interpreted as the velocity of a particle in the gradient flow. Consequently, Lagrangian blob methods evolve particles on a grid with time-spacing $h$, following the velocity $v_t$ (Carrillo, Craig, and Patacchini 2017). Thus solving the WGF (12) is equivalent to evolving the particles along their velocities as $d\mathbf{x}^{(i)}/dt = v_t(\mathbf{x}^{(i)})$. To calculate $v_t$, we substitute the form of $E(\mu)$ in (13) into $v_t$. First note that, under the H-Wasserstein distance metric defined by (Liu et al. 2017), the kernel-smoothed contribution of the KL term can be written as

$$\mathbb{E}_{\mathbf{x}' \sim \mu_t}\big[\nabla_{\mathbf{x}'}[p(\mathbf{x}')K(\mathbf{x}', \mathbf{x})]/p(\mathbf{x}')\big] = \mathbb{E}_{\mathbf{x}' \sim \mu_t}\big[K(\mathbf{x}', \mathbf{x})\,\nabla_{\mathbf{x}'}\log p(\mathbf{x}') + \nabla_{\mathbf{x}'}K(\mathbf{x}', \mathbf{x})\big], \qquad (14)$$

where $K(\cdot, \cdot)$ is a kernel function such as the RBF kernel. For the first term on the right of (13), we have

$$\nabla_{\mathbf{x}} \frac{\delta}{\delta \mu}\int L(\mathbf{x}, y; \theta)\, d\mu = \nabla_{\mathbf{x}} L(\mathbf{x}, y; \theta). \qquad (15)$$

Combining (14) and (15), one ends up solving the following ordinary differential equation:

$$\frac{d\mathbf{x}}{dt} = \nabla_{\mathbf{x}} L(\mathbf{x}, y; \theta) + c\, \mathbb{E}_{\mathbf{x}' \sim \mu_t}\big[K(\mathbf{x}', \mathbf{x})\,\nabla_{\mathbf{x}'}\log p(\mathbf{x}') + \nabla_{\mathbf{x}'}K(\mathbf{x}', \mathbf{x})\big]. \qquad (16)$$

Considering a discrete approximation of $\mu_t$ with particles, (16) can be solved numerically as

$$\mathbf{x}^{(i)}_{\ell+1} = \mathbf{x}^{(i)}_{\ell} + \epsilon_\ell \Big\{ \nabla_{\mathbf{x}^{(i)}_\ell} L(\mathbf{x}^{(i)}_\ell, y^{(i)}; \theta) + \frac{c}{M}\sum_{j=1}^{M}\big[K(\mathbf{x}^{(i)}_\ell, \mathbf{x}^{(j)}_\ell)\,\nabla_{\mathbf{x}^{(j)}_\ell}\log p(\mathbf{x}^{(j)}_\ell) + \nabla_{\mathbf{x}^{(j)}_\ell}K(\mathbf{x}^{(i)}_\ell, \mathbf{x}^{(j)}_\ell)\big] \Big\}, \qquad (17)$$

where, in contrast to the continuous case, we use $\ell$ to index the number of steps for the discretized particles.

**Discrete Gradient Flows** The discrete-gradient-flow (DGF) approximation of (12) consists of a sequence of sub-optimization problems whose composition approximates $\mu_t$, i.e., $\mu_t \approx \mu_L \circ \cdots \circ \mu_1$, where $L = t/h$ and $\mu_\ell$ is the solution of the following functional optimization problem:²

$$\mu_\ell = \arg\max_{\mu \in \mathcal{P}_2(\mathbb{R}^r)} \Big\{ E(\mu) - \frac{1}{2h} W_2^2(\mu_{\ell-1}, \mu) \Big\}. \qquad (18)$$

² Note that the difference between (18) and the original DGF formulation is that the original min is replaced by max, because the flow direction is reversed in adversarial-distribution optimization.

Again, (18) is solved by particle approximation, with gradient ascent on the particles. To this end, we need gradients for the two terms on the RHS of (18). Inspired by (Chen et al. 2018), we decompose the two terms and re-organize them as

$$E_1 \triangleq L(\mathbf{x}^{(i)}_\ell, y^{(i)}; \theta) - c \log p(\mathbf{x}^{(i)}_\ell), \qquad E_2 \triangleq -c\, \mathbb{E}_\mu[\log \mu] + \frac{1}{2h} W_2^2(\mu, \mu_{\ell-1}).$$

The gradient of the first term can be easily calculated as

$$\frac{\partial E_1}{\partial \mathbf{x}^{(i)}_\ell} = \nabla_{\mathbf{x}^{(i)}_\ell} L(\mathbf{x}^{(i)}_\ell, y^{(i)}; \theta) - c\, \nabla_{\mathbf{x}^{(i)}_\ell}\log p(\mathbf{x}^{(i)}_\ell) = (1 + c)\, \nabla_{\mathbf{x}^{(i)}_\ell} L(\mathbf{x}^{(i)}_\ell, y^{(i)}; \theta). \qquad (19)$$

For the $E_2$ term, we apply a similar idea as (Chen et al. 2018) by introducing Lagrangian multipliers, resulting in

$$\frac{\partial E_2}{\partial \mathbf{x}^{(i)}_\ell} \approx c \Big[ \sum_j 2 u_i v_j\, e^{-d_{ij}/\lambda}\,\big(\mathbf{x}^{(i)} - \mathbf{x}^{(j)}_{k-1}\big) \Big], \qquad (20)$$

where $\lambda$, $u_i$, and $v_j$ are Lagrangian multipliers and $d_{ij} = \|\mathbf{x}^{(i)} - \mathbf{x}^{(j)}_{k-1}\|_2^2$. For the sake of simplicity, we do not update $u_i$ and $v_j$, but instead use a fixed scaling factor $\gamma$ to approximate the product $u_i v_j$.

**Adversarial-Sample Generation** Once an adversarial distribution is learned, adversarial samples, e.g., $\mathbf{x}^{(i)}_\ell$, can be obtained by drawing samples from it.
However, adversarial samples typically need to satisfy certain constraints, e.g., being $\ell_\infty$-bounded. We propose two adversarial-sample generation methods based on the particle-optimization formulas above, named DAA-BLOB and DAA-DGF.

DAA-BLOB substitutes the gradient used in PGD with the gradient derived in (17). Formally, in each iteration, $\mathbf{x}^{(i)}$ is updated by

$$\mathbf{x}^{(i)}_{l+1} = \Pi_{\mathrm{clip}}\Big\{\mathbf{x}^{(i)}_l + \epsilon \cdot \mathrm{sign}\Big(\nabla_{\mathbf{x}^{(i)}_l} L(\mathbf{x}^{(i)}_l, y^{(i)}; \theta) + \frac{c}{M}\sum_{j=1}^{M}\big[K(\mathbf{x}^{(i)}_l, \mathbf{x}^{(j)}_l)\,\nabla_{\mathbf{x}^{(j)}_l} L(\mathbf{x}^{(j)}_l, y^{(j)}; \theta) + \nabla_{\mathbf{x}^{(j)}_l}K(\mathbf{x}^{(i)}_l, \mathbf{x}^{(j)}_l)\big]\Big)\Big\}. \qquad (21)$$

By contrast, DAA-DGF substitutes the gradient used in PGD with a combination of (19) and (20). Specifically, in each iteration, $\mathbf{x}^{(i)}$ is updated by

$$\mathbf{x}^{(i)}_{l+1} = \Pi_{\mathrm{clip}}\Big\{\mathbf{x}^{(i)}_l + \epsilon \cdot \mathrm{sign}\Big(\nabla_{\mathbf{x}^{(i)}_l} L(\mathbf{x}^{(i)}_l, y^{(i)}; \theta) - \frac{2\gamma c}{1+c}\Big[\sum_j e^{-d_{ij}/\lambda}\,\big(\mathbf{x}^{(i)}_l - \mathbf{x}^{(j)}_l\big)\Big]\Big)\Big\}. \qquad (22)$$

**Optimization by Data Subsampling** In theory, the adversarial distribution $\mu$ to be optimized corresponds to a data manifold, so a good discrete approximation to $\mu$ would use all the testing samples. In practice, however, it is computationally infeasible to update the particles following (21) or (22) in this way, as the complexity is $O(M^2)$ for each particle update. To mitigate this issue, we propose a subsampling method to update testing samples in an unbiased and computationally feasible way: first, the testing samples are randomly permuted and divided into a finite number of minibatches; then the samples in each minibatch are updated sequentially for a certain number of steps. This procedure is iterated for multiple rounds. The full algorithm is shown in Algorithm 1.

**Algorithm 1: DAA (untargeted attack)**

- Require: a classifier with loss function $L(\mathbf{x}_i, y_i; \theta)$; testing dataset $\{\mathbf{x}_i, y_i\}_{i=1}^{N}$; minibatch size $M$; step size $\epsilon$; predefined final perturbation size $\alpha$; total iterations $L$; rounds $R$; hyperparameter $c$ or $\frac{2\gamma c}{1+c}$.
- Initialization. Random start: $\mathbf{x}^{i}_0 = \mathbf{x}^{i} + \gamma_i$ with $\gamma_i \sim U(-\alpha, \alpha)$; no random start: $\mathbf{x}^{i}_0 = \mathbf{x}^{i}$.
- For $r = 0$ to $R - 1$:
  - Randomly permute the testing samples.
  - For $k = 0$ to $L/R - 1$, set $l = r \cdot L/R + k$:
    - For $j = 0$ to $N/M - 1$: follow (21) or (22) to update the minibatch $\{\mathbf{x}^{i}_l, y^{i}\}$, $i = jM + 1, \ldots, (j+1)M$, where $\Pi_{\mathrm{clip}}(\cdot) = \mathrm{clip}_{[0,1]}\big(\mathrm{clip}_{[\mathbf{x}^i - \alpha,\, \mathbf{x}^i + \alpha]}(\cdot)\big)$ and $[0, 1]$ is the pixel value range (it may also be $[-1, 1]$ or $[0, 255]$).

**Connection with PGD** There are two situations in which the proposed DAA framework reduces to PGD. The first is when $c = 0$: the second terms of both (21) and (22) become $0$, so the two gradients degenerate to the gradient used in PGD. The second is when $M = 1$, meaning that the data manifold is approximated by only one sample. In this case, the second term of the gradient used in DAA-DGF becomes $0$, and the second term of the gradient used in DAA-BLOB reduces to $c\,[K(\mathbf{x}^{i}_l, \mathbf{x}^{i}_l)\,\nabla_{\mathbf{x}^{i}_l} L(\mathbf{x}^{i}_l, y^{i}; \theta) + \nabla_{\mathbf{x}^{i}_l}K(\mathbf{x}^{i}_l, \mathbf{x}^{i}_l)] = c\,\nabla_{\mathbf{x}^{i}_l} L(\mathbf{x}^{i}_l, y^{i}; \theta)$, so DAA is again equivalent to PGD.
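As a concrete reference, the sketch below applies one DAA-BLOB step (21) to a single minibatch of particles, using the RBF kernel and median-heuristic bandwidth described under Implementation Details. It is our own illustrative implementation, not the authors' released code; the $c/M$ scaling and the sign of the kernel-gradient term simply follow the equations as written above, and `model` and all hyperparameter values are placeholders.

```python
# Hedged sketch of one DAA-BLOB minibatch update (Eq. 21), i.e., the inner step of Algorithm 1.
# Not the authors' released implementation; `model`, eps, alpha, c are illustrative placeholders,
# and the c/M scaling and kernel-gradient sign follow Eq. 21 as reconstructed above.
import math
import torch
import torch.nn.functional as F

def daa_blob_step(model, x_adv, x_nat, y, eps=0.01, alpha=0.3, c=1.1):
    M = x_adv.shape[0]
    x_adv = x_adv.detach().requires_grad_(True)
    # sum over the minibatch so that grad[i] = grad_{x_i} L(x_i, y_i)
    loss = F.cross_entropy(model(x_adv), y, reduction="sum")
    grad, = torch.autograd.grad(loss, x_adv)
    with torch.no_grad():
        flat = x_adv.view(M, -1)
        g = grad.view(M, -1)
        # RBF kernel K(x, x') = exp(-||x - x'||^2 / h), h = med^2 / log M (median heuristic;
        # med^2 approximated by the median of pairwise squared distances, log(M+1) guards M = 1)
        d2 = torch.cdist(flat, flat).pow(2)
        h = d2.flatten().median() / math.log(M + 1)
        K = torch.exp(-d2 / h)
        # kernel-smoothed interaction term of Eq. 21: (c/M) * sum_j [K_ij * grad_j + grad_{x_j} K_ij]
        smooth_grad = K @ g                                               # sum_j K_ij * grad L(x_j)
        grad_K = (2.0 / h) * (K.sum(1, keepdim=True) * flat - K @ flat)   # sum_j grad_{x_j} K(x_i, x_j)
        direction = g + (c / M) * (smooth_grad + grad_K)
        x_new = x_adv + eps * direction.view_as(x_adv).sign()             # signed ascent step
        x_new = torch.max(torch.min(x_new, x_nat + alpha), x_nat - alpha) # Pi_clip: l_inf ball ...
        x_new = x_new.clamp(0.0, 1.0)                                     # ... then valid pixel range
    return x_new.detach()
```

A full DAA attack then wraps this step in the permutation/round/minibatch loops of Algorithm 1, so that different pairs of test points interact across rounds.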
## Experiment Setup

**Datasets and Related Models** The proposed DAA, together with the state-of-the-art methods PGD and Momentum PGD, is evaluated and compared on four standard datasets: MNIST, Fashion-MNIST (FMNIST), CIFAR10, and ImageNet. For MNIST, the attack target is the state-of-the-art PGD-adversarially-trained MNIST model provided by MIT Madry Lab (Madry et al. 2017); the defense architecture is a convolutional neural network (CNN) with two convolutional layers and a fully connected layer. For FMNIST, we adversarially train a model by PGD as the target model; the network architecture consists of four convolutional layers and a fully connected layer with batch normalization. For CIFAR10, Madry Lab's PGD-adversarially-trained CIFAR10 model is adopted as the target model; the network architecture is a residual CNN consisting of five residual units and a fully connected layer. For ImageNet, we adopt the target model in (Kurakin, Goodfellow, and Bengio 2016c), which is an adversarially trained Inception ResNet-v2 model. In addition, we also evaluate DAA on a provable defense model (Wong and Kolter 2018), with code also provided in the GitHub repository linked in the abstract.

**Implementation Details** For all methods involving kernel functions, an RBF kernel $K(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x} - \mathbf{x}'\|_2^2 / h)$ is adopted. The bandwidth is set as $h = \mathrm{med}^2 / \log M$, the same as the kernel used in (Liu and Wang 2016) and (Chen et al. 2018), where med is the median of the pairwise distances between particles. The minibatch size (number of particles) is set to 100-200 for computational feasibility. Our specific settings of the hyperparameters $c$ and $\frac{2\gamma c}{1+c}$ can be found in our GitHub repository (linked in the abstract). It is worth noting that the discrepancy in parameter choices across datasets is caused by the different pixel ranges and network structures used by their classifiers. All experiments are conducted on a single Titan V GPU under a white-box setting, where the adversary has full access to the target model, including its weights.

**Adversarial Perturbation Analysis** To intuitively understand the advantage of DAA over PGD, we plot ten natural samples and their PGD and DAA adversarial counterparts in Figure 1. For these ten samples, DAA successfully attacks the defense model with a 0.3/1.0 $\ell_\infty$ perturbation, whereas PGD with 50 random starts cannot. As shown in Figure 1, the perturbations generated by PGD tend to scatter throughout the images, whereas those of DAA are more structured and concentrate around the target digits.

Figure 1: Comparison between PGD and DAA. DAA tends to generate more structured perturbations.

## Empirical Results

**MNIST** We plot the averaged classification accuracy of Madry Lab's adversarially trained MNIST model under a single run of different attack algorithms in Figure 2. The proposed DAA consistently outperforms the other methods under different levels of $\ell_\infty$ perturbation and two different losses. To test statistical significance, we also conduct paired t-tests between the accuracies reduced by PGD and DAA with random starts. For DAA-BLOB and DAA-DGF, the p-values are almost zero, i.e., 0.0 for DAA-BLOB and 5e-43 for DAA-DGF at the reported precision, suggesting that both methods outperform PGD in a statistical sense.

Figure 2: Averaged classification accuracy of MIT Madry Lab's adversarially trained MNIST model under a single run of different attacks: cross-entropy loss (left), CW loss (right).

In addition, we show the worst-case classification accuracy of Madry Lab's adversarially trained model under PGD, Momentum PGD, and DAA with 50 random restarts in Table 1. DAA-BLOB is able to reduce the classification accuracy to approximately 88.79% (with $c = 1.1$ and a minibatch size of 200), outperforming the attacks listed on Madry Lab's white-box leaderboard.

Table 1: Empirical worst-case accuracy of MIT Madry Lab's secret MNIST model under 200-step attacks with 50 random starts (0.3/1.0 $\ell_\infty$ perturbations). Loss 1: cross-entropy; Loss 2: CW loss.

| Attack | Loss 1 | Loss 2 |
|---|---|---|
| Rand+FGSM | 93.48% | 93.47% |
| PGD | 89.49% | 89.57% |
| Momentum PGD | 89.29% | 89.36% |
| DAA-BLOB | 88.79% | 88.85% |
| DAA-DGF | 88.92% | 89.25% |
**FMNIST** We next plot the classification accuracy of our adversarially trained FMNIST model under a single run of different attack algorithms in Figure 3. Similarly, the proposed DAA-based methods outperform the state-of-the-art PGD method, and the two variants, DAA-BLOB and DAA-DGF, are comparable. Again, the p-values of the t-tests are close to zero, i.e., 0.0 for DAA-BLOB and 3e-27 for DAA-DGF, indicating significant performance differences between PGD and the proposed methods. The worst-case accuracy under white-box PGD, Momentum PGD, and DAA with 10 random starts is listed in Table 2, suggesting the advantage of the proposed DAA framework over both PGD and Momentum PGD.

Figure 3: Averaged classification accuracy of the adversarially trained FMNIST model under a single run of different attacks: cross-entropy loss (left), CW loss (right).

Table 2: Empirical worst-case accuracy of the adversarially trained FMNIST model under 100-step attacks with 10 random starts (0.2/1.0 $\ell_\infty$ perturbations). Loss 1: cross-entropy; Loss 2: CW loss.

| Attack | Loss 1 | Loss 2 |
|---|---|---|
| Rand+FGSM | 77.45% | 77.21% |
| PGD | 68.54% | 68.94% |
| Momentum PGD | 69.72% | 69.51% |
| DAA-BLOB | 65.70% | 66.64% |
| DAA-DGF | 66.04% | 66.60% |

**CIFAR10** The classification accuracies of Madry Lab's adversarially trained model under a single run of PGD, Momentum PGD, and DAA without random start are shown in Table 3. As can be seen, the adversarially trained model only achieves weak robustness at the larger budget: a 100-step PGD with 16/255 $\ell_\infty$ perturbations reduces the accuracy to 14.59%, while DAA only reduces it further to 14.43%. We conjecture this is because the data are too complex and sparse, so the particle approximation based on testing samples in DAA poorly represents the true data manifold; as a result, the learned adversarial samples behave similarly to those of PGD. This argument is also supported by Recht et al. (2018), who show that existing high-accuracy CIFAR10 classifiers do not generalize well to a truly unseen CIFAR10 test set.

Table 3: Empirical worst-case accuracy of MIT Madry Lab's adversarially trained CIFAR10 model under a single run of 100-step attacks without random start (8/255 and 16/255 $\ell_\infty$ perturbations). Loss 1: cross-entropy; Loss 2: CW loss.

| Attack | 8/255, Loss 1 | 8/255, Loss 2 | 16/255, Loss 1 | 16/255, Loss 2 |
|---|---|---|---|---|
| Rand+FGSM | 55.63% | 55.05% | 38.80% | 37.78% |
| PGD | 45.09% | 46.27% | 14.59% | 16.06% |
| Momentum PGD | 45.86% | 46.76% | 17.73% | 18.70% |
| DAA-BLOB | 44.98% | 46.30% | 14.43% | 16.05% |
| DAA-DGF | 45.07% | 46.31% | 14.52% | 16.06% |

**ImageNet** We also evaluate the ensemble adversarially trained Inception ResNet-v2 model under 50-step targeted Rand+FGSM and DAA attacks, using the least-likely class as the target. Our experiments show that Rand+FGSM with 2/255 $\ell_\infty$ perturbations (approximately 0.0157/2.0 $\ell_\infty$ perturbations after normalization) can reduce the accuracy of the Inception model to approximately 70% (Kurakin, Goodfellow, and Bengio 2016b), while 50-step DAA dramatically reduces it to 16.43%. This indicates that the ensemble adversarially trained ImageNet model is still vulnerable to well-designed iterative attacks such as our DAA.

## Discussion

To our knowledge, no previous first-order $\ell_\infty$ attack algorithm could truly outperform PGD under the white-box setting. Even for MI-FGSM, which won the NIPS 2017 competition under a black-box setting, we found that its PGD variant, which is stronger than MI-FGSM, cannot outperform standard PGD under the white-box setting, let alone MI-FGSM itself. In this paper, we generalize PGD to the space of data distributions. Our theoretical derivations and experiments validate the effectiveness of our framework.
To the best of our knowledge, the proposed DAA framework is the first first-order $\ell_\infty$ attack algorithm that can outperform PGD, especially on robust adversarially trained models. To further attack those adversarially trained models with small $\ell_\infty$ perturbations, we may have to involve higher-order information, which is usually very computationally expensive. In practice, those $\ell_\infty$ adversarially trained models can also be further attacked by perturbing a few pixels with large $\ell_\infty$ perturbations, which still yields a small $\ell_1$ or $\ell_2$ distance (Sharma and Chen 2018); however, such a change sometimes even leads to misclassification by humans. Moreover, it is unfair to compare an $\ell_1$ or $\ell_2$ attack with an $\ell_\infty$ attack (PGD) on $\ell_\infty$ adversarially trained models.

## Conclusion

We generalize PGD to the space of data distributions by learning an adversarial data distribution that maximally increases the generalization risk of a model. To solve the adversarial-distribution problem, we define a new energy functional to better reflect the discriminative data manifold in the WGF framework. By adopting particle approximation, adversarial samples can be generated by solving the corresponding WGF problems, leading to an algorithm closely related to the standard PGD method. Extensive evaluations show that our distributionally adversarial attack outperforms PGD and Momentum PGD, achieving state-of-the-art attack results on adversarially trained defense models and a provable defense model that had demonstrated notable robustness against first-order $\ell_\infty$ attacks.

## References

Ambrosio, L.; Gigli, N.; and Savaré, G. 2008. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media.

Athalye, A., and Carlini, N. 2018. On the robustness of the CVPR 2018 white-box adversarial example defenses. arXiv preprint arXiv:1804.03286.

Athalye, A.; Carlini, N.; and Wagner, D. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420.

Cai, Q.-Z.; Du, M.; Liu, C.; and Song, D. 2018. Curriculum adversarial training. arXiv preprint arXiv:1805.04807.

Carlini, N., and Wagner, D. 2017. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), 39-57. IEEE.

Carlini, N.; Katz, G.; Barrett, C.; and Dill, D. L. 2017. Ground-truth adversarial examples. CoRR abs/1709.10207.

Carrillo, J. A.; Craig, K.; and Patacchini, F. S. 2017. A blob method for diffusion. arXiv preprint arXiv:1709.09195.

Chen, C.; Zhang, R.; Wang, W.; Li, B.; and Chen, L. 2018. A unified particle-optimization framework for scalable Bayesian sampling. arXiv preprint arXiv:1805.11659.

Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Dong, Y.; Liao, F.; Pang, T.; Su, H.; Zhu, J.; Hu, X.; and Li, J. 2018. Boosting adversarial attacks with momentum. arXiv preprint.

Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097-1105.

Kurakin, A.; Goodfellow, I.; and Bengio, S. 2016a. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533.
Kurakin, A.; Goodfellow, I.; and Bengio, S. 2016b. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236.

Kurakin, A.; Goodfellow, I. J.; and Bengio, S. 2016c. Adversarial machine learning at scale. CoRR abs/1611.01236.

Liu, Q., and Wang, D. 2016. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, 2378-2386.

Liu, Y.; Ramachandran, P.; Liu, Q.; and Peng, J. 2017. Stein variational policy gradient. CoRR abs/1704.02399.

Ma, X.; Li, B.; Wang, Y.; Erfani, S. M.; Wijewickrema, S.; Houle, M. E.; Schoenebeck, G.; Song, D.; and Bailey, J. 2018. Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613.

Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.

Moosavi-Dezfooli, S.-M.; Fawzi, A.; and Frossard, P. 2016. DeepFool: A simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2574-2582.

Papernot, N.; McDaniel, P.; Jha, S.; Fredrikson, M.; Celik, Z. B.; and Swami, A. 2016. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), 372-387. IEEE.

Recht, B.; Roelofs, R.; Schmidt, L.; and Shankar, V. 2018. Do CIFAR-10 classifiers generalize to CIFAR-10? arXiv preprint arXiv:1806.00451.

Sharma, Y., and Chen, P.-Y. 2018. Attacking the Madry defense model with L1-based adversarial examples. In Proc. of AAAI.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484-489.

Song, Y.; Kim, T.; Nowozin, S.; Ermon, S.; and Kushman, N. 2017. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766.

Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

Tramèr, F.; Kurakin, A.; Papernot, N.; Goodfellow, I.; Boneh, D.; and McDaniel, P. 2017. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204.

Wong, E., and Kolter, Z. 2018. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, 5283-5292.