# Adversarial Example Games

Avishek Joey Bose (Mila, McGill University; joey.bose@mail.mcgill.ca), Gauthier Gidel (Mila, Université de Montréal; gauthier.gidel@umontreal.ca), Hugo Berard (Mila, Université de Montréal; Facebook AI Research), Andre Cianflone (Mila, McGill University), Pascal Vincent (Mila, Université de Montréal; Facebook AI Research), Simon Lacoste-Julien (Mila, Université de Montréal), William L. Hamilton (Mila, McGill University)

**Abstract.** The existence of adversarial examples capable of fooling trained neural network classifiers calls for a much better understanding of possible attacks to guide the development of safeguards against them. This includes attack methods in the challenging non-interactive blackbox setting, where adversarial attacks are generated without any access, including queries, to the target model. Prior attacks in this setting have relied mainly on algorithmic innovations derived from empirical observations (e.g., that momentum helps), lacking principled transferability guarantees. In this work, we provide a theoretical foundation for crafting transferable adversarial examples against entire hypothesis classes. We introduce Adversarial Example Games (AEG), a framework that models the crafting of adversarial examples as a min-max game between a generator of attacks and a classifier. AEG provides a new way to design adversarial examples by adversarially training a generator and a classifier from a given hypothesis class (e.g., architecture). We prove that this game has an equilibrium and that the optimal generator is able to craft adversarial examples that can attack any classifier from the corresponding hypothesis class. We demonstrate the efficacy of AEG on the MNIST and CIFAR-10 datasets, outperforming prior state-of-the-art approaches with an average relative improvement of 29.9% and 47.2% against undefended and robust models (Tables 2 and 3), respectively.

## 1 Introduction

Adversarial attacks on deep neural nets expose critical vulnerabilities in traditional machine learning systems [55, 3, 64, 8]. In order to develop models that are robust to such attacks, it is imperative that we improve our theoretical understanding of different attack strategies. While there has been considerable progress in understanding the theoretical underpinnings of adversarial attacks in relatively permissive settings (e.g., whitebox adversaries [53]), there remains a substantial gap between theory and practice in more demanding and realistic threat models.

In this work, we provide a theoretical framework for understanding and analyzing adversarial attacks in the highly challenging non-interactive blackbox adversary (NoBox) setting, where the attacker has no direct access, including input-output queries, to the target classifier it seeks to fool. Instead, the attacker must generate attacks by optimizing against some representative classifiers, which are assumed to come from a similar hypothesis class as the target. The NoBox setting is much more challenging than traditional threat models, yet it is representative of many real-world attack scenarios where the attacker cannot interact with the target model [15]. Indeed, this setting, as well as the general notion of transferring attacks between classifiers, has generated an increasing amount of empirical interest [25, 51, 73, 71].
The field, however, currently lacks the necessary theoretical foundations to understand the feasibility of such attacks.

**Contributions.** To address this theoretical gap, we cast NoBox attacks as a kind of adversarial example game (AEG). In this game, an attacker generates adversarial examples to fool a representative classifier from a given hypothesis class, while the classifier itself is trained to detect the correct labels from the adversarially generated examples. Our first main result shows that the Nash equilibrium of an AEG leads to a distribution of adversarial examples effective against any classifier from the given function class. More formally, this adversarial distribution is guaranteed to be the most effective distribution for attacking the hardest-to-fool classifiers within the hypothesis class, providing a worst-case guarantee for attack success against an arbitrary target. We further show that this optimal adversarial distribution admits a natural interpretation as the distribution that maximizes a form of restricted conditional entropy over the target dataset, and we provide a detailed analysis on simple parametric models to illustrate the characteristics of this optimal adversarial distribution. Note that while AEGs are latent games [30], they are distinct from the popular generative adversarial networks (GANs) [32]: in AEGs there is no discrimination task between two datasets (a generated one and a real one); instead, there is a standard supervised (multi-class) classification task on an adversarial dataset. Guided by our theoretical results, we instantiate AEGs using parametric functions (i.e., neural networks) for both the attack generator and the representative classifier and show that the game dynamics progressively lead to stronger attacker and more robust classifier pairs. We empirically validate AEG on standard CIFAR and MNIST benchmarks and achieve state-of-the-art performance compared to existing heuristic approaches in nearly all experimental settings (e.g., transferring attacks to unseen architectures and attacking robustified models), while also maintaining a firm theoretical grounding.

## 2 Background and Preliminaries

Suppose we are given a classifier $f : \mathcal{X} \to \mathcal{Y}$, an input datapoint $x \in \mathcal{X}$, and a class label $y \in \mathcal{Y}$, where $f(x) = y$. The goal of an adversarial attack is to produce an adversarial example $x' \in \mathcal{X}$ such that $f(x') \neq y$ and the distance $d(x, x') \leq \epsilon$ (we assume the $\ell_\infty$ distance throughout, as in [33, 53], but our results generalize to any distance $d$). Intuitively, the attacker seeks to fool the classifier $f$ into making the wrong prediction on a point $x'$ that is $\epsilon$-close to a real data example $x$.

**Adversarial attacks and optimality.** A popular setting in previous research is to focus on generating optimal attacks on a single classifier $f$ [13, 53]. Given a loss function $\ell$ used to evaluate $f$, an adversarial attack is said to be optimal if

$$x' \in \operatorname*{arg\,max}_{\bar{x} \in \mathcal{X}} \; \ell(f(\bar{x}), y) \quad \text{s.t.} \quad d(x, \bar{x}) \leq \epsilon. \tag{1}$$

In practice, attack strategies that aim to realize (1) optimize adversarial examples $x'$ directly using the gradient of $f$. In this work, however, we consider the more general setting of generating attacks that are optimal against an entire hypothesis class $\mathcal{F}$, a notion that we formalize below.

### 2.1 NoBox Attacks

Threat models, which specify the formal assumptions of an attack (e.g., the information the attacker is assumed to have access to), are a core aspect of adversarial attacks. For example, in the popular whitebox threat model, the attacker is assumed to have full access to the model $f$'s parameters and outputs [65, 33, 53].
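As a point of reference, the sketch below shows how a whitebox attacker with such access typically approximates the optimum in (1): iterative gradient ascent on the loss, projected back onto the $\epsilon$-ball around $x$ (a PGD-style attack in the spirit of [53]). This is a standard baseline rather than part of the AEG framework, and the step size, iteration count, and $[0, 1]$ pixel range are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def linf_whitebox_attack(f, x, y, eps=0.3, alpha=0.01, steps=40):
    """Approximate Eq. (1) for a single known classifier f: maximize the loss by
    iterative gradient ascent, projecting back onto the l_inf ball of radius eps."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(f(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # Ascend the loss, then project back onto the constraint set d(x, x') <= eps.
        x_adv = (x_adv + alpha * grad.sign()).detach()
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)  # assumed valid pixel range [0, 1]
    return x_adv.detach()
```

Crucially, every step queries the gradient of the specific target $f$; the NoBox setting introduced next removes exactly this kind of access.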
In contrast, the blackbox threat model assumes restricted access to the model, e.g., only access to a limited number of input-output queries [18, 42, 57]. Overall, while they assume different access to the target model, traditional whitebox and blackbox attacks both attempt to generate adversarial examples that are optimal for a specific target (i.e., Equation 1). In this paper, we consider the more challenging setting of non-interactive blackbox (NoBox) attacks, which aim to generate successful attacks against an unknown target. In the NoBox setting, we assume no interactive access to a target model; instead, we only assume access to a target dataset and knowledge of the function class to which a target model belongs. Specifically, the NoBox threat model relies on the following key definitions:

- The target model $f_t$. The adversarial goal is to attack some target model $f_t : \mathcal{X} \to \mathcal{Y}$, which belongs to a hypothesis class $\mathcal{F}$. Critically, the adversary has no access to $f_t$ at any time. Thus, in order to attack $f_t$, the adversary must develop attacks that are effective against the entirety of $\mathcal{F}$.
- The target examples $\mathcal{D}$. The dataset $\mathcal{D}$ contains the examples $(x, y)$ that the attacker seeks to corrupt.
- A hypothesis class $\mathcal{F}$. As noted above, we assume that the attacker has access to a hypothesis class $\mathcal{F}$ to which the target model $f_t$ belongs. (Previous work [67] usually assumes access to the architecture of $f_t$; we are more general in assuming access to a hypothesis class $\mathcal{F}$ containing $f_t$; e.g., DenseNets can represent ConvNets.) One can incorporate in $\mathcal{F}$ as much prior knowledge as one has on $f_t$ (e.g., the architecture, dataset, training method, or regularization), going from exact knowledge of the target, $\mathcal{F} = \{f_t\}$, to almost no knowledge at all (e.g., $\mathcal{F} = \{f \in \text{DenseNets}\}$).
- A reference dataset $\mathcal{D}_{\text{ref}}$. The reference dataset $\mathcal{D}_{\text{ref}}$, which is similar to the training data of the target model (e.g., sampled from the same distribution), is used to reduce the size of the hypothesis class $\mathcal{F}$ (e.g., we know that the target model performs well at classification on $\mathcal{D}_{\text{ref}}$).
- A representative classifier $f_c$. Finally, we assume that the attacker has the ability to optimize a representative classifier $f_c$ from the hypothesis class $\mathcal{F}$.

Given these four key components, we formalize the NoBox setting as follows:

**Definition 1.** The NoBox threat model corresponds to the setting where the attacker (i) knows a hypothesis class $\mathcal{F}$ that the target model $f_t$ belongs to, (ii) has access to a reference dataset $\mathcal{D}_{\text{ref}}$ that is similar to the dataset used to train $f_t$ (e.g., sampled from the same distribution), and (iii) can optimize a representative classifier $f_c \in \mathcal{F}$. The attacker has no other knowledge of or access to the target model $f_t$ (e.g., no queries to $f_t$ are allowed). The attacker's goal is to use this limited knowledge to corrupt the examples in a given target dataset $\mathcal{D}$.

Our definition of a NoBox adversary (Def. 1) formalizes similar notions used in previous work (e.g., see Def. 3 in [67]). Previous work also often refers to related settings as generating blackbox transfer attacks, since the goal is to attack the target model $f_t$ while only having access to a representative classifier $f_c$ [25, 51, 73]. Note that our assumptions regarding dataset access are relatively weak. Like prior work, the attacker is given the target data (i.e., the examples to corrupt) as input, but this is constitutive of the task (i.e., we need access to a target example in order to corrupt it). Our only assumption is access to a reference dataset $\mathcal{D}_{\text{ref}}$, which is similar to the dataset used to train the target model.
We do not assume access to the exact training set. A stronger version of this assumption is made in prior work on blackbox transfer, since these approaches must craft their attacks on a known source model that is pretrained on the same dataset as the target model [67].

## 3 Adversarial Example Games

In order to understand the theoretical feasibility of NoBox attacks, we view the attack generation task as a form of adversarial game. The players are the generator network $g$, which learns a conditional distribution over adversarial examples, and the representative classifier $f_c$. The goal of the generator network is to learn a conditional distribution of adversarial examples that can fool the representative classifier $f_c$. The representative classifier $f_c$, on the other hand, is optimized to detect the true label $y$ from the adversarial examples $(x', y)$ generated by $g$. A critical insight in this framework is that the generator and the representative classifier are jointly optimized in a maximin game, which makes the generator's adversarial distribution at the equilibrium theoretically effective against any classifier from the hypothesis class $\mathcal{F}$ that $f_c$ is optimized over. At the same time, we will see in Proposition 1 that the min and max in our formulation (AEG) can be switched. This implies that, while being optimized, the model $f_c$ converges to a robust classifier against any attack produced by the generator $g$ [53, 70], leading to increasingly powerful attacks as the adversarial game progresses.

**Framework.** Given an input-output pair of target datapoints $(x, y) \sim \mathcal{D}$, the generator network $g$ is trained to learn a distribution of adversarial examples $p_{\mathrm{cond}}(\cdot \mid x, y)$ that, conditioned on an example to attack $(x, y)$, maps a prior distribution $p_z$ on $\mathcal{Z}$ onto a distribution on $\mathcal{X}$. The classifier network $f_c$ is simultaneously optimized to perform robust classification over the resulting distribution $p_g$ defined in (2) below. Overall, the generator $g$ and the classifier $f_c$ play the following two-player zero-sum game:

$$\max_{g \in \mathcal{G}_\epsilon} \; \min_{f_c \in \mathcal{F}} \; \mathbb{E}_{(x,y)\sim\mathcal{D},\, z\sim p_z}\big[\ell(f_c(g(x, y, z)), y)\big] \;=:\; \phi(f_c, g), \tag{AEG}$$

where the generator $g \in \mathcal{G}_\epsilon$ is restricted by the similarity constraint $d(g(x, y, z), x) \leq \epsilon$ for all $(x, y, z) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Z}$. Once the generator $g$ is trained, one can generate adversarial examples against any classifier $f_t \in \mathcal{F}$, without queries, by simply sampling $z \sim p_z$ and computing $g(x, y, z)$.

**Connection with NoBox attacks.** The NoBox threat model (Def. 1) corresponds to a setting where the attacker does not know the target model $f_t$ but only a hypothesis class $\mathcal{F}$ such that $f_t \in \mathcal{F}$. With such knowledge, one cannot hope to do better than in the most pessimistic situation, where $f_t$ is the best defender in $\mathcal{F}$. Our maximin formulation (AEG) encapsulates such a worst-case scenario, in which the generator aims at finding attacks against the best-performing $f$ in $\mathcal{F}$.
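To ground the game dynamics, the following is a minimal PyTorch sketch of one stochastic gradient descent-ascent step on this payoff (including the $\lambda$-weighted term of Eq. (3) introduced below; $\lambda = 0$ recovers (AEG)). It assumes the generator $g(x, y, z)$ enforces the $\epsilon$-ball constraint internally, as in the architecture of Section 5, and performs a plain simultaneous update with a Gaussian prior on $z$; in the experiments we instead use the ExtraAdam optimizer, and the latent dimension here is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def aeg_step(g, f_c, x, y, opt_g, opt_f, lam=0.0, ref_batch=None, latent_dim=16):
    """One simultaneous descent-ascent step on phi_lambda(f_c, g): f_c descends the
    payoff (robust classification), while g ascends it (stronger attacks)."""
    z = torch.randn(x.size(0), latent_dim, device=x.device)   # z ~ p_z (assumed Gaussian)
    x_adv = g(x, y, z)                       # assumed to satisfy ||x_adv - x||_inf <= eps
    payoff = F.cross_entropy(f_c(x_adv), y)  # E[ l(f_c(g(x, y, z)), y) ]
    if lam > 0 and ref_batch is not None:    # optional clean-data term on D_ref (Eq. 3)
        x_ref, y_ref = ref_batch
        payoff = payoff + lam * F.cross_entropy(f_c(x_ref), y_ref)

    opt_f.zero_grad()
    opt_g.zero_grad()
    payoff.backward()
    for p in g.parameters():                 # flip the generator's gradients so that its
        if p.grad is not None:               # optimizer step *ascends* the payoff
            p.grad.neg_()
    opt_f.step()
    opt_g.step()
    return payoff.item()
```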
**Objective of the generator.** When trying to attack infinite-capacity classifiers, i.e., when $\mathcal{F}$ contains any measurable function, the goal of the generator can be seen as generating the adversarial distribution $p_g$ with the highest expected conditional entropy $\mathbb{E}_x\big[-\sum_{y} p_g(y \mid x) \log p_g(y \mid x)\big]$, where $p_g$ is defined by

$$(x', y) \sim p_g \;\Longleftrightarrow\; x' = g(x, y, z), \;\; (x, y) \sim \mathcal{D}, \;\; z \sim p_z, \quad \text{with } d(x', x) \leq \epsilon. \tag{2}$$

When trying to attack a specific hypothesis class $\mathcal{F}$ (e.g., a particular CNN architecture), the generator aims at maximizing a notion of restricted entropy defined implicitly through the class $\mathcal{F}$. Thus, the optimal generator in an (AEG) is primarily determined by the statistics of the target dataset $\mathcal{D}$ itself, rather than by any specifics of a target model. We formalize these high-level concepts in §4.2.

**Regularizing the Game.** In practice, the target $f_t$ is usually trained on a non-adversarial dataset and performs well at a standard classification task. In order to reduce the size of the class $\mathcal{F}$, one can bias the representative classifier $f_c$ towards performing well on a standard classification task with respect to $\mathcal{D}_{\text{ref}}$, which leads to the following game:

$$\max_{g \in \mathcal{G}_\epsilon} \min_{f_c \in \mathcal{F}} \; \mathbb{E}_{(x,y)\sim\mathcal{D},\, z\sim p_z}\big[\ell(f_c(g(x, y, z)), y)\big] + \lambda\, \mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{ref}}}\big[\ell(f_c(x), y)\big] \;=:\; \phi_\lambda(f_c, g). \tag{3}$$

Note that $\lambda = 0$ recovers (AEG). Such modifications of the maximin objective, as well as choices about how the models are trained (e.g., optimizer, regularization, additional datasets), bias the training of $f_c$ and correspond to an implicit incorporation of prior knowledge about the target $f_t$ into the hypothesis class $\mathcal{F}$. We note that in practice, using a non-zero value for $\lambda$ is essential to achieve the most effective attacks, as this prior knowledge acts as a regularizer that incentivizes $g$ to craft attacks against classifiers that behave well on data similar to $\mathcal{D}_{\text{ref}}$.

## 4 Theoretical results

When playing an adversarial example game, the generator and the representative classifier try to beat each other by maximizing their own objective. In games, a standard notion of optimality is the concept of Nash equilibrium [56], where no player can improve its objective value by unilaterally changing its strategy. The minimax result in Prop. 1 implies the existence of a Nash equilibrium for the game, consequently providing a well-defined target for learning (we want to learn the equilibrium of that game). Moreover, a Nash equilibrium is a stationary point for gradient descent-ascent dynamics; we can thus hope to reach such a solution by using a gradient-descent-ascent-based learning algorithm on (AEG). (Note that, as in practical GAN training, when the classifier and the generator are parametrized by neural networks, providing convergence guarantees for a gradient-based method in such a nonconvex-nonconcave minimax game is an open question that is outside the scope of this work.)

**Proposition 1.** If $\ell$ is convex (e.g., the cross-entropy or mean squared loss), the distance $x' \mapsto d(x, x')$ is convex for any $x \in \mathcal{X}$, one has access to any measurable $g$ respecting the proximity constraint in (2), and the hypothesis class $\mathcal{F}$ is convex, then we can switch the min and max in (AEG), i.e.,

$$\min_{f_c \in \mathcal{F}} \max_{g \in \mathcal{G}_\epsilon} \phi_\lambda(f_c, g) = \max_{g \in \mathcal{G}_\epsilon} \min_{f_c \in \mathcal{F}} \phi_\lambda(f_c, g). \tag{4}$$

*Proof sketch.* We first notice that, by (2), any $g$ corresponds to a distribution $p_g$, and thus we have

$$\phi(f_c, g) := \mathbb{E}_{(x,y)\sim\mathcal{D},\, z\sim p_z}\big[\ell(f_c(g(x,y,z)), y)\big] = \mathbb{E}_{(x',y)\sim p_g}\big[\ell(f_c(x'), y)\big] =: \phi(f_c, p_g).$$

Consequently, we also have $\phi_\lambda(f_c, g) = \phi_\lambda(f_c, p_g)$. Writing $\mathcal{P}_\epsilon := \{p_g : g \in \mathcal{G}_\epsilon\}$, we have

$$\min_{f_c \in \mathcal{F}} \max_{p_g \in \mathcal{P}_\epsilon} \phi_\lambda(f_c, p_g) = \min_{f_c \in \mathcal{F}} \max_{g \in \mathcal{G}_\epsilon} \phi_\lambda(f_c, g) \quad \text{and} \quad \max_{p_g \in \mathcal{P}_\epsilon} \min_{f_c \in \mathcal{F}} \phi_\lambda(f_c, p_g) = \max_{g \in \mathcal{G}_\epsilon} \min_{f_c \in \mathcal{F}} \phi_\lambda(f_c, g).$$

In other words, we can replace the optimization over the generator $g \in \mathcal{G}_\epsilon$ with an optimization over the set $\mathcal{P}_\epsilon$ of adversarial distributions induced by generators in $\mathcal{G}_\epsilon$. This equivalence holds by construction of $\mathcal{P}_\epsilon$, which ensures that $\max_{g \in \mathcal{G}_\epsilon} \phi_\lambda(f_c, g) = \max_{p_g \in \mathcal{P}_\epsilon} \phi_\lambda(f_c, p_g)$ for any $f_c \in \mathcal{F}$.
We finally use Fan's theorem [28] after showing that $(f_c, p_g) \mapsto \phi_\lambda(f_c, p_g)$ is convex-concave (by convexity of $\ell$ and linearity of $p_g \mapsto \mathbb{E}_{p_g}$) and that $\mathcal{P}_\epsilon$ is a compact convex set. In particular, $\mathcal{P}_\epsilon$ is compact convex under the assumption that we can achieve any measurable $g$ (detailed in Appendix A).

Under the convexity assumption on the hypothesis class $\mathcal{F}$, Prop. 1 applies in two main cases of interest: (i) infinite capacity, i.e., when $\mathcal{F}$ contains any measurable function; and (ii) linear classifiers with fixed features $\psi : \mathcal{X} \to \mathbb{R}^p$, i.e., $\mathcal{F} = \{w\,\psi(\cdot),\; w \in \mathbb{R}^{|\mathcal{Y}| \times p}\}$. This second setting is particularly useful to build intuition about the properties of (AEG), as we will see in §4.1 and Fig. 1. The assumption that we have access to any measurable $g$, while relatively strong, is standard in the literature and is often stated in prior work as "if $g$ has enough capacity" [33, Prop. 2]. Even though the class of neural networks with a fixed architecture does not satisfy the assumptions of this proposition, the key idea is that neural networks are good candidates to approximate that equilibrium because they are universal approximators [38] and they form a set that is "almost convex" [30]. Proving a similar minimax theorem by only considering neural networks is a challenging problem that has been considered by Gidel et al. [30] in a related setting: it requires a fine-grained analysis of the properties of certain neural network architectures and is only valid for approximate minimax. We consider such an analysis outside the scope of this work.

### 4.1 A simple setup: binary classification with logistic regression

Let us now consider a binary classification setup where $\mathcal{Y} = \{\pm 1\}$ and $\mathcal{F}$ is the class of linear classifiers with linear features, i.e., $f_w(x) = w^\top x$. In this case, the payoff of the game (AEG) is

$$\phi(f_w, g) := \mathbb{E}_{(x,y)\sim\mathcal{D},\, z\sim p_z}\big[\log\big(1 + e^{-y\, w^\top g(x,y,z)}\big)\big]. \tag{5}$$

This example is similar to the one presented in [33]. However, our purpose is different, since we focus on characterizing the optimal generator in (4). We show that the optimal generator can attack any classifier in $\mathcal{F}$ by shifting the means of the two classes of the dataset $\mathcal{D}$.

**Proposition 2.** If the generator is allowed to generate any $\ell_\infty$ perturbation of size at most $\epsilon$, the optimal linear representative classifier is the solution of the following $\ell_1$-regularized logistic regression:

$$w^* \in \operatorname*{arg\,min}_{w} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\log\big(1 + e^{-y\, w^\top x + \epsilon \|w\|_1}\big)\big]. \tag{6}$$

Moreover, if $w^*$ has no zero entry, the optimal generator $g^*(x, y) = x - y\,\epsilon\,\mathrm{sign}(w^*)$ is deterministic, and the pair $(f_{w^*}, g^*)$ is a Nash equilibrium of the game (5).

A surprising fact is that, unlike in the general setting of Prop. 1, the generator in Prop. 2 is deterministic (i.e., it does not depend on a latent variable $z$). This follows from the simple structure of classifiers in this class, which allows for a closed-form solution for $g^*$. In general, one cannot expect to achieve an equilibrium with a deterministic generator. Indeed, with this example our goal is simply to illustrate how the optimal generator can attack an entire class of functions with limited capacity: linear classifiers are mostly sensitive to the mean of the distribution of each class, and the optimal generator exploits this fact by moving these means closer to the decision boundary. (One can also generalize Prop. 2 to perturbations with respect to a general norm $\|\cdot\|$; in that case, the $\epsilon$-regularization for the classifier is with respect to the dual norm $\|\cdot\|_* := \max_{\|u\| \leq 1} \langle \cdot, u \rangle$. E.g., as previously noted by Goodfellow et al. [33], an $\ell_\infty$ adversarial perturbation leads to $\ell_1$ regularization.)
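To see the closed form of Prop. 2 in action, the short NumPy check below applies $g^*(x, y) = x - y\,\epsilon\,\mathrm{sign}(w)$ to an arbitrary fixed linear classifier $w$ and verifies that every example's margin shrinks by exactly $\epsilon \|w\|_1$, which is how the attack moves both class means toward the decision boundary; the synthetic data and weight vector are purely illustrative.

```python
import numpy as np

def optimal_linear_attack(w, X, y, eps):
    """Closed-form generator of Prop. 2 against f_w(x) = w.x: shift each example
    against its label along sign(w), i.e. g*(x, y) = x - y * eps * sign(w)."""
    return X - np.outer(y, eps * np.sign(w))

rng = np.random.default_rng(0)
w = rng.normal(size=5)                      # some fixed linear classifier
X = rng.normal(size=(8, 5))
y = np.sign(X @ w)                          # labels in {-1, +1} that w currently gets right
margin_clean = y * (X @ w)
margin_adv = y * (optimal_linear_attack(w, X, y, eps=0.5) @ w)
# Each margin drops by exactly eps * ||w||_1, pushing the class means toward the boundary.
assert np.allclose(margin_clean - margin_adv, 0.5 * np.linalg.norm(w, 1))
```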
Figure 1: Illustration of Proposition 3 for three classes of classifiers in the context of logistic regression on the two-moons dataset of scikit-learn [60], with linear and polynomial (degree 3 and 5) features. Left: scatter plot of the clean or adversarial dataset and the associated optimal decision boundary; each clean example is connected to its respective adversarial example. Right: value of the $\mathcal{F}$-entropy for the different classes as a function of the number of iterations.

### 4.2 General multi-class classification

In this section, we show that, for a given hypothesis class $\mathcal{F}$, the generated distribution achieving the global maximin against $f_c \in \mathcal{F}$ can be interpreted as the distribution with the highest $\mathcal{F}$-entropy. For a given distribution $p_g$, its $\mathcal{F}$-entropy is the minimum expected risk under $p_g$ that one can achieve in $\mathcal{F}$.

**Definition 2.** For a given distribution $(x, y) \sim p_g$, we define the $\mathcal{F}$-entropy of $p_g$ as

$$H_{\mathcal{F}}(p_g) := \min_{f_c \in \mathcal{F}} \; \mathbb{E}_{(x,y)\sim p_g}\big[\ell(f_c(x), y)\big], \quad \text{where } \ell \text{ is the cross-entropy loss.} \tag{7}$$

The $\mathcal{F}$-entropy thus quantifies the amount of "classification information" available in $p_g$ using the class of classifiers $\mathcal{F}$. If the $\mathcal{F}$-entropy is large, $(x, y) \sim p_g$ cannot be easily classified with a function $f_c$ in $\mathcal{F}$. Moreover, it is an upper bound on the expected conditional entropy of the distribution $p_g$.

**Proposition 3.** The $\mathcal{F}$-entropy is a decreasing function of $\mathcal{F}$, i.e., for any $\mathcal{F}_1 \subseteq \mathcal{F}_2$,

$$H_{\mathcal{F}_1}(p_g) \;\geq\; H_{\mathcal{F}_2}(p_g) \;\geq\; H_y(p_g) := \mathbb{E}_{x \sim p_x}\big[H(p_g(\cdot \mid x))\big],$$

where $H(p(\cdot \mid x)) := -\sum_{y \in \mathcal{Y}} p(y \mid x) \ln p(y \mid x)$ is the entropy of the conditional distribution $p(y \mid x)$. Here $p_g$ is defined as in (2) and implicitly depends on $\mathcal{D}$.

For a given class $\mathcal{F}$, the solution to an (AEG) game can be seen as one that finds a regularized adversarial distribution of maximal $\mathcal{F}$-entropy:

$$\max_{g \in \mathcal{G}_\epsilon} \min_{f_c \in \mathcal{F}} \phi_\lambda(f_c, g) = (1 + \lambda) \max_{g \in \mathcal{G}_\epsilon} H_{\mathcal{F}}\Big(\tfrac{1}{1+\lambda}\, p_g + \tfrac{\lambda}{1+\lambda}\, \mathcal{D}_{\text{ref}}\Big), \tag{8}$$

where the distribution $\tfrac{1}{1+\lambda} p_g + \tfrac{\lambda}{1+\lambda} \mathcal{D}_{\text{ref}}$ is the mixture of the generated distribution $p_g$ and the empirical distribution over the dataset $\mathcal{D}_{\text{ref}}$. This alternative perspective on the game (AEG) shares similarities with the divergence-minimization perspective on GANs [40]; however, while in GANs the objective represents a divergence between two distributions, in (AEG) it corresponds to a notion of entropy. A high-level interpretation of $\mathcal{F}$-entropy maximization is that it implicitly defines a metric for distributions that are challenging to classify with access only to classifiers in $\mathcal{F}$. Overall, the optimal generated distribution $p_g$ can be seen as the most adversarial dataset against the class $\mathcal{F}$.

**Properties of the $\mathcal{F}$-entropy.** We illustrate the idea that the optimal generator and the $\mathcal{F}$-entropy depend on the hypothesis class $\mathcal{F}$ using a simple example. To do so, we perform logistic regression (5) with linear and polynomial (degree 3 and 5) features (respectively called Linear, Poly3, and Poly5) on the two-moons dataset of scikit-learn [60]. Note that we have Linear ⊆ Poly3 ⊆ Poly5. For simplicity, we consider a deterministic generator $g(x, y)$ that is realized by computing the maximization step via a 2D grid search over the $\epsilon$-neighborhood of $x$. We train our models by successively fully solving the minimization step and the maximization step in (5); a minimal sketch of this procedure is given below. We present the results in Figure 1.
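The sketch below illustrates this alternating procedure, assuming scikit-learn's `make_moons`, `PolynomialFeatures`, and `LogisticRegression`; the perturbation budget, grid resolution, iteration count, and regularization strength are illustrative choices rather than the exact settings behind Figure 1.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

eps = 0.3                                   # illustrative l_inf budget
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
poly = PolynomialFeatures(degree=3)         # the "Poly3" feature map (use 1 or 5 for the others)
clf = LogisticRegression(C=1e3, max_iter=5000)

# 2D grid of candidate perturbations covering the eps-neighborhood of each point.
deltas = np.stack(np.meshgrid(np.linspace(-eps, eps, 21),
                              np.linspace(-eps, eps, 21)), axis=-1).reshape(-1, 2)

X_adv = X.copy()
for it in range(5):
    # Minimization step: fit the optimal classifier on the current adversarial dataset;
    # its cross-entropy on that dataset estimates the F-entropy of p_g.
    clf.fit(poly.fit_transform(X_adv), y)
    probs = clf.predict_proba(poly.transform(X_adv))
    print(f"iter {it}: F-entropy ~ {-np.mean(np.log(probs[np.arange(len(y)), y])):.3f}")
    # Maximization step: per-example grid search for the loss-maximizing perturbation.
    for i, (x_i, y_i) in enumerate(zip(X, y)):
        cand = x_i + deltas                 # all candidates stay eps-close to x_i
        p_true = clf.predict_proba(poly.transform(cand))[:, y_i]
        X_adv[i] = cand[np.argmin(p_true)]  # lowest true-class probability = highest loss
```

Swapping the feature degree between 1, 3, and 5 should qualitatively reproduce the right panel of Figure 1, with richer classes attaining a lower $\mathcal{F}$-entropy.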
One iteration corresponds to the computation of the optimal classifier against the current adversarial distribution $p_g$ (also giving the value of the $\mathcal{F}$-entropy), followed by the computation of the new optimal adversarial $p_g$ against this new classifier. The left plot illustrates the fact that the way of attacking a dataset depends on the class considered. For instance, when considering linear classifiers, the attack is a uniform translation of all the data points of the same class, while when considering polynomial features, the optimal adversarial dataset pushes the corners of the two moons closer together. In the right plot, we see an illustration of Proposition 3, where the $\mathcal{F}$-entropy takes a smaller value for larger classes of classifiers.

Figure 2: AEG framework architecture.

## 5 Attacking in the Wild: Experiments and Results

We investigate the application of our AEG framework to produce adversarial examples against MNIST and CIFAR-10 classifiers. First, we investigate our performance in a challenging NoBox setting where we must attack an unseen target model with knowledge of only its hypothesis class (i.e., architecture) and a sample of similar training data (§5.1). Following this, we investigate how well AEG attacks transfer across architectures (§5.2), as well as AEG's performance when attacking robust classifiers (§5.3).

**Experimental setup.** We perform all attacks, including baselines, with respect to the $\ell_\infty$-norm constraint with $\epsilon = 0.3$ for MNIST and $\epsilon = 0.03125$ for CIFAR-10. For AEG models, we train both the generator ($g$) and the representative classifier ($f_c$) using stochastic gradient descent-ascent with the ExtraAdam optimizer [29], and held-out target models $f_t$ are trained offline using SGD with Armijo line search [69]. Full details of our model architectures, including hyperparameters, employed in our AEG framework can be found in Appendix D (code: https://github.com/joeybose/Adversarial-Example-Games.git).

**Baselines.** Throughout our experiments we rely on four standard blackbox transfer attack strategies adapted to the NoBox setting: the Momentum-Iterative attack (MI-Attack) [24], the Input Diversity attack (DI-Attack) [73], the Translation-Invariant attack (TID-Attack) [25], and the Skip Gradient Method (SGM-Attack) [71]. For fair comparison, we inherit all hyperparameter settings from their respective papers. Note that the SGM-Attack is only defined for architectures that contain skip connections (e.g., ResNets).

**AEG architecture.** The high-level architecture of our AEG framework is illustrated in Figure 2. The generator takes the input $x$ and encodes it into $\psi(x)$; it then uses this encoding to compute a probability vector $p(\psi(x))$ in the probability simplex of size $K$, the number of classes. Using this probability vector, the network samples a categorical variable $z$ according to a multinomial distribution with parameter $p(\psi(x))$. Intuitively, this category may correspond to a target for the attack. The gradient is backpropagated through this categorical variable using the Gumbel-softmax trick [45, 52]. Finally, the decoder takes as input $\psi(x)$, $z$, and the label $y$ and outputs an adversarial perturbation $\delta$ such that $\|\delta\|_\infty \leq \epsilon$. In order to generate adversarial perturbations over images that obey the $\epsilon$-ball constraint, we employ a scaled tanh output layer to map the output of the generator into $(0, 1)$, subtract the clean image, and finally apply an elementwise multiplication by $\epsilon$. We then compute $\ell(f(x + \delta), y)$, where $f$ is the critic and $\ell$ the cross-entropy loss. Further details can be found in Appendix D.
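To make the $\epsilon$-ball construction concrete, here is a minimal PyTorch sketch of the generator's output stage described above (a Gumbel-softmax categorical latent followed by a tanh-scaled perturbation). The flattened-image input, MLP decoder, and layer sizes are illustrative assumptions, not the exact architecture of Appendix D.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerturbationDecoder(nn.Module):
    """Output stage of the AEG generator: sample a categorical latent z from p(psi(x))
    with the Gumbel-softmax trick, then decode a perturbation with ||delta||_inf <= eps
    (assuming clean inputs lie in [0, 1])."""

    def __init__(self, feat_dim, num_classes, eps, img_dim=784, hidden=512):
        super().__init__()
        self.eps = eps
        self.to_logits = nn.Linear(feat_dim, num_classes)        # p(psi(x)) over K classes
        self.decode = nn.Sequential(
            nn.Linear(feat_dim + 2 * num_classes, hidden), nn.ReLU(),
            nn.Linear(hidden, img_dim),
        )

    def forward(self, psi_x, y_onehot, x_flat, tau=1.0):
        # Differentiable sample of the categorical latent (Gumbel-softmax / Concrete).
        z = F.gumbel_softmax(self.to_logits(psi_x), tau=tau, hard=True)
        h = self.decode(torch.cat([psi_x, z, y_onehot], dim=-1))
        # Scaled tanh maps the decoder output into (0, 1); subtracting the clean image
        # and multiplying by eps yields a perturbation bounded by eps in l_inf norm.
        delta = self.eps * (0.5 * (torch.tanh(h) + 1.0) - x_flat)
        return x_flat + delta
```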
### 5.1 NoBox Attacks on a Known Architecture Class but Unknown Train Set

We first evaluate the AEG framework in a NoBox setting where we know only the architecture of the target model and have access only to a sample of similar training data (but not the exact training data of the target model). To simulate having access to a similar (but not identical) dataset as the target model, for each dataset we create random equally sized splits of the data (10000 examples per split). Within each split, we use one fold to train the split classifier, which acts as the representative classifier for all attackers; the attackers are then evaluated on their ability to fool the remaining split classifiers on unseen target examples $\mathcal{D}$. For the MNIST dataset we consider a LeNet classifier [47], while for CIFAR-10 we consider ResNet-18 [37]. Table 1 shows the results of our experiments on this task, averaged across all splits and folds. We see that our AEG approach achieves state-of-the-art results, either outperforming or matching (within a 95% confidence interval) all baselines in both settings. Note that this task is significantly more challenging than many prior blackbox attack setups, which assume access to the full training data of the target model (we include results for a more permissive setting with access to the full training data in Appendix C.1).

| Dataset | MI-Attack | DI-Attack | TID-Attack | SGM-Attack | AEG (Ours) |
|---|---|---|---|---|---|
| MNIST | 87.5 ± 2.7 | 89.5 ± 2.5 | 85.4 ± 2.8 | N/A | 89.5 ± 3.2 |
| CIFAR-10 (Res18) | 56.8 ± 1.2 | 84.0 ± 1.5 | 9.1 ± 1.6 | 60.5 ± 1.5 | 87.0 ± 2.1 |

Table 1: Attack success rates, averaged across target models, with 95% confidence intervals shown; statistically significant differences from AEG are determined by a paired t-test.

### 5.2 NoBox Attacks Across Distinct Architectures

We now consider NoBox attacks where we do not know the architecture of the target model but the training data is known, a setting previously referred to as blackbox transfer [67]. For evaluation, we use CIFAR-10 and train 10 instances of the VGG-16 [62], ResNet-18 (RN-18) [37], Wide ResNet (WR) [75], DenseNet-121 (DN-121) [39], and Inception-V3 (Inc-V3) [66] architectures. Here, we optimize the attack approaches against a single pre-trained classifier from a particular architecture and then evaluate their attack success on classifiers from distinct architectures, averaged over 5 instantiations. Our findings when using ResNet-18, DenseNet-121, and VGG-16 as the source architecture are provided in Table 2. Overall, we find that AEG beats all other approaches and leads to a new state of the art. In particular, AEG outperforms the best baseline in each setting by an average of 29.9% across the different source architectures, with individual average gains of 9.4%, 36.2%, and 44.0% when using RN-18, DN-121, and VGG-16 source models, respectively.
| Source | Attack | VGG-16 | RN-18 | WR | DN-121 | Inc-V3 |
|---|---|---|---|---|---|---|
| – | Clean | 11.2 ± 1.8 | 13.1 ± 4.0 | 6.8 ± 1.4 | 11.2 ± 2.8 | 9.9 ± 2.6 |
| RN-18 | MI-Attack | 63.9 ± 2.6 | 74.6 ± 0.8 | 63.1 ± 2.4 | 72.5 ± 2.6 | 67.9 ± 3.2 |
| RN-18 | DI-Attack | 77.4 ± 3.4 | 90.2 ± 1.6 | 74.0 ± 2.0 | 87.1 ± 2.6 | 85.8 ± 1.6 |
| RN-18 | TID-Attack | 21.6 ± 2.6 | 26.5 ± 4.8 | 14.0 ± 3.0 | 22.3 ± 3.2 | 19.8 ± 1.8 |
| RN-18 | SGM-Attack | 68.4 ± 3.6 | 79.5 ± 1.0 | 64.3 ± 3.2 | 73.8 ± 2.0 | 70.6 ± 3.4 |
| RN-18 | AEG (Ours) | 93.8 ± 0.7 | 97.1 ± 0.4 | 80.2 ± 2.2 | 93.1 ± 1.3 | 88.4 ± 1.6 |
| DN-121 | MI-Attack | 54.3 ± 2.2 | 62.5 ± 1.8 | 56.3 ± 2.6 | 66.1 ± 3.0 | 65.0 ± 2.6 |
| DN-121 | DI-Attack | 61.1 ± 3.8 | 69.1 ± 1.6 | 61.9 ± 2.2 | 77.1 ± 2.4 | 71.6 ± 3.2 |
| DN-121 | TID-Attack | 21.7 ± 2.4 | 23.8 ± 3.0 | 14.0 ± 2.8 | 21.7 ± 2.2 | 19.3 ± 2.4 |
| DN-121 | SGM-Attack | 51.6 ± 1.4 | 60.2 ± 2.6 | 52.6 ± 1.8 | 64.7 ± 3.2 | 61.4 ± 2.6 |
| DN-121 | AEG (Ours) | 93.7 ± 1.0 | 97.3 ± 0.6 | 81.8 ± 3.0 | 96.7 ± 0.8 | 92.7 ± 1.6 |
| VGG-16 | MI-Attack | 49.9 ± 0.2 | 50.0 ± 0.4 | 46.7 ± 0.8 | 50.4 ± 1.2 | 50.0 ± 0.6 |
| VGG-16 | DI-Attack | 65.1 ± 0.2 | 64.5 ± 0.4 | 58.8 ± 1.2 | 64.1 ± 0.6 | 60.9 ± 1.2 |
| VGG-16 | TID-Attack | 26.2 ± 1.2 | 24.0 ± 1.2 | 13.0 ± 0.4 | 20.8 ± 1.4 | 18.8 ± 0.4 |
| VGG-16 | AEG (Ours) | 97.5 ± 0.4 | 96.1 ± 0.5 | 85.2 ± 2.2 | 94.1 ± 1.2 | 89.5 ± 1.3 |

Table 2: Error rates on $\mathcal{D}$ for average NoBox architecture-transfer attacks with $\epsilon = 0.03125$. The ± values correspond to 2 standard deviations (a 95.5% confidence interval for normal distributions).

### 5.3 NoBox Attacks Against Robust Classifiers

We now test the ability of our AEG framework to attack target models that have been robustified using adversarial and ensemble adversarial training [53, 67]. For evaluation against PGD adversarial training, we use the public models from the MNIST and CIFAR-10 adversarial examples challenges (https://github.com/MadryLab/[x]_challenge, for [x] in {cifar10, mnist}); note that our threat model is more challenging than these challenges, as we use non-robust source models. For ensemble adversarial training, we follow the approach of Tramèr et al. [67] (see Appendix D.3). We report our results in Table 3, averaging the results of stochastic attacks over 5 runs. We find that AEG achieves state-of-the-art performance in all settings, providing an average improvement in success rates of 54.1% across all robustified MNIST models and 40.3% on robustified CIFAR-10 models.

| Dataset | Defence | Clean | MI-Att | DI-Att | TID-Att | SGM-Att | AEG (Ours) |
|---|---|---|---|---|---|---|---|
| MNIST | A_ens4 | 0.8 | 43.4 | 42.7 | 16.0 | N/A | 65.0 |
| MNIST | B_ens4 | 0.7 | 20.7 | 22.8 | 8.5 | N/A | 50.0 |
| MNIST | C_ens4 | 0.8 | 73.8 | 30.0 | 9.5 | N/A | 80.0 |
| MNIST | D_ens4 | 1.8 | 84.4 | 76.0 | 81.3 | N/A | 86.7 |
| MNIST | Madry-Adv | 0.8 | 2.0 | 3.1 | 2.5 | N/A | 5.9 |
| CIFAR-10 | RN-18_ens3 | 16.8 | 17.6 | 21.6 | 33.1 | 19.9 | 52.2 |
| CIFAR-10 | WR_ens3 | 12.8 | 18.4 | 20.6 | 28.8 | 18.0 | 49.9 |
| CIFAR-10 | DN-121_ens3 | 21.5 | 20.3 | 22.7 | 31.3 | 21.9 | 41.4 |
| CIFAR-10 | Inc-V3_ens3 | 14.8 | 19.5 | 42.2* | 30.2 | 35.5* | 47.5 |
| CIFAR-10 | Madry-Adv | 12.9 | 17.2 | 16.6 | 16.6 | 16.0 | 21.6 |

Table 3: Error rates on $\mathcal{D}$ for NoBox known-architecture attacks against adversarial training and ensemble adversarial training. Attacks were crafted using WR. *Deterministic attack.

## 6 Related Work

In addition to the non-interactive blackbox adversaries we compare against, there exist multiple hybrid approaches that craft attacks on surrogate models, which then serve as a good initialization point for queries to the target model [57, 61, 41]. Other notable approaches to crafting blackbox transfer attacks include learning ghost networks [48], transforming whitebox gradients with small ResNets [50], and studying the transferability properties of linear classifiers and two-layer ReLU networks [17]. There is also a burgeoning literature on using parametric models to craft adversarial attacks, such as the Adversarial Transformation Networks framework and its variants [4, 72]. Similar in spirit to our approach, many attack strategies benefit from employing a latent space to craft attacks [76, 68, 9]. However, unlike our work, these strategies cannot be used to attack entire hypothesis classes.
Adversarial prediction games between a learner and a data generator have also been studied in the literature [11] and, in certain situations, correspond to a Stackelberg game [10]. While similar in spirit, our theoretical framework is tailored towards crafting adversarial attacks against a fixed, held-out target model in the novel NoBox threat model and is a fundamentally different attack paradigm. Finally, Erraqabi et al. [27] also investigate an adversarial game framework as a means for building robust representations, in which an additional discriminator is trained to discriminate adversarial examples from natural ones based on the representation of the current classifier.

## 7 Conclusion

In this paper, we introduce the Adversarial Example Games (AEG) framework, which provides a principled foundation for crafting adversarial attacks in the NoBox threat model. Our work sheds light on the existence of adversarial examples as a natural consequence of restricted entropy maximization under a hypothesis class and leads to an actionable strategy for attacking all functions taken from this class. Empirically, we observe that our approach leads to state-of-the-art results when generating attacks on MNIST and CIFAR-10 in a number of challenging NoBox attack settings. Our framework and results point to a promising new direction for theoretically motivated adversarial frameworks. However, one major challenge is scaling up the AEG framework to larger datasets (e.g., ImageNet), which would involve addressing some of the inherent challenges of saddle-point optimization [5]. Investigating the utility of the AEG framework for training robustified models is another natural direction for future work.

## Broader Impact

Adversarial attacks, especially ones under more realistic threat models, pose several important security, ethical, and privacy risks. In this work, we introduce the NoBox attack setting, which generalizes many other blackbox transfer settings, and we provide a novel framework to ground and study attacks theoretically, as well as their transferability to other functions within a class of functions. As the NoBox threat model represents a more realistic setting for adversarial attacks, our research has the potential to be used against a class of machine learning models in the wild. In particular, in terms of risk, malicious actors could use approaches based on our framework to generate attack vectors that compromise production ML systems or potentially bias them toward specific outcomes. As a concrete example, one can consider creating transferable examples in the physical world, such as against the computer vision systems of autonomous cars. While prior works have shown the possibility of such adversarial examples (i.e., adversarial traffic signs), we note that there is a significant gap in translating synthetic adversarial examples to adversarial examples that reside in the physical world [45]. Understanding and analyzing the NoBox transferability of adversarial examples to the physical world, in order to provide public and academic visibility on these risks, is an important direction for future research. Based on the known risks of designing new kinds of adversarial attacks discussed above, we now outline the ways in which our research is informed by the intent to mitigate these potential societal risks.
For instance, our research demonstrates that one can successfully craft adversarial attacks even in the challenging NoBox setting, which raises many important considerations for developing robustness approaches. A straightforward extension is to consider our adversarial example game (AEG) framework as a tool for training robust models. On the theoretical side, exploring formal verification of neural networks against NoBox adversaries is an exciting direction for continued exploration. As an application, ML practitioners in industry may choose to employ new forms of A/B testing with different types of adversarial examples, of which AEG is one method, to robustify and stress-test production systems further. Such an application falls in line with other general approaches to red teaming AI systems [10] and verifiability in AI development. In essence, the goal of such approaches, including adversarial examples for robustness, is to align the failure modes of AI systems with those found in human decision making.

## Acknowledgments and Disclosure of Funding

The authors would like to acknowledge Olivier Mastropietro, Chongli Qin, and David Balduzzi for helpful discussions, as well as Sebastian Lachapelle, Pouya Bashivan, Yanshuai Cao, Gavin Ding, Ioannis Mitliagkas, Nadeem Ward, and Damien Scieur for reviewing early drafts of this work.

**Funding.** This work is partially supported by the Canada CIFAR AI Chair Program (held at Mila), NSERC Discovery Grant RGPIN-2019-05123 (held by Will Hamilton at McGill), NSERC Discovery Grant RGPIN-2017-06936, an IVADO Fundamental Research Project grant PRF-2019-3583139727, and a Google Focused Research award (both held at U. Montreal by Simon Lacoste-Julien). Joey Bose was also supported by an IVADO PhD fellowship, Gauthier Gidel by a Borealis AI fellowship and by the Canada Excellence Research Chair in "Data Science for Real-Time Decision-making" (held at Polytechnique by Andrea Lodi), and Andre Cianflone by an NSERC scholarship and a Borealis AI fellowship. Simon Lacoste-Julien and Pascal Vincent are CIFAR Associate Fellows in the Learning in Machines & Brains program. Finally, we thank Facebook for access to computational resources.

**Competing interests.** Joey Bose was formerly at FaceShield.ai, which was acquired in 2020. W.L. Hamilton was formerly a Visiting Researcher at Facebook AI Research. Simon Lacoste-Julien additionally works part time as the head of the SAIT AI Lab, Montreal from Samsung.

## References

[1] N. Akhtar and A. Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access, 6:14410-14430, 2018.
[2] M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein. Square attack: A query-efficient black-box adversarial attack via random search. Sixteenth European Conference on Computer Vision (ECCV), 2020.
[3] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok. Synthesizing robust adversarial examples. In Thirty-Fifth International Conference on Machine Learning (ICML), 2018.
[4] S. Baluja and I. Fischer. Learning to attack: Adversarial transformation networks. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018.
[5] H. Berard, G. Gidel, A. Almahairi, P. Vincent, and S. Lacoste-Julien. A closer look at the optimization landscapes of generative adversarial networks. In Eighth International Conference on Learning Representations (ICLR), 2020.
[6] S. Bhambri, S. Muku, A. Tulasi, and A. B. Buduru. A survey of black-box adversarial attacks on computer vision models. arXiv preprint arXiv:1912.01667, 2019.
[7] P. Billingsley. Convergence of Probability Measures. John Wiley & Sons, 1999.
[8] A. J. Bose and P. Aarabi. Adversarial attacks on face detectors using neural net based constrained optimization. In 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP). IEEE, 2018.
[9] A. J. Bose, A. Cianflone, and W. Hamilton. Generalizable adversarial attacks using generative models. arXiv preprint arXiv:1905.10864, 2019.
[10] M. Brückner and T. Scheffer. Stackelberg games for adversarial prediction problems. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011.
[11] M. Brückner, C. Kanzow, and T. Scheffer. Static prediction games for adversarial learning problems. The Journal of Machine Learning Research, 13(1):2617-2654, 2012.
[12] J. Buckman, A. Roy, C. Raffel, and I. Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. In International Conference on Learning Representations (ICLR), 2018.
[13] N. Carlini and D. Wagner. MagNet and "efficient defenses against adversarial attacks" are not robust to adversarial examples. arXiv preprint arXiv:1711.08478, 2017.
[14] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017.
[15] N. Carlini, A. Athalye, N. Papernot, W. Brendel, J. Rauber, D. Tsipras, I. Goodfellow, A. Madry, and A. Kurakin. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019.
[16] A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, and D. Mukhopadhyay. Adversarial attacks and defences: A survey. arXiv preprint arXiv:1810.00069, 2018.
[17] Z. Charles, H. Rosenberg, and D. Papailiopoulos. A geometric perspective on the transferability of adversarial directions. In Twenty-Second International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
[18] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the Tenth ACM Workshop on Artificial Intelligence and Security. ACM, 2017.
[19] F. Croce and M. Hein. Minimally distorted adversarial examples with a fast adaptive boundary attack. Eighth International Conference on Learning Representations (ICLR), 2019.
[20] F. Croce and M. Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. Thirty-Seventh International Conference on Machine Learning (ICML), 2020.
[21] G. S. Dhillon, K. Azizzadenesheli, Z. C. Lipton, J. Bernstein, J. Kossaifi, A. Khanna, and A. Anandkumar. Stochastic activation pruning for robust adversarial defense. Sixth International Conference on Learning Representations (ICLR), 2018.
[22] G. W. Ding, L. Wang, and X. Jin. AdverTorch v0.1: An adversarial robustness toolbox based on PyTorch. arXiv preprint arXiv:1902.07623, 2019.
[23] G. W. Ding, Y. Sharma, K. Y. C. Lui, and R. Huang. MMA training: Direct input space margin maximization through adversarial training. In International Conference on Learning Representations (ICLR), 2020.
[24] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li. Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[25] Y. Dong, T. Pang, H. Su, and J. Zhu. Evading defenses to transferable adversarial examples by translation-invariant attacks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[26] J. Du, H. Zhang, J. T. Zhou, Y. Yang, and J. Feng. Query-efficient meta attack to deep neural networks. Eighth International Conference on Learning Representations (ICLR), 2020.
[27] A. Erraqabi, A. Baratin, Y. Bengio, and S. Lacoste-Julien. A3T: Adversarially augmented adversarial training. arXiv preprint arXiv:1801.04055, 2018.
[28] K. Fan. Minimax theorems. Proceedings of the National Academy of Sciences of the United States of America, 1953.
[29] G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien. A variational inequality perspective on generative adversarial networks. In Seventh International Conference on Learning Representations (ICLR), 2019.
[30] G. Gidel, D. Balduzzi, W. M. Czarnecki, M. Garnelo, and Y. Bachrach. Minimax theorem for latent games or: How I learned to stop worrying about mixed-Nash and love neural nets. arXiv preprint arXiv:2002.05820, 2020.
[31] Z. Gong, W. Wang, and W.-S. Ku. Adversarial and clean data are not twins. arXiv preprint arXiv:1705.04960, 2017.
[32] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
[33] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. Third International Conference on Learning Representations (ICLR), 2015.
[34] K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. McDaniel. On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280, 2017.
[35] C. Guo, M. Rana, M. Cisse, and L. Van Der Maaten. Countering adversarial images using input transformations. Sixth International Conference on Learning Representations (ICLR), 2018.
[36] C. Guo, J. R. Gardner, Y. You, A. G. Wilson, and K. Q. Weinberger. Simple black-box adversarial attacks. In Thirty-Sixth International Conference on Machine Learning (ICML), 2019.
[37] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[38] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 1991.
[39] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[40] G. Huang, H. Berard, A. Touati, G. Gidel, P. Vincent, and S. Lacoste-Julien. Parametric adversarial divergences are good task losses for generative modeling. Sixth International Conference on Learning Representations (ICLR), 2018.
[41] Z. Huang and T. Zhang. Black-box adversarial attack with transferable model-based embedding. Eighth International Conference on Learning Representations (ICLR), 2020.
[42] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin. Query-efficient black-box adversarial examples. arXiv preprint arXiv:1712.07113, 2017.
[43] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin. Black-box adversarial attacks with limited queries and information. Thirty-Fifth International Conference on Machine Learning (ICML), 2018.
[44] A. Ilyas, L. Engstrom, and A. Madry. Prior convictions: Black-box adversarial attacks with bandits and priors. Seventh International Conference on Learning Representations (ICLR), 2019.
[45] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. Fifth International Conference on Learning Representations (ICLR), 2017.
[46] L. Jiang, X. Ma, S. Chen, J. Bailey, and Y.-G. Jiang. Black-box adversarial attacks on video recognition models. In Proceedings of the Twenty-Seventh ACM International Conference on Multimedia, 2019.
[47] Y. LeCun et al. LeNet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet, 2015.
[48] Y. Li, S. Bai, Y. Zhou, C. Xie, Z. Zhang, and A. Yuille. Learning transferable adversarial examples via ghost networks. Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), 2018.
[49] Y. Li, L. Li, L. Wang, T. Zhang, and B. Gong. NATTACK: Learning the distributions of adversarial examples for an improved black-box attack on deep neural networks. Thirty-Sixth International Conference on Machine Learning (ICML), 2019.
[50] Y. Li, S. Bai, C. Xie, Z. Liao, X. Shen, and A. L. Yuille. Regional homogeneity: Towards learning transferable universal adversarial perturbations against defenses. European Conference on Computer Vision (ECCV), 2020.
[51] Y. Liu, X. Chen, C. Liu, and D. Song. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016.
[52] C. J. Maddison, A. Mnih, and Y. W. Teh. The Concrete distribution: A continuous relaxation of discrete random variables. Fifth International Conference on Learning Representations (ICLR), 2017.
[53] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. Sixth International Conference on Learning Representations (ICLR), 2017.
[54] J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff. On detecting adversarial perturbations. Sixth International Conference on Learning Representations (ICLR), 2017.
[55] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[56] J. Nash. Non-cooperative games. Annals of Mathematics, 1951.
[57] N. Papernot, P. McDaniel, and I. Goodfellow. Transferability in machine learning: From phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016.
[58] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. ACM, 2017.
[59] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 2019.
[60] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011.
[61] Y. Shi, S. Wang, and Y. Han. Curls & Whey: Boosting black-box adversarial attacks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[62] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. Third International Conference on Learning Representations (ICLR), 2015.
[63] Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. Sixth International Conference on Learning Representations (ICLR), 2018.
[64] L. Sun, M. Tan, and Z. Zhou. A survey of practical adversarial example attacks. Cybersecurity, 1(1):9, 2018.
[65] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. Second International Conference on Learning Representations (ICLR), 2014.
[66] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[67] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. Sixth International Conference on Learning Representations (ICLR), 2018.
[68] C.-C. Tu, P. Ting, P.-Y. Chen, S. Liu, H. Zhang, J. Yi, C.-J. Hsieh, and S.-M. Cheng. AutoZOOM: Autoencoder-based zeroth order optimization method for attacking black-box neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
[69] S. Vaswani, A. Mishkin, I. Laradji, M. Schmidt, G. Gidel, and S. Lacoste-Julien. Painless stochastic gradient: Interpolation, line-search, and convergence rates. In Advances in Neural Information Processing Systems, 2019.
[70] A. Wald. Statistical decision functions which minimize the maximum risk. Annals of Mathematics, 1945.
[71] D. Wu, Y. Wang, S.-T. Xia, J. Bailey, and X. Ma. Skip connections matter: On the transferability of adversarial examples generated with ResNets. Eighth International Conference on Learning Representations (ICLR), 2020.
[72] C. Xiao, B. Li, J.-Y. Zhu, W. He, M. Liu, and D. Song. Generating adversarial examples with adversarial networks. Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), 2018.
[73] C. Xie, Z. Zhang, Y. Zhou, S. Bai, J. Wang, Z. Ren, and A. L. Yuille. Improving transferability of adversarial examples with input diversity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[74] H. Xu, Y. Ma, H. Liu, D. Deb, H. Liu, J. Tang, and A. Jain. Adversarial attacks and defenses in images, graphs and text: A review. arXiv preprint arXiv:1909.08072, 2019.
[75] S. Zagoruyko and N. Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC). BMVA Press, September 2016.
[76] Z. Zhao, D. Dua, and S. Singh. Generating natural adversarial examples. Sixth International Conference on Learning Representations (ICLR), 2018.