# Accurate, reliable and fast robustness evaluation

Wieland Brendel (1,3), Jonas Rauber (1,2,3), Matthias Kümmerer (1,2,3), Ivan Ustyuzhaninov (1,2,3), Matthias Bethge (1,3,4)
(1) Centre for Integrative Neuroscience, University of Tübingen
(2) International Max Planck Research School for Intelligent Systems
(3) Bernstein Center for Computational Neuroscience Tübingen
(4) Max Planck Institute for Biological Cybernetics
wieland.brendel@bethgelab.org

Throughout the past five years, the susceptibility of neural networks to minimal adversarial perturbations has moved from a peculiar phenomenon to a core issue in Deep Learning. Despite much attention, however, progress towards more robust models is significantly impaired by the difficulty of evaluating the robustness of neural network models. Today's methods are either fast but brittle (gradient-based attacks), or they are fairly reliable but slow (score- and decision-based attacks). We here develop a new set of gradient-based adversarial attacks which (a) are more reliable in the face of gradient-masking than other gradient-based attacks, (b) perform better and are more query efficient than current state-of-the-art gradient-based attacks, (c) can be flexibly adapted to a wide range of adversarial criteria and (d) require virtually no hyperparameter tuning. These findings are carefully validated across a diverse set of six different models and hold for L0, L1, L2 and L∞ in both targeted as well as untargeted scenarios. Implementations will soon be available in all major toolboxes (Foolbox, CleverHans and ART). We hope that this class of attacks will make robustness evaluations easier and more reliable, thus contributing to more signal in the search for more robust machine learning models.

## 1 Introduction

Manipulating just a few pixels in an input can easily derail the predictions of a deep neural network (DNN). This susceptibility threatens deployed machine learning models and highlights a gap between human and machine perception. This phenomenon has been intensely studied since its discovery in Deep Learning [Szegedy et al., 2014], but progress has been slow [Athalye et al., 2018a]. One core issue behind this lack of progress is the shortage of tools to reliably evaluate the robustness of machine learning models. Almost all published defenses against adversarial perturbations have later been found to be ineffective [Athalye et al., 2018a]: the models just appeared robust on the surface because standard adversarial attacks failed to find the true minimal adversarial perturbations against them. State-of-the-art attacks like PGD [Madry et al., 2018] or C&W [Carlini and Wagner, 2016] may fail for a number of reasons, ranging from (1) suboptimal hyperparameters and (2) an insufficient number of optimization steps to (3) masking of the backpropagated gradients.

In this paper, we adopt ideas from the decision-based boundary attack [Brendel et al., 2018] and combine them with gradient-based estimates of the boundary. The resulting class of gradient-based attacks surpasses current state-of-the-art methods in terms of attack success, query efficiency and reliability.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
[Figure 1 shows two panels, a "Multi-step view" and a "Single-step view", plotting pixel value #1 against pixel value #2 with markers for the clean image, the starting point and the adversarial image; the single-step annotation reads: find the optimal step δ^k that (1) minimizes the distance to the clean image, (2) stays within the trust region, (3) stays within the pixel bounds and (4) stays on the decision boundary.]

Figure 1: Schematic of our approach. Consider a two-pixel input which a model either interprets as a dog (shaded region) or as a cat (white region). Given a clean dog image (solid triangle), we search for the closest image classified as a cat. Standard gradient-based attacks start somewhere near the clean image and perform gradient descent towards the boundary (left). Our attacks start from an adversarial image far away from the clean image and walk along the boundary towards the closest adversarial (middle). In each step, we solve an optimization problem to find the optimal descent direction along the boundary that stays within the valid pixel bounds and the trust region (right).

Like the decision-based boundary attack, but unlike existing gradient-based attacks, our attacks start from a point far away from the clean input and follow the boundary between the adversarial and non-adversarial region towards the clean input, Figure 1 (middle). This approach has several advantages: first, we always stay close to the decision boundary of the model, the most likely region to feature reliable gradient information. Second, instead of minimizing some surrogate loss (e.g. a weighted combination of the cross-entropy and the distance loss), we can formulate a clean quadratic optimization problem. Its solution relies on the local plane of the boundary to estimate the optimal step towards the clean input under the given Lp norm and the pixel bounds, see Figure 1 (right). Third, because we always stay close to the boundary, our method features only a single hyperparameter (the trust region) but no other trade-off parameters as in C&W or a fixed Lp norm ball as in PGD.

We tested our attacks against the current state-of-the-art in the L0, L1, L2 and L∞ metrics in two conditions (targeted and untargeted) on six different models (of which all but the vanilla ResNet-50 are defended) across three different data sets. To make all comparisons as fair as possible, we conducted a large-scale hyperparameter tuning for each attack. In almost all cases tested, we find that our attacks outperform the current state-of-the-art in terms of attack success, query efficiency and robustness to suboptimal hyperparameter settings. We hope that these improvements will facilitate progress towards robust machine learning models.

## 2 Related work

Gradient-based attacks are the most widely used tools to evaluate model robustness due to their efficiency and success rate relative to other classes of attacks with less model information (like decision-based, score-based or transfer-based attacks, see [Brendel et al., 2018]). This class includes many of the best-known attacks such as L-BFGS [Szegedy et al., 2014], FGSM [Goodfellow et al., 2015], JSMA [Papernot et al., 2016], DeepFool [Moosavi-Dezfooli et al., 2016], PGD [Kurakin et al., 2016, Madry et al., 2018], C&W [Carlini and Wagner, 2016], EAD [Chen et al., 2017] and SparseFool [Modas et al., 2019]. Nowadays, the two most important ones are PGD with a random starting point [Madry et al., 2018] and C&W [Carlini and Wagner, 2016]. They are usually considered the state of the art for L∞ (PGD) and L2 (C&W).
The other ones are either much weaker (FGSM, DeepFool) or minimize other norms, e.g. L0 (JSMA, SparseFool) or L1 (EAD). More recently, there have been some improvements to PGD that aim at making it more effective and/or more query-efficient by changing its update rule to Adam [Uesato et al., 2018] or momentum [Dong et al., 2018].

## 3 Attack algorithm

Our attacks are inspired by the decision-based boundary attack [Brendel et al., 2018] but use gradients to estimate the local boundary between adversarial and non-adversarial inputs. We will refer to this boundary as the adversarial boundary for the rest of this manuscript. In a nutshell, the attack starts from an adversarial input x̃^0 (which may be far away from the clean sample) and then follows the adversarial boundary towards the clean input x, see Figure 1 (middle). To compute the optimal step in each iteration, Figure 1 (right), we solve a quadratic trust region optimization problem. The goal of this optimization problem is to find a step δ^k such that (1) the updated perturbation x̃^k = x̃^{k-1} + δ^k has minimal Lp distance to the clean input x, (2) the squared size ‖δ^k‖_2^2 of the step is smaller than a given trust region radius r, (3) the updated perturbation stays within the box-constraints of the valid input value range (e.g. [0, 1] or [0, 255] for image inputs) and (4) the updated perturbation x̃^k is approximately placed on the adversarial boundary.

**Optimization problem** In mathematical terms, this optimization problem can be phrased as

$$\min_{\delta^k} \; \|x - \tilde{x}^{k-1} - \delta^k\|_p \quad \text{s.t.} \quad 0 \preceq \tilde{x}^{k-1} + \delta^k \preceq 1, \quad b_k^\top \delta^k = c_k, \quad \|\delta^k\|_2^2 \leq r, \tag{1}$$

where ‖·‖_p denotes the Lp norm and b_k denotes the estimate of the normal vector of the local boundary (see Figure 1) around x̃^{k-1} (see below for details). The problem can be solved for p = 0, 1, 2, ∞ with off-the-shelf solvers like ECOS [Domahidi et al., 2013] or SCS [O'Donoghue et al., 2016], but the runtime of these solvers as well as their numerical instabilities in high dimensions prohibit their use in practice. We therefore derived efficient iterative algorithms based on the dual of (1) to solve Eq. (1) for L0, L1, L2 and L∞. The additional optimization step (1) has little impact on the runtime of our attack compared to standard iterative gradient-based attacks like PGD. We report the details of the derivation and the resulting algorithms in the supplementary materials.

**Adversarial criterion** Our attacks move along the adversarial boundary to minimize the distance to the clean input. We assume that this boundary can be defined by a differentiable equality constraint adv(x̃) = 0, i.e. the manifold that defines the boundary is given by the set of inputs {x̃ | adv(x̃) = 0}. No other assumptions about the adversarial boundary are being made. Common choices for adv(·) are targeted or untargeted adversarials, defined by perturbations that switch the model prediction from the ground-truth label y to either a specified target label t (targeted scenario) or any other label t ≠ y (untargeted scenario). More precisely, let m(x̃) ∈ R^C be the class-conditional log-probabilities predicted by model m(·) on the input x̃. Then adv(x̃) = m(x̃)_y − m(x̃)_t is the criterion for targeted adversarials and adv(x̃) = min_{t, t≠y}(m_y(x̃) − m_t(x̃)) for untargeted adversarials. The direction of the boundary b_k in step k at point x̃^{k-1} is defined as the derivative of adv(·),

$$b_k = \nabla_{\tilde{x}^{k-1}} \operatorname{adv}(\tilde{x}^{k-1}). \tag{2}$$

Hence, any step δ^k for which b_k^⊤ δ^k = −adv(x̃^{k-1}) will move the perturbation x̃^k = x̃^{k-1} + δ^k onto the adversarial boundary (if the linearity assumption holds exactly). In Eq. (1), we defined c_k ≡ −adv(x̃^{k-1}) for brevity.
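To make the per-step subproblem concrete, the sketch below poses Eq. (1) directly in CVXPY and hands it to the SCS solver mentioned above. It is purely illustrative: the names x, x_prev, b, c and radius are ours, and the attack itself uses the specialized dual-based algorithms described above rather than a generic conic solver.

```python
# Illustrative sketch only: Eq. (1) posed as a generic conic program and solved with SCS.
# The attack itself uses specialized dual-based solvers; the names x, x_prev, b, c and
# radius are our own and are not part of the paper's implementation.
import cvxpy as cp
import numpy as np

def trust_region_step(x, x_prev, b, c, radius, p=2):
    """Solve Eq. (1) for a convex norm (p in {1, 2, 'inf'}): find the step delta that
    minimizes ||x - x_prev - delta||_p subject to the pixel bounds, the linearized
    boundary constraint b^T delta = c and the trust region ||delta||_2^2 <= radius."""
    delta = cp.Variable(x.size)
    objective = cp.Minimize(cp.norm(x - x_prev - delta, p))
    constraints = [
        x_prev + delta >= 0.0,             # stay inside the valid pixel range [0, 1]
        x_prev + delta <= 1.0,
        b @ delta == c,                    # stay on the (locally linearized) adversarial boundary
        cp.sum_squares(delta) <= radius,   # trust region
    ]
    cp.Problem(objective, constraints).solve(solver=cp.SCS)
    return delta.value

# Toy two-pixel example in the spirit of Figure 1:
x = np.array([0.8, 0.2])               # clean input
x_prev = np.array([0.2, 0.9])          # current point on the adversarial boundary
b, c = np.array([1.0, -1.0]), 0.0      # local boundary normal and offset (c = -adv(x_prev))
print(trust_region_step(x, x_prev, b, c, radius=0.05))
```

For p = 0 the objective is not convex, so a generic conic solver does not apply directly; the dedicated iterative algorithms derived in the supplementary material cover all four norms.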
Finally, we note that in the targeted and untargeted scenarios, we compute gradients for the same loss found to be most effective in Carlini and Wagner [2016]. In our case, this loss is naturally derived from a geometric perspective of the adversarial boundary.

**Starting point** The algorithm always starts from a point x̃^0 that is typically far away from the clean image and lies in the adversarial region. There are several straightforward ways to find such starting points, e.g. by (1) sampling random noise inputs, (2) choosing a real sample that is part of the adversarial region (e.g. is classified as a given target class) or (3) choosing the output of another adversarial attack. In all experiments presented in this paper, we choose the starting point as the closest sample (in terms of the L2 norm) to the clean input which was classified differently (in untargeted settings) or classified as the desired target class (in targeted settings) by the given model. After finding a suitable starting point, we perform a binary search with a maximum of 10 steps between the clean input and the starting point to find the adversarial boundary. From this point, we perform an iterative descent along the boundary towards the clean input. Algorithm 1 provides a compact summary of the attack procedure.

Algorithm 1: Schematic of our attacks.
Data: clean input x, differentiable adversarial criterion adv(·), adversarial starting point x̃^0
Result: adversarial example x̃^k such that the distance d(x, x̃^k) = ‖x − x̃^k‖_p is minimized
begin
    k ← 0, b_0 ← 0
    if no x̃^0 is given: x̃^0 ~ U(0, 1) s.t. x̃^0 is adversarial (or sample from the adversarial class)
    while k < maximum number of steps do
        b_k ← ∇ adv(x̃^{k-1})    // estimate local geometry of adversarial boundary
        c_k ← −adv(x̃^{k-1})     // estimate distance to adversarial boundary
        δ^k ← solve optimization problem Eq. (1) for the given Lp norm
        x̃^k ← x̃^{k-1} + δ^k
        k ← k + 1
    end
end
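To make Algorithm 1 more tangible, here is a minimal, self-contained sketch of the L2 variant in PyTorch. It is not the paper's implementation: the model interface, the function names and, in particular, the simplified two-stage step (restore the boundary constraint first, then spend the remaining trust-region budget moving towards the clean input) are our own stand-ins for the exact solver of Eq. (1).

```python
# Minimal sketch of Algorithm 1 for the L2 case (assumptions: a PyTorch `model` mapping a
# batched input in [0, 1] to logits; the exact solver of Eq. (1) is replaced by a simple
# two-stage approximation; all names below are ours, not the paper's implementation).
import torch

def adv_criterion(model, x, label, target=None):
    """adv(x): zero on the adversarial boundary, negative inside the adversarial region."""
    logits = model(x.unsqueeze(0)).squeeze(0)
    if target is not None:                           # targeted scenario
        return logits[label] - logits[target]
    mask = torch.ones_like(logits, dtype=torch.bool)
    mask[label] = False
    return logits[label] - logits[mask].max()        # untargeted scenario

def approx_l2_step(x, x_prev, b, c, radius):
    """Approximate Eq. (1) for p = 2: first restore the boundary constraint b.delta = c with
    the smallest possible step, then move towards the clean input x inside the boundary plane
    until the trust region ||delta||_2^2 <= radius is exhausted; finally clip to [0, 1]."""
    delta = (c / (b @ b)) * b
    budget = radius - delta @ delta
    v = (x - x_prev) - ((b @ (x - x_prev)) / (b @ b)) * b   # in-plane direction towards x
    if budget > 0 and v.norm() > 0:
        delta = delta + min(v.norm(), budget.sqrt()) * v / v.norm()
    return torch.clamp(x_prev + delta, 0.0, 1.0) - x_prev

def run_attack(model, x, label, x_start, steps=1000, radius=1e-2, target=None):
    is_adv = lambda z: adv_criterion(model, z, label, target).item() < 0
    lo, hi = 0.0, 1.0                                # binary search towards the boundary
    for _ in range(10):
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if is_adv((1 - mid) * x + mid * x_start) else (mid, hi)
    xk = (1 - hi) * x + hi * x_start
    for _ in range(steps):                           # walk along the boundary towards x
        xk = xk.detach().requires_grad_(True)
        crit = adv_criterion(model, xk, label, target)
        b = torch.autograd.grad(crit, xk)[0].flatten()      # b_k, Eq. (2)
        c = -crit.item()                                    # c_k = -adv(x^{k-1})
        delta = approx_l2_step(x.flatten(), xk.detach().flatten(), b, c, radius)
        xk = (xk.detach().flatten() + delta).view_as(x)
    return xk
```

In the released attacks, this per-step subproblem is solved exactly, and for L0, L1 and L∞ as well, via the dual-based algorithms detailed in the supplementary material.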
We extensively compare the proposed attack against current state-of-the-art attacks in a range of different scenarios. This includes six different models (varying in model architecture, defense mechanism and data set), two different adversarial categories (targeted and untargeted) and four different metrics (L0, L1, L2 and L∞). In addition, we perform a large-scale hyperparameter tuning for all attacks we compare against in order to be as fair as possible. The full analysis pipeline is built on top of Foolbox [Rauber et al., 2017].

**Attacks** We compare against several attacks which are considered to be state-of-the-art in L0, L1, L2 and L∞:

- Projected Gradient Descent (PGD) [Madry et al., 2018]. Iterative gradient attack that optimizes L∞ by minimizing a cross-entropy loss under a fixed L∞ norm constraint enforced in each step.
- Projected Gradient Descent with Adam (Adam PGD) [Uesato et al., 2018]. Same as PGD but with the Adam optimizer for the update steps.
- C&W [Carlini and Wagner, 2016]. L2 iterative gradient attack that relies on the Adam optimizer, a tanh-nonlinearity to respect pixel-constraints and a loss function that weighs a classification loss against the distance metric to be minimized.
- Decoupling Direction and Norm Attack (DDN) [Rony et al., 2018]. L2 iterative gradient attack pitched as a query-efficient alternative to the C&W attack that requires less hyperparameter tuning.
- EAD [Chen et al., 2018]. Variation of C&W adapted for elastic net metrics. We run the attack with a high regularization value (1e−2) to approach the optimal L1 performance.
- Saliency-Map Attack (JSMA) [Papernot et al., 2016]. L0/L1 attack that iterates over saliency maps to discover pixels with the highest potential to change the decision of the classifier.
- SparseFool [Modas et al., 2019]. A sparse version of DeepFool, which uses a local linear approximation of the geometry of the adversarial boundary to estimate the optimal step towards the boundary.

**Models** We test all attacks on all models regardless of whether the models have been specifically defended against the distance metric the attacks are optimizing. The sole goal is to evaluate all attacks on a maximally broad set of different models to ensure their wide applicability. For all models, we used the official implementations of the authors as available in the Foolbox model zoo [Rauber et al., 2017].

- Madry-MNIST [Madry et al., 2018]: Adversarially trained model on MNIST. Claim: 89.62% (L∞ perturbation ≤ 0.3). Best third-party evaluation: 88.42% [Wang et al., 2018].
- Madry-CIFAR [Madry et al., 2018]: Adversarially trained model on CIFAR-10. Claim: 47.04% (L∞ perturbation ≤ 8/255). Best third-party evaluation: 44.71% [Zheng et al., 2018].
- Distillation [Papernot et al., 2015]: Defense (MNIST) with increased softmax temperature. Claim: 99.06% (L0 perturbation ≤ 112). Best third-party evaluation: 3.6% [Carlini and Wagner, 2016].
- Logitpairing [Kannan et al., 2018]: Variant of adversarial training on downscaled ImageNet (64 x 64 pixels) using the logit vector instead of the cross-entropy. Claim: 27.9% (L∞ perturbation ≤ 16/255). Best third-party evaluation: 0.6% [Engstrom et al., 2018].
- Kolter & Wong [Kolter and Wong, 2017]: Provable defense that considers a convex outer approximation of the possible hidden activations within an Lp ball to optimize a worst-case adversarial loss over this region. MNIST claim: 94.2% (L∞ perturbations ≤ 0.1).
- ResNet-50 [He et al., 2016]: Standard vanilla ResNet-50 model trained on ImageNet that reaches 50% accuracy for L2 perturbations ≤ 1 × 10⁻⁷ [Brendel et al., 2018].

**Adversarial categories** We test all attacks in two common attack scenarios: untargeted and targeted attacks. In other words, perturbed inputs are classified as adversarials if they are classified differently from the ground-truth label (untargeted) or are classified as a given target class (targeted).

**Hyperparameter tuning** We ran all attacks on each model/attack combination and each sample with five repetitions and a large range of potentially interesting hyperparameter settings, resulting in between one (SparseFool) and 96 (C&W) hyperparameter settings tested for each attack. In the appendix we list all hyperparameters and hyperparameter ranges for each attack.

**Evaluation** The success of an L∞ attack is typically quantified as the attack success rate within a given L∞ norm ball. In other words, the attack is allowed to perturb the clean input with a maximum L∞ norm of ε, and one measures the classification accuracy of the model on the perturbed inputs. The smaller the classification accuracy, the better the attack performed. PGD [Madry et al., 2018] and Adam PGD [Uesato et al., 2018] are highly adapted to this scenario and expect ε as an input. This contrasts with most L0, L1 and L2 attacks like C&W [Carlini and Wagner, 2016] or SparseFool [Modas et al., 2019], which are designed to find minimal adversarial perturbations. In such scenarios, it is more natural to measure the success of an attack as the median over the adversarial perturbation sizes across all tested samples [Schott et al., 2019]. The smaller the median perturbations, the better the attack.
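As a small illustration of these two evaluation schemes, both quantities are easy to compute once each attack has reported the smallest adversarial perturbation it found per sample; the array names and toy numbers below are ours, not results from the paper.

```python
# Sketch of the two evaluation schemes. `sizes` holds, per test sample, the smallest
# adversarial perturbation an attack found (np.inf if it found none); the arrays and the
# numbers below are made up for illustration and are not results from the paper.
import numpy as np

def median_perturbation(sizes):
    """Median adversarial perturbation size over all samples (L0/L1/L2 scheme): lower is better."""
    return float(np.median(sizes))

def accuracy_within_ball(sizes, eps):
    """Model accuracy when the attacker may perturb each input by at most eps (L-infinity
    scheme): a sample counts as still correctly classified if no adversarial within the
    eps-ball was found. Lower accuracy means a stronger attack."""
    return float(np.mean(np.asarray(sizes) > eps))

l2_sizes = np.array([1.2, 0.9, np.inf, 1.6])       # e.g. minimal L2 perturbations per sample
linf_sizes = np.array([0.28, 0.41, 0.30, 0.12])    # e.g. minimal L-infinity perturbations
print(median_perturbation(l2_sizes))               # -> 1.4
print(accuracy_within_ball(linf_sizes, eps=0.33))  # -> 0.25
```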
Our attacks also seek minimal adversarials and thus lend themselves to both evaluation schemes. To make the comparison to the current state-of-the-art as fair as possible, we adopt the success rate criterion on L∞ and the median perturbation distance on L0, L1 and L2. All results reported have been evaluated on 1000 validation samples. For the L∞ evaluation, we chose ε for each model and each attack scenario such that the best attack performance reaches roughly 50% accuracy. This makes it easier to compare the performance of different attacks (compared to thresholds at which model accuracy is close to zero or close to clean performance). We chose ε = 0.33, 0.15, 0.1, 0.03, 0.0015, 0.0006 in the untargeted and ε = 0.35, 0.2, 0.15, 0.06, 0.04, 0.002 in the targeted scenario for Madry-MNIST, Kolter & Wong, Distillation, Madry-CIFAR, Logitpairing and ResNet-50, respectively.

## 5.1 Attack success

In both targeted as well as untargeted attack scenarios, our attacks surpass the current state-of-the-art on every single model we tested, see Table 1 (untargeted) and Table 2 (targeted), with Logitpairing in the targeted L∞ scenario being the only exception. While the gains are small on some model/metric combinations like Distillation or Madry-CIFAR on L2, we reach quite substantial gains on many others: on Madry-MNIST, our untargeted L2 attack reaches median perturbation sizes of 1.15 compared to 3.24 for C&W. In the targeted scenario, the difference is even more pronounced (1.70 vs 4.79). On L∞, our attack further reduces the model accuracy by 0.1% to 14.0% relative to PGD. On L1 and L0 our gains are particularly drastic: while the Saliency Map attack and SparseFool often fail on the defended models, our attack reaches close to 100% attack success on all models while finding adversarials that are up to one to two orders of magnitude smaller. Even the current state-of-the-art on L1, EAD [Chen et al., 2018], is up to a factor of six worse than our attack. Adversarial examples produced by our attacks are visualized in Figure 2 (for L2 and L∞) and in the supplementary material (for L1 and L0).

[Figure 2 shows, for each model, rows of clean images, adversarial images and their difference; panels: Madry-MNIST (L∞), Madry-CIFAR (L∞), Logit Pairing (L∞), Kolter & Wong (L2), Distillation (L2) and ResNet-50 (L2).]

Figure 2: Randomly selected adversarial examples found by our L2 and L∞ attacks for each model. The top part shows adversarial examples that minimize the L∞ norm while the bottom part shows adversarial examples that minimize the L2 norm. Adversarial examples optimized with our L0 and L1 attacks are displayed in the appendix.

| Attack (metric) | Madry-MNIST (MNIST) | K&W (MNIST) | Distillation (MNIST) | Madry-CIFAR (CIFAR-10) | LP (ImageNet) | ResNet-50 (ImageNet) |
|---|---|---|---|---|---|---|
| PGD (L∞) | 60.1% | 76.5% | 32.1% | 57.1% | 53.3% | 51.0% |
| Adam PGD (L∞) | 53.4% | 72.5% | 31.3% | 57.1% | 53.5% | 50.2% |
| Ours-L∞ | 49.1% | 69.5% | 31.2% | 57.0% | 42.5% | 37.0% |
| C&W (L2) | 3.24 | 2.78 | 1.09 | 0.75 | 0.10 | 0.14 |
| DDN (L2) | 1.59 | 1.95 | 1.07 | 0.73 | 0.15 | 0.24 |
| Ours-L2 | 1.15 | 1.62 | 1.07 | 0.72 | 0.09 | 0.13 |
| EAD (L1) | 0.01931 | 0.02346 | 0.00768 | 0.00285 | 0.00013 | – |
| SparseFool (L1) | 0.11393 | 0.32114 | 0.48129 | 0.47687 | 0.49915 | – |
| Saliency Map (L1) | 0.04114 | 0.03730 | 0.02482 | 0.00292 | 0.00297 | – |
| Ours-L1 | 0.00377 | 0.00707 | 0.00698 | 0.00116 | 0.00008 | – |
| SparseFool (L0) | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | – |
| Saliency Map (L0) | 0.22832 | 0.14732 | 0.08291 | 0.03483 | 0.00647 | – |
| Ours-L0 | 0.07143 | 0.06250 | 0.00765 | 0.00228 | 0.00024 | – |

Table 1: Attack success in the untargeted scenario. Model accuracies (L∞ block) and median adversarial perturbation distances (all other blocks) in untargeted attack scenarios. Smaller is better. SparseFool and Saliency Map attacks did not always find sufficiently many adversarials to compute an overall score.
[Figure 3 shows one panel per model (Madry-MNIST, Kolter & Wong, Distillation, Madry-CIFAR, Logitpairing, ResNet-50), each with untargeted and targeted curves.]

Figure 3: Query-success curves for all model/attack combinations in the targeted and untargeted scenario for the L2 and L∞ metrics (see supplementary information for the L0 and L1 metrics). Each curve shows the attack success either in terms of model accuracy (for L∞, left part) or median adversarial perturbation size (for L2, right part) over the number of queries to the model. In both cases, lower is better. For each point on the curve, we selected the optimal hyperparameter. If no line is shown, the attack success was lower than 50%. For all other points with less than 99% attack success, the line is drawn 50% transparent.

| Attack (metric) | Madry-MNIST (MNIST) | K&W (MNIST) | Distillation (MNIST) | Madry-CIFAR (CIFAR-10) | LP (ImageNet) | ResNet-50 (ImageNet) |
|---|---|---|---|---|---|---|
| PGD (L∞) | 65.6% | 46.4% | 53.2% | 39.7% | 1.5% | 47.4% |
| Adam PGD (L∞) | 59.8% | 40.2% | 52.3% | 39.9% | 0.6% | 44.1% |
| Ours-L∞ | 56.0% | 39.8% | 50.5% | 37.6% | 0.9% | 37.0% |
| C&W (L2) | 4.79 | 4.06 | 2.09 | 1.20 | 0.70 | 0.44 |
| DDN (L2) | 2.22 | 2.89 | 2.09 | 1.19 | 0.58 | 0.64 |
| Ours-L2 | 1.70 | 2.31 | 2.05 | 1.16 | 0.51 | 0.40 |
| EAD (L1) | 0.03648 | 0.04019 | 0.01808 | 0.00698 | 0.00221 | – |
| Ours-L1 | 0.00499 | 0.00904 | 0.00925 | 0.00146 | 0.00085 | – |
| Ours-L0 | 0.08929 | 0.07908 | 0.01020 | 0.00293 | 0.01147 | – |

Table 2: Attack success in the targeted scenario. Model accuracies (L∞ block) and median adversarial perturbation distances (all other blocks) in targeted attack scenarios. Smaller is better. SparseFool and Saliency Map attacks did not always find sufficiently many adversarials to compute an overall score: SparseFool reports no targeted score for any model, and the Saliency Map attack produced targeted scores for only a subset of models (L1: 0.05740, 0.03160, 0.00872; L0: 0.13074, 0.17793, 0.12117, 0.04036).

Figure 4: Sensitivity of our method to the number of repetitions and suboptimal hyperparameters.

## 5.2 Query efficiency

On L2, our attack is significantly more query efficient than C&W and at least on par with DDN, see the query-distortion curves in Figure 3. Each curve represents the maximal attack success (either in terms of model accuracy or median perturbation size) as a function of the query budget. For each query (i.e. each point of the curve) and each model, we select the optimal hyperparameter. This ensures that we tease out how well each attack can perform in limited-query scenarios. We find that our L2 attack generally requires only about 10 to 20 queries to get close to convergence while C&W often needs several hundred iterations. Our attack performs particularly well on adversarially trained models like Madry-MNIST. Similarly, our L∞ attack generally surpasses PGD and Adam PGD in terms of attack success after around 10 queries. The first few queries are typically required by our attack to find a suitable initial point on the adversarial boundary. This gives PGD a slight advantage at the very beginning.
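To spell out how such a best-over-hyperparameters query-distortion curve can be assembled from raw attack logs, here is a small sketch; the array layout, names and toy numbers are our assumptions, not the paper's evaluation code.

```python
# Sketch of how a query-distortion curve could be assembled (array layout and names are our
# assumptions). `dists[h, s, q]` is the size of the best adversarial that hyperparameter
# setting h has found for sample s within the first q+1 queries (np.inf if none yet).
import numpy as np

def query_distortion_curve(dists):
    """For every query budget, report the median perturbation size over samples under the
    best hyperparameter setting for that budget (lower is better)."""
    best_so_far = np.minimum.accumulate(dists, axis=2)   # enforce monotonicity over queries
    per_setting = np.median(best_so_far, axis=1)         # median over samples, per setting
    return per_setting.min(axis=0)                       # best hyperparameter per budget

rng = np.random.default_rng(0)
toy = rng.uniform(0.5, 2.0, size=(3, 100, 50))           # 3 settings, 100 samples, 50 queries
print(query_distortion_curve(toy)[:5])
```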
## 5.3 Hyperparameter robustness

In Figure 4, we show the results of an ablation study on L2 and L∞. In the full case (8 params + 5 reps), we run all our attacks as well as C&W and PGD with all hyperparameter values and with five repetitions for 1000 steps on each sample and model. We then choose the smallest adversarial input across all hyperparameter values and all repetitions. This is the baseline we compare all ablations against. The results are as follows:

- Like PGD or C&W, our attacks experience only a 4% performance drop if a single hyperparameter is used instead of eight.
- Our attacks experience around a 15% - 19% drop in performance for a single hyperparameter and only one instead of five repetitions, similar to PGD and C&W.
- We can even choose the same trust region hyperparameter across all models with no further drop in performance. C&W, in comparison, experiences a further 16% drop in performance, meaning it is more sensitive to per-model hyperparameter tuning.
- Our attack is extremely insensitive to suboptimal hyperparameter tuning: changing the optimal trust region two orders of magnitude up or down changes performance by less than 15%. In comparison, just one order of magnitude deteriorates C&W performance by almost 50%. Larger deviations from the optimal learning rate disarm C&W completely. PGD is less sensitive than C&W but still experiences large drops if the learning rate gets too small.

## 6 Discussion & Conclusion

An important obstacle slowing down the search for robust machine learning models is the lack of reliable evaluation tools: out of roughly two hundred defenses proposed and evaluated in the literature, less than a handful are widely accepted as being effective. A more reliable evaluation of adversarial robustness has the potential to more clearly distinguish effective defenses from ineffective ones, thus providing more signal and thereby accelerating progress towards robust models.

In this paper, we introduced a novel class of gradient-based attacks that outperforms the current state-of-the-art in terms of attack success, query efficiency and reliability on L0, L1, L2 and L∞. By moving along the adversarial boundary, our attacks stay in a region with fairly reliable gradient information. Other methods like C&W, which move through regions far away from the boundary, might get stuck due to obfuscated gradients, a common issue for robustness evaluation [Athalye et al., 2018b]. Further extensions to other metrics (e.g. elastic net) are possible as long as the optimization problem Eq. (1) can be solved efficiently. Extensions to other adversarial criteria are trivial as long as the boundary between the adversarial and the non-adversarial region can be described by a differentiable equality constraint. This makes the attack more suitable to scenarios other than targeted or untargeted classification tasks. Taken together, our methods set a new standard for adversarial attacks that is useful for practitioners and researchers alike to find more robust machine learning models.

Acknowledgments

This work has been funded, in part, by the German Federal Ministry of Education and Research (BMBF) through the Bernstein Computational Neuroscience Program Tübingen (FKZ: 01GQ1002) as well as the German Research Foundation (DFG CRC 1233 on "Robust Vision") and the Tübingen AI Center (FKZ: 01IS18039A). The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting J.R., M.K. and I.U.; J.R. acknowledges support by the Bosch Forschungsstiftung (Stifterverband, T113/30057/17); M.B. acknowledges support by the Centre for Integrative Neuroscience Tübingen (EXC 307); W.B. and M.B. were supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC00003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.

## References

Anish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. CoRR, abs/1802.00420, 2018a. URL http://arxiv.org/abs/1802.00420.

Anish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. CoRR, abs/1802.00420, 2018b. URL http://arxiv.org/abs/1802.00420.

W. Brendel, J. Rauber, and M. Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. In International Conference on Learning Representations, 2018. URL https://arxiv.org/abs/1712.04248.

Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. CoRR, abs/1608.04644, 2016. URL http://arxiv.org/abs/1608.04644.

Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. EAD: elastic-net attacks to deep neural networks via adversarial examples. arXiv preprint arXiv:1709.04114, 2017.

Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. EAD: elastic-net attacks to deep neural networks via adversarial examples. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 10–17, 2018.

Alexander Domahidi, Eric Chun-Pu Chu, and Stephen P. Boyd. ECOS: An SOCP solver for embedded systems. In 2013 European Control Conference (ECC), pages 3071–3076, 2013.

Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

Logan Engstrom, Andrew Ilyas, and Anish Athalye. Evaluating and understanding the robustness of adversarial logit pairing. CoRR, abs/1807.10272, 2018. URL http://arxiv.org/abs/1807.10272.

Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6572.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016. doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90.

Harini Kannan, Alexey Kurakin, and Ian J. Goodfellow. Adversarial logit pairing. CoRR, abs/1803.06373, 2018. URL http://arxiv.org/abs/1803.06373.

J. Zico Kolter and Eric Wong. Provable defenses against adversarial examples via the convex outer adversarial polytope. CoRR, abs/1711.00851, 2017. URL http://arxiv.org/abs/1711.00851.

Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.

Apostolos Modas, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. SparseFool: a few pixels make a big difference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

B. O'Donoghue, E. Chu, N. Parikh, and S. Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, June 2016. URL http://stanford.edu/~boyd/papers/scs.html.

Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pages 372–387, March 2016. doi: 10.1109/EuroSP.2016.36.

Nicolas Papernot, Patrick D. McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. CoRR, abs/1511.04508, 2015. URL http://arxiv.org/abs/1511.04508.

Jonas Rauber, Wieland Brendel, and Matthias Bethge. Foolbox v0.8.0: A Python toolbox to benchmark the robustness of machine learning models. CoRR, abs/1707.04131, 2017. URL http://arxiv.org/abs/1707.04131.

Jérôme Rony, Luiz G. Hafemann, Luis S. Oliveira, Ismail Ben Ayed, Robert Sabourin, and Eric Granger. Decoupling direction and norm for efficient gradient-based L2 adversarial attacks and defenses. arXiv preprint arXiv:1811.09600, 2018.

Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. Towards the first adversarially robust neural network model on MNIST. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1EHOsC9tX.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. URL http://arxiv.org/abs/1312.6199.

Jonathan Uesato, Brendan O'Donoghue, Aaron van den Oord, and Pushmeet Kohli. Adversarial risk and the dangers of evaluating against weak attacks. arXiv preprint arXiv:1802.05666, 2018.

Shiqi Wang, Yizheng Chen, Ahmed Abdou, and Suman Jana. MixTrain: Scalable training of formally robust neural networks. arXiv preprint arXiv:1811.02625, 2018.

Tianhang Zheng, Changyou Chen, and Kui Ren. Distributionally adversarial attack. CoRR, abs/1808.05537, 2018. URL http://arxiv.org/abs/1808.05537.