Under review as a conference paper at ICLR 2019

ARE ADVERSARIAL EXAMPLES INEVITABLE?

Anonymous authors
Paper under double-blind review

A wide range of defenses have been proposed to harden neural networks against adversarial attacks. However, a pattern has emerged in which the majority of adversarial defenses are quickly broken by new attacks. Given the lack of success at generating robust defenses, we are led to ask a fundamental question: Are adversarial attacks inevitable? This paper analyzes adversarial examples from a theoretical perspective and identifies fundamental bounds on the susceptibility of a classifier to adversarial attacks. We show that, for certain classes of problems, adversarial examples are inescapable. Using experiments, we explore the implications of theoretical guarantees for real-world problems and discuss how factors such as dimensionality and image complexity limit a classifier's robustness against adversarial examples.

1 INTRODUCTION

A number of adversarial attacks on neural networks have been recently proposed. To counter these attacks, a number of authors have proposed a range of defenses. However, these defenses are often quickly broken by new and revised attacks. Given the lack of success at generating robust defenses, we are led to ask a fundamental question: Are adversarial attacks inevitable?

In this paper, we identify a broad class of problems for which adversarial examples cannot be avoided. We also derive fundamental limits on the susceptibility of a classifier to adversarial attacks that depend on properties of the data distribution as well as the dimensionality of the dataset.

Adversarial examples occur when a small perturbation to an image changes its class label. There are different ways of measuring what it means for a perturbation to be "small"; as such, our analysis considers a range of different norms. While the ℓ∞-norm is commonly used, adversarial examples can be crafted in any ℓp-norm (see Figure 1). We will see that the choice of norm can have a dramatic effect on the strength of theoretical guarantees for the existence of adversarial examples. Our analysis also extends to the ℓ0-norm, which yields sparse adversarial examples that only perturb a small subset of image pixels (Figure 2).

[Figure 1 panel labels: egyptian cat (28%); traffic light (97%), ℓ∞-norm = 0.05; traffic light (96%); ℓ0-norm = 5000 (sparse), traffic light (80%).]
Figure 1: Adversarial examples with different norm constraints formed via the projected gradient method (Madry et al., 2017) on ResNet50, along with the distance between the base image and the adversarial example, and the top class label.

As a simple example result, consider a classification problem with n-dimensional images with pixels scaled between 0 and 1 (in this case images live inside the unit hypercube). If the image classes each occupy a fraction of the cube greater than ½ exp(−πϵ²), then images exist that are susceptible to adversarial perturbations of ℓ2-norm at most ϵ. Note that ϵ = 10 was used in Figure 1, and larger values are typical for larger images.

Finally, in Section 8, we explore the causes of adversarial susceptibility in real datasets and the effect of dimensionality. We present an example image class for which there is no fundamental link between dimensionality and robustness, and argue that the data distribution, and not dimensionality, is the primary cause of adversarial susceptibility.
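The attacks in Figure 1 were generated with the projected gradient method of Madry et al. (2017) on ResNet50. We do not reproduce that setup here, but the following minimal sketch illustrates the projected gradient idea on a toy softmax-linear classifier; the model, step size, iteration count, and budget are arbitrary illustrative choices and are not taken from the paper's experiments.

```python
# Minimal sketch of a projected gradient attack (in the spirit of Madry et al., 2017)
# on a toy softmax-linear classifier. All values here are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 10                          # input dimension, number of classes
W, b = rng.normal(size=(m, n)), rng.normal(size=m)

def loss_and_grad(x, y):
    """Cross-entropy loss of the linear model and its gradient with respect to the input x."""
    logits = W @ x + b
    p = np.exp(logits - logits.max()); p /= p.sum()
    loss = -np.log(p[y] + 1e-12)
    grad = W.T @ (p - np.eye(m)[y])    # d(loss)/dx for softmax + cross-entropy
    return loss, grad

def pgd_linf(x0, y, eps=0.05, step=0.01, iters=40):
    """Ascend the loss, projecting onto the l_inf ball of radius eps and the cube [0,1]^n."""
    x = x0.copy()
    for _ in range(iters):
        _, g = loss_and_grad(x, y)
        x = x + step * np.sign(g)              # gradient-sign ascent step
        x = np.clip(x, x0 - eps, x0 + eps)     # project onto the eps-ball around x0
        x = np.clip(x, 0.0, 1.0)               # keep pixels in [0, 1]
    return x

x0 = rng.uniform(size=n)
y0 = int(np.argmax(W @ x0 + b))                # label assigned by the classifier
x_adv = pgd_linf(x0, y0)
print("perturbation l_inf norm:", np.abs(x_adv - x0).max())
print("top label changed:", int(np.argmax(W @ x_adv + b)) != y0)
```

Replacing the sign step and the projection with their ℓ2 or sparse analogues gives variants of the same idea in the other norms discussed below.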
1.1 BACKGROUND: A BRIEF HISTORY OF ADVERSARIAL EXAMPLES

Adversarial examples, first demonstrated in Szegedy et al. (2013) and Biggio et al. (2013), change the label of an image using small and often imperceptible perturbations to its pixels. A number of defenses have been proposed to harden networks against attacks, but historically, these defenses have been quickly broken. Adversarial training, one of the earliest defenses, successfully thwarted the fast gradient sign method (FGSM) (Goodfellow et al., 2014), one of the earliest and simplest attacks. However, adversarial training with FGSM examples was quickly shown to be vulnerable to more sophisticated multi-stage attacks (Kurakin et al., 2016; Tramèr et al., 2017a). More sophisticated defenses that rely on network distillation (Papernot et al., 2016b) and specialized activation functions (Zantedeschi et al., 2017) were also toppled by strong attacks (Papernot et al., 2016a; Tramèr et al., 2017b; Carlini & Wagner, 2016; 2017a).

The ongoing vulnerability of classifiers was highlighted in recent work by Athalye et al. (2018) and Athalye & Sutskever (2017), which broke an entire suite of defenses presented at ICLR 2018, including thermometer encoding (Buckman et al., 2018), detection using local intrinsic dimensionality (Ma et al., 2018), input transformations such as compression and image quilting (Guo et al., 2017), stochastic activation pruning (Dhillon et al., 2018), adding randomization at inference time (Xie et al., 2017), enhancing the confidence of image labels (Song et al., 2017), and using a generative model as a defense (Samangouei et al., 2018).

Rather than hardening classifiers to attacks, some authors have proposed sanitizing datasets to remove adversarial perturbations before classification. Approaches based on auto-encoders (Meng & Chen, 2017) and GANs (Shen et al., 2017) were broken using optimization-based attacks (Carlini & Wagner, 2017b;a).

A number of certifiable defense mechanisms have been developed for certain classifiers. Raghunathan et al. (2018) harden a two-layer classifier using semidefinite programming, and Sinha et al. (2018) propose a convex duality-based approach to adversarial training that works on sufficiently small adversarial perturbations with a quadratic adversarial loss. Kolter & Wong (2017) consider training a robust classifier using the convex outer adversarial polytope. All of these methods only consider robustness of the classifier on the training set, and robustness properties often fail to generalize reliably to test examples.

One place where researchers have enjoyed success is at training classifiers on low-dimensional datasets like MNIST (Madry et al., 2017; Sinha et al., 2018). The robustness achieved on more complicated datasets such as CIFAR-10 and ImageNet is nowhere near that of MNIST, which leads some researchers to speculate that adversarial defense is fundamentally harder in higher dimensions, an issue we address in Section 8.

[Figure 2 panels: original image, ℓ∞-norm perturbation, and sparse ℓ0-norm perturbation.]
Figure 2: Sparse adversarial examples perturb a small subset of pixels and can hide adversarial "fuzz" inside high-frequency image regions. The original image (left) is classified as an ox. Under ℓ∞-norm perturbations, it is classified as "traffic light", but the perturbations visibly distort smooth regions of the image (the sky). These effects are hidden in the grass using ℓ0-norm (sparse) perturbations limited to a small subset of pixels.
This paper uses well-known results from high-dimensional geometry, specifically isoperimetric inequalities, to provide bounds on the robustness of classifiers. Several other authors have investigated adversarial susceptibility through the lens of geometry. Fawzi et al. (2018) study adversarial susceptibility of datasets under the assumption that they are produced by a generative model that maps random Gaussian vectors onto images. Gilmer et al. (2018) do a detailed case study, including empirical and theoretical results, of classifiers for a synthetic dataset that lies on two concentric spheres. Simon-Gabriel et al. (2018) show that the Lipschitz constant of untrained networks with random weights gets large in high dimensions. Shortly after the original appearance of our work, Mahloujifar et al. (2018) presented a study of adversarial susceptibility that included both evasion and poisoning attacks. Our work is distinct in that it studies adversarial robustness for arbitrary data distributions, and also in that it rigorously examines the effect of dimensionality on robustness limits.

1.2 NOTATION

We use [0, 1]^n to denote the unit hypercube in n dimensions, and vol(A) to denote the volume (i.e., n-dimensional Lebesgue measure) of a subset A ⊂ [0, 1]^n. We use S^{n−1} = {x ∈ R^n : ‖x‖_2 = 1} to denote the unit sphere embedded in R^n, and s_{n−1} to denote its surface area. The size of a subset A ⊂ S^{n−1} can be quantified by its (n−1)-dimensional measure µ1[A], which is simply the surface area the set covers. Because the surface area of the unit sphere varies greatly with n, it is much easier in practice to work with the normalized measure, which we denote µ1[A] = µ[A]/s_{n−1}. This normalized measure has the property that µ1[S^{n−1}] = 1, and so we can interpret µ1[A] as the probability of a uniform random point from the sphere lying in A.

When working with points on a sphere, we often use geodesic distance, which is always somewhat larger than (but comparable to) the Euclidean distance. In the cube, we measure distance between points using ℓp-norms, which are denoted

  ‖z‖_p = ( Σ_i |z_i|^p )^{1/p}  if p > 0,  and  ‖z‖_0 = card{i | z_i ≠ 0}.

Note that ‖·‖_p is not truly a norm for p < 1, but rather a semi-norm. Such metrics are still commonly used, particularly the ℓ0-norm, which counts the number of non-zero entries in a vector.

2 PROBLEM SETUP

We consider the problem of classifying data points that lie in a space Ω (either a sphere or a hypercube) into m different object classes. The m object classes are defined by probability density functions {ρc}, c = 1, ..., m, where ρc : Ω → R. A random point from class c is a random variable with density ρc. We assume ρc to be bounded (i.e., we don't allow delta functions or other generalized functions), and denote its upper bound by Uc = sup_x ρc(x).

We also consider a classifier function C : Ω → {1, 2, ..., m} that partitions Ω into disjoint measurable subsets, one for each class label. The classifier we consider is discrete-valued: it provides a label for each data point but not a confidence level. With this setup, we can give a formal definition of an adversarial example.

Definition 1. Consider a point x ∈ Ω drawn from class c, a scalar ϵ > 0, and a metric d. We say that x admits an ϵ-adversarial example in the metric d if there exists a point x̂ ∈ Ω with C(x̂) ≠ c and d(x, x̂) ≤ ϵ.

In plain words, a point has an ϵ-adversarial example if we can sneak it into a different class by moving it at most ϵ units in the distance d.
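The snippet below gives a small worked example of the metrics defined above, computed for a toy point and a perturbation of it; the specific values are arbitrary.

```python
# Worked example of the distance metrics used throughout the paper: l_p distances
# for p = 1, 2, inf, and the l_0 "distance," which counts modified coordinates.
import numpy as np

x    = np.array([0.2, 0.5, 0.9, 0.4])
xhat = np.array([0.2, 0.6, 0.9, 0.1])       # perturbs two of the four coordinates

d1   = np.abs(x - xhat).sum()               # l_1 distance: 0.4
d2   = np.sqrt(((x - xhat) ** 2).sum())     # l_2 distance: about 0.316
dinf = np.abs(x - xhat).max()               # l_inf distance: 0.3
d0   = int(np.count_nonzero(x - xhat))      # l_0 "distance": 2 coordinates changed
print(d1, d2, dinf, d0)
```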
We consider adversarial examples with respect to different ℓp-norm metrics. These metrics are written d_p(x, x̂) = ‖x − x̂‖_p. A common choice is p = ∞, which limits the absolute change that can be made to any one pixel. However, ℓ2-norm and ℓ1-norm adversarial examples are also used, as it is frequently easier to create adversarial examples in these less restrictive metrics.

[Figure 3 panels: fraction of the sphere captured within ϵ units (horizontal axis, 0 to 1) of a half sphere for several dimensions (e.g., dimension = 100), with curves for "experiment" and "theory".]
Figure 3: The ϵ-expansion of a half sphere nearly covers the whole sphere for small ϵ and large n. Visualizations show the fraction of the sphere captured within ϵ units of a half sphere in different dimensions. Results from a near-exact experimental method are compared to the theoretical lower bound in Lemma 2.

We also consider sparse adversarial examples, in which only a small subset of pixels is manipulated. This corresponds to the metric d_0, in which case the constraint ‖x − x̂‖_0 ≤ ϵ means that an adversarial example was crafted by changing at most ϵ pixels, and leaving the others alone.

3 THE SIMPLE CASE: ADVERSARIAL EXAMPLES ON THE UNIT SPHERE

We begin by looking at the case of classifiers for data on the sphere. While this data model may be less relevant than the other models studied below, it provides a straightforward case where results can be proven using simple, geometric lemmas. The more realistic case of images with pixels in [0, 1] will be studied in Section 4.

The idea is to show that, provided a class of data points takes up enough space, nearly every point in the class lies close to the class boundary. To show this, we begin with a simple definition.

Definition 2. The ϵ-expansion of a subset A ⊂ Ω with respect to distance metric d, denoted A(ϵ, d), contains all points that are at most ϵ units away from A. To be precise,

  A(ϵ, d) = {x ∈ Ω | d(x, y) ≤ ϵ for some y ∈ A}.

We sometimes simply write A(ϵ) when the distance metric is clear from context.

Our result provides bounds on the probability of adversarial examples that are independent of the shape of the class boundary. This independence is a simple consequence of an isoperimetric inequality. The classical isoperimetric inequality states that, of all closed surfaces that enclose a unit volume, the sphere has the smallest surface area. This simple fact is intuitive but famously difficult to prove. For a historical review of the isoperimetric inequality and its variants, see Osserman et al. (1978). We will use a special variant of the isoperimetric inequality first proved by Lévy & Pellegrino (1951) and simplified by Talagrand (1995).

Lemma 1 (Isoperimetric inequality). Consider a subset of the sphere A ⊂ S^{n−1} ⊂ R^n with normalized measure µ1(A) ≥ 1/2. When using the geodesic metric, the ϵ-expansion A(ϵ) is at least as large as the ϵ-expansion of a half sphere.

The classical isoperimetric inequality is a simple geometric statement, and frequently appears without absolute bounds on the size of the ϵ-expansion of a half sphere, or with bounds that involve unspecified constants (Vershynin, 2017). A tight bound derived by Milman & Schechtman (1986) is given below. The asymptotic blow-up of the ϵ-expansion of a half sphere predicted by this bound is shown in Figure 3.

Lemma 2 (ϵ-expansion of half sphere). The geodesic ϵ-expansion of a half sphere has normalized measure at least

  1 − √(π/8) · exp(−(n−2)ϵ²/2).

Lemmas 1 and 2 together can be taken to mean that, if a set is not too small, then in high dimensions almost all points on the sphere are reachable within a short ϵ jump from that set.
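A quick Monte Carlo experiment, in the spirit of Figure 3, makes this concrete. For the half sphere {x : x_1 ≤ 0}, a uniform random point lies within geodesic distance ϵ exactly when x_1 ≤ sin(ϵ), which gives a simple estimator to compare against the lower bound of Lemma 2. The dimensions, budget, and sample size below are arbitrary choices, and the closed-form sampling trick for x_1 is our own shortcut rather than the paper's "near-exact experimental method".

```python
# Monte Carlo estimate of the normalized measure of the geodesic eps-expansion of a
# half sphere, compared with the lower bound stated in Lemma 2.
import numpy as np

rng = np.random.default_rng(0)

def expansion_measure(n, eps, samples=200_000):
    # The first coordinate of a uniform point on S^{n-1} has the same distribution as
    # g / sqrt(g^2 + q), with g ~ N(0, 1) and q ~ chi-square with n-1 degrees of freedom.
    g = rng.normal(size=samples)
    q = rng.chisquare(n - 1, size=samples)
    x1 = g / np.sqrt(g * g + q)
    return float(np.mean(x1 <= np.sin(eps)))   # within geodesic eps of {x_1 <= 0}

def lemma2_bound(n, eps):
    return 1.0 - np.sqrt(np.pi / 8.0) * np.exp(-(n - 2) * eps**2 / 2.0)

eps = 0.3
for n in (10, 100, 1000):
    print(n, round(expansion_measure(n, eps), 4), round(lemma2_bound(n, eps), 4))
```

As n grows, both the empirical estimate and the bound rush toward 1 for a fixed small ϵ, which is the blow-up visualized in Figure 3.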
These lemmas have immediate implications for adversarial examples, which are formed by mapping one class into another using small perturbations. Despite its complex appearance, the result below is a consequence of the (relatively simple) isoperimetric inequality.

Theorem 1 (Existence of adversarial examples). Consider a classification problem with m object classes, each distributed over the unit sphere S^{n−1} ⊂ R^n with density functions {ρc}, c = 1, ..., m. Choose a classifier function C : S^{n−1} → {1, 2, ..., m} that partitions the sphere into disjoint measurable subsets. Define the following scalar constants:
Let Vc denote the magnitude of the supremum of ρc relative to the uniform density. This can be written Vc := s_{n−1} · sup_x ρc(x).
Let fc = µ1{x | C(x) = c} be the fraction of the sphere labeled as c by the classifier C.
Choose some class c with fc ≤ 1/2. Sample a random data point x from ρc. Then with probability at least

  1 − Vc √(π/8) exp(−(n−2)ϵ²/2),    (1)

one of the following conditions holds:
1. x is misclassified by C, or
2. x admits an ϵ-adversarial example in the geodesic distance.

Proof. Choose a class c with fc ≤ 1/2. Let R = {x | C(x) = c} denote the region of the sphere labeled as class c by C, and let R̄ be its complement. R̄(ϵ) is the ϵ-expansion of R̄ in the geodesic metric. Because R̄ covers at least half the sphere, the isoperimetric inequality (Lemma 1) tells us that its ϵ-expansion is at least as great as the ϵ-expansion of a half sphere. We thus have

  µ1[R̄(ϵ)] ≥ 1 − √(π/8) exp(−(n−2)ϵ²/2).

Now, consider the set Sc of "safe" points from class c that are correctly classified and do not admit adversarial perturbations. A point is correctly classified only if it lies inside R, and therefore outside of R̄. To be safe from adversarial perturbations, a point cannot lie within ϵ distance from the class boundary, and so it cannot lie within R̄(ϵ). It is clear that the set Sc of safe points is exactly the complement of R̄(ϵ). This set has normalized measure

  µ1[Sc] = 1 − µ1[R̄(ϵ)] ≤ √(π/8) exp(−(n−2)ϵ²/2).

The probability of a random point lying in Sc is bounded above by the normalized supremum of ρc times the normalized measure µ1[Sc]. This product is given by

  Vc √(π/8) exp(−(n−2)ϵ²/2).

We then subtract this probability from 1 to obtain the probability of a point lying outside the safe region, and arrive at equation 1.

In the above result, we measure the size of adversarial perturbations using the geodesic distance. Most studies of adversarial examples measure the size of perturbations in either the ℓ2 (Euclidean) norm or the ℓ∞ (max) norm, and so it is natural to wonder whether Theorem 1 depends strongly on the distance metric. Fortunately (or, rather, unfortunately) it does not. It is easily observed that, for any two points x and y on a sphere,

  d∞(x, y) ≤ d2(x, y) ≤ dg(x, y),

where d∞(x, y), d2(x, y), and dg(x, y) denote the ℓ∞, Euclidean, and geodesic distances, respectively. From this, we see that Theorem 1 is actually fairly conservative; any ϵ-adversarial example in the geodesic metric would also be adversarial in the other two metrics, and the bound in Theorem 1 holds regardless of which of the three metrics we choose (although different values of ϵ will be appropriate for different norms).
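The ordering of the three metrics is easy to verify numerically; the following sketch (with arbitrary dimension and sample count) checks it on random pairs of points drawn uniformly from the sphere.

```python
# Numerical check of the ordering d_inf <= d_2 <= d_geodesic for points on the unit
# sphere, which lets a geodesic-distance guarantee transfer to the l_2 and l_inf metrics.
import numpy as np

rng = np.random.default_rng(1)
n, trials = 50, 10_000
z = rng.normal(size=(trials, 2, n))
xy = z / np.linalg.norm(z, axis=2, keepdims=True)    # pairs of uniform points on S^{n-1}
x, y = xy[:, 0], xy[:, 1]

d_inf = np.abs(x - y).max(axis=1)
d_2   = np.linalg.norm(x - y, axis=1)
d_geo = np.arccos(np.clip((x * y).sum(axis=1), -1.0, 1.0))   # angle between unit vectors

print(bool(np.all(d_inf <= d_2 + 1e-12)), bool(np.all(d_2 <= d_geo + 1e-12)))   # True True
```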
4 WHAT ABOUT THE UNIT CUBE?

The above result about the sphere is simple and easy to prove using classical results. However, real-world images do not lie on the sphere. In a more typical situation, images will be scaled so that their pixels lie in [0, 1], and data lies inside a high-dimensional hypercube (but, unlike the sphere, data is not confined to its surface). The proof of Theorem 1 makes extensive use of properties that are exclusive to the sphere, and is not applicable to this more realistic setting.

Are there still problem classes on the cube where adversarial examples are inevitable? This question is complicated by the fact that geometric isoperimetric inequalities do not exist for the cube, as the shapes that achieve minimal ϵ-expansion (if they exist) depend on the volume they enclose and the choice of ϵ (Ros, 2001). Fortunately, researchers have been able to derive algebraic isoperimetric inequalities that provide lower bounds on the size of the ϵ-expansion of sets without identifying the shape that achieves this minimum (Talagrand, 1996; Milman & Schechtman, 1986). The result below about the unit cube is analogous to Proposition 2.8 in Ledoux (2001), except with tighter constants. For completeness, a proof (which utilizes methods from Ledoux) is provided in Appendix A.

Lemma 3 (Isoperimetric inequality on a cube). Consider a measurable subset of the cube A ⊂ [0, 1]^n, and a p-norm distance metric d_p(x, y) = ‖x − y‖_p for p > 0. Let Φ(z) = (2π)^{−1/2} ∫_{−∞}^z e^{−t²/2} dt, let α be the scalar that satisfies Φ(α) = vol[A], and let p* = min(p, 2). Then

  vol[A(ϵ, d_p)] ≥ Φ(α + √(2π) n^{1/2−1/p*} ϵ).    (2)

In particular, if vol(A) ≥ 1/2, then we simply have

  vol[A(ϵ, d_p)] ≥ 1 − exp(−π n^{1−2/p*} ϵ²) / (2π n^{1/2−1/p*} ϵ).    (3)

Using this result, we can show that most data samples in a cube admit adversarial examples, provided the data distribution is not excessively concentrated.

Theorem 2 (Adversarial examples on the cube). Consider a classification problem with m classes, each distributed over the unit hypercube [0, 1]^n with density functions {ρc}, c = 1, ..., m. Choose a classifier function C : [0, 1]^n → {1, 2, ..., m} that partitions the hypercube into disjoint measurable subsets. Define the following scalar constants:
Let Uc denote the supremum of ρc.
Let fc be the fraction of the hypercube partitioned into class c by C.
Choose some class c with fc ≤ 1/2, and select an ℓp-norm with p > 0. Define p* = min(p, 2). Sample a random data point x from the class distribution ρc. Then with probability at least

  1 − Uc exp(−π n^{1−2/p*} ϵ²) / (2π n^{1/2−1/p*} ϵ),    (4)

one of the following conditions holds:
1. x is misclassified by C, or
2. x has an adversarial example x̂, with ‖x − x̂‖_p ≤ ϵ.

When adversarial examples are defined in the ℓ2-norm (or for any p ≥ 2), the bound in equation 4 becomes

  1 − Uc exp(−πϵ²) / (2πϵ).    (5)

Provided the class distribution is not overly concentrated, equation 5 guarantees adversarial examples with ϵ relatively small compared to a typical vector. In n dimensions, the ℓ2 diameter of the cube is √n, and so it is reasonable to choose ϵ = O(√n) in equation 5. In Figure 1, we chose ϵ = 10. A similarly strong bound of 1 − Uc √n exp(−πϵ²/n)/(2πϵ) holds for the case of the ℓ1-norm, in which case the diameter is n.

Oddly, equation 4 seems particularly weak when the ℓ∞-norm is used. In this case, the bound on the right side of equation 4 becomes equation 5, just like in the ℓ2 case. However, ℓ∞ adversarial examples are only interesting if we take ϵ < 1, in which case equation 5 becomes vacuous for large n.
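To get a feel for equation 5, the snippet below evaluates the bound for a few illustrative values of Uc and ϵ; these values are our own choices, not estimates from real data. The last line shows the regime issue just discussed: for budgets ϵ < 1, the guarantee survives only when the density bound Uc is quite small.

```python
# Evaluate the probability bound of equation 5 (the l_2 case of Theorem 2).
import numpy as np

def eq5_bound(U_c, eps):
    """Lower bound on P(misclassified or an eps-adversarial example exists), eq. (5)."""
    return 1.0 - U_c * np.exp(-np.pi * eps**2) / (2.0 * np.pi * eps)

for U_c in (1.0, 10.0, 100.0):
    print(U_c, [round(eq5_bound(U_c, e), 4) for e in (1.0, 2.0, 3.0, 10.0)])

# In the l_inf regime eps < 1, even a moderately concentrated class makes the bound
# negative, i.e., vacuous.
print(eq5_bound(1000.0, 0.5))
```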
This bound can be tightened up in certain situations. If we prove Theorem 2 using the tighter (but messier) bound of equation 2 instead of equation 3, we can replace equation 4 with

  1 − Uc Φ̂(α + √(2π) ϵ)   for p ≥ 2,

where Φ̂(z) = (2π)^{−1/2} ∫_z^∞ e^{−t²/2} dt ≤ exp(−z²/2)/(√(2π) z) (for z > 0), and α = Φ^{−1}(1 − fc). For this bound to be meaningful with ϵ < 1, we need fc to be relatively small, and ϵ to be roughly fc or smaller. This is realistic for some problems; ImageNet has 1000 classes, and so fc ≤ 10^{−3} for at least one class.

Interestingly, under ℓ∞-norm attacks, guarantees of adversarial examples are much stronger on the sphere (Section 3) than on the cube. One might wonder whether this weakness in the ℓ∞ case is fundamental, or whether it is a failure of our approach. One can construct examples of sets whose ℓ∞ expansions nearly match the behavior of equation 5, and so our theorems in this case are actually quite tight. It seems to be inherently more difficult to prove the existence of adversarial examples in the cube using the ℓ∞-norm.

5 WHAT ABOUT SPARSE ADVERSARIAL EXAMPLES?

A number of papers have looked at sparse adversarial examples, in which a small number of image pixels, in some cases only one (Su et al., 2017), are changed to manipulate the class label. To study this case, we would like to investigate adversarial examples under the ℓ0 metric. The ℓ0 distance is defined as

  d(x, y) = ‖x − y‖_0 = card{i | x_i ≠ y_i}.

If a point x has an ϵ-adversarial example in this norm, then it can be perturbed into a different class by modifying at most ϵ pixels (in this case ϵ is taken to be a positive integer).

Theorem 2 is fairly tight for p = 1 or 2. However, the bound becomes quite loose for small p, and in particular it fails completely for the important case of p = 0. For this reason, we present a different bound that is considerably tighter for small p (although slightly looser for large p). The case p = 0 was studied by Milman & Schechtman (1986) (Section 6.2) and McDiarmid (1989), and later by Talagrand (1995; 1996). The proof of the following lemma (Appendix B) follows the method used in Section 5 of Talagrand (1996), with modifications made to extend the proof to arbitrary p.

Lemma 4 (Isoperimetric inequality on the cube: small p). Consider a measurable subset of the cube A ⊂ [0, 1]^n, and a p-norm distance metric d(x, y) = ‖x − y‖_p for any p ≥ 0. We have

  vol[A(ϵ, d_p)] ≥ 1 − exp(−ϵ^{2p}/n) / vol[A]   for p > 0, and    (6)
  vol[A(ϵ, d_0)] ≥ 1 − exp(−ϵ²/n) / vol[A]   for p = 0.    (7)

Using this result, we can prove a statement analogous to Theorem 2, but for sparse adversarial examples. We present only the case of p = 0, but the generalization to the case of other small p using Lemma 4 is straightforward.

Theorem 3 (Sparse adversarial examples). Consider the problem setup of Theorem 2. Choose some class c with fc ≤ 1/2, and sample a random data point x from the class distribution ρc. Then with probability at least

  1 − 2 Uc exp(−ϵ²/n),    (8)

one of the following conditions holds:
1. x is misclassified by C, or
2. x can be adversarially perturbed by modifying at most ϵ pixels, while still remaining in the unit hypercube.
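The bound of Theorem 3 is easy to evaluate numerically. In the sketch below, ϵ = 5000 matches the sparse example in Figure 1, n = 224·224·3 assumes a standard ResNet50 input size, and the values of Uc are arbitrary illustrative choices.

```python
# Evaluate the sparse-attack bound of Theorem 3: 1 - 2 * U_c * exp(-eps^2 / n).
import math

def theorem3_bound(U_c, eps, n):
    return 1.0 - 2.0 * U_c * math.exp(-eps**2 / n)

n = 224 * 224 * 3                     # pixel count of a typical ResNet50 input (assumed)
for U_c in (1.0, 1e6, 1e30):
    print(U_c, theorem3_bound(U_c, eps=5000, n=n))

# For comparison, a budget of only a handful of pixels gives a vacuous bound.
print(theorem3_bound(1.0, eps=5, n=n))
```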
6 WHAT IF WE JUST SHOW THAT ADVERSARIAL EXAMPLES EXIST?

Tighter bounds can be obtained if we only guarantee that adversarial examples exist for some data points in a class, without bounding the probability of this event.

Theorem 4 (Condition for existence of adversarial examples). Consider the setup of Theorem 2. Choose a class c that occupies a fraction of the cube fc < 1/2. Pick an ℓp-norm and set p* = min(p, 2). Let supp(ρc) denote the support of ρc. Then there is a point x with ρc(x) > 0 that admits an ϵ-adversarial example if

  vol[supp(ρc)] ≥ ½ exp(−πϵ² n^{1−2/p*})   for p > 0, or
  vol[supp(ρc)] ≥ exp(−(ϵ √(2/n) − √(log 2))²)   for p = 0.    (9)

The bound for the case p = 0 is valid only if ϵ ≥ √(n log 2 / 2).

It is interesting to consider when Theorem 4 produces non-vacuous bounds. When the ℓ2-norm is used, the bound becomes vol[supp(ρc)] ≥ ½ exp(−πϵ²). The diameter of the cube is √n, and so the bound becomes active for ϵ = √n. Plugging this in, we see that the bound is active whenever the size of the support satisfies vol[supp(ρc)] > ½ e^{−πn}. Remarkably, this holds for large n whenever the support of class c is larger than (or contains) a hypercube of side length at least e^{−π} ≈ 0.043. Note, however, that the bound being active does not guarantee adversarial examples with a small ϵ.

7 DISCUSSION: CAN WE ESCAPE FUNDAMENTAL BOUNDS?

There are a number of ways to escape the guarantees of adversarial examples made by Theorems 1-4. One potential escape is for the class density functions to take on extremely large values (i.e., exponentially large Uc); the dependence of Uc on n is addressed separately in Section 8.

Unbounded density functions and low-dimensional data manifolds. In practice, image datasets might lie on low-dimensional manifolds within the cube, and the support of these distributions could have measure zero, making the density function infinite (i.e., Uc = ∞). The arguments above are still relevant (at least in theory) in this case; we can expand the data manifold by adding uniform random noise of magnitude at most ϵ1 to each image pixel. The expanded dataset has positive volume. Then, adversarial examples of this expanded dataset can be crafted with perturbations of size ϵ2, which in turn yields perturbations of size at most ϵ1 + ϵ2 for the original data points. This method of expanding the manifold before crafting adversarial examples is often used in practice. Tramèr et al. (2017a) proposed adding a small perturbation to step off the image manifold before crafting adversarial examples. This strategy is also used during adversarial training (Madry et al., 2017).

Adding a "don't know" class. The analysis above assumes the classifier assigns a label to every point in the cube. If a classifier has the ability to say "I don't know," rather than assign a label to every input, then the region of the cube that is assigned class labels might be very small, and adversarial examples could be escaped even if the other assumptions of Theorem 4 are satisfied. In this case, it would still be easy for the adversary to degrade classifier performance by perturbing images into the "don't know" class.

Feature squeezing. If decreasing the dimensionality of data does not lead to substantially increased values for Uc (we see in Section 8 that this is a reasonable assumption) or loss in accuracy (a stronger assumption), measuring data in lower dimensions could increase robustness. This can be done via an auto-encoder (Meng & Chen, 2017; Shen et al., 2017), JPEG encoding (Das et al., 2018), or quantization (Xu et al., 2017).

Computational hardness. It may be computationally hard to craft adversarial examples because of local flatness of the classification function, obscurity of the classifier function, or other computational difficulties. Computational hardness could prevent adversarial attacks in practice, even if adversarial examples still exist.
8 EXPERIMENTS & EFFECT OF DIMENSIONALITY

In this section, we discuss the relationship between dimensionality and adversarial robustness, and explore how the predictions made by the theorems above are reflected in experiments.

It is commonly thought that high-dimensional classifiers are more susceptible to adversarial examples than low-dimensional classifiers. This perception is partially motivated by the observation that classifiers on high-resolution image distributions like ImageNet are more easily fooled than low-resolution classifiers on MNIST (Tramèr et al., 2017a). Indeed, Theorem 2 predicts that high-dimensional classifiers should be much easier to fool than low-dimensional classifiers, assuming the datasets they classify have comparable probability density limits Uc. However, this is not a reasonable assumption; we will see below that high-dimensional distributions may be more concentrated than their low-dimensional counterparts.

We study the effects of dimensionality with a thought experiment involving a "big MNIST" image distribution. Given an integer expansion factor b, we can make a big MNIST distribution, denoted b-MNIST, by replacing each pixel in an MNIST image with a b × b array of identical pixels. This expands an original 28 × 28 image into a 28b × 28b image. Figure 4a shows that, without adversarial training, a classifier on big MNIST is far more susceptible to attacks than a classifier trained on the original MNIST.¹

However, each curve in Figure 4a only shows the attack susceptibility of one particular classifier. In contrast, Theorems 1-4 describe the fundamental limits of susceptibility for all classifiers. These limits are an inherent property of the data distribution. The theorem below shows that these fundamental limits do not depend in a non-trivial way on the dimensionality of the images in big MNIST, and so the relationship between dimensionality and susceptibility in Figure 4a results from the weakness of the training process.

Theorem 5. Suppose ϵ and p are such that, for all MNIST classifiers, a random image from class c has an ϵ-adversarial example (in the ℓ2-norm) with probability at least p. Then for all classifiers on b-MNIST, with integer b ≥ 1, a random image from c has a bϵ-adversarial example with probability at least p. Likewise, if all b-MNIST classifiers have bϵ-adversarial examples with probability p for some b ≥ 1, then all classifiers on the original MNIST distribution have ϵ-adversarial examples with probability p.

Theorem 5 predicts that the perturbation needed to fool all 56 × 56 classifiers is twice that needed to fool all 28 × 28 classifiers. This is reasonable since the ℓ2-norm of a 56 × 56 image is twice that of its 28 × 28 counterpart. Put simply, fooling big MNIST is just as hard/easy as fooling the original MNIST, regardless of resolution. This also shows that for big MNIST, as the expansion factor b gets larger and ϵ is expanded to match, the concentration bound Uc grows at exactly the same rate as the exponential term in equation 2 shrinks, and there is no net effect on fundamental susceptibility. Also note that an analogous result could be based on any image classification problem (we chose MNIST only for illustration), and any p ≥ 0.
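The norm scaling underlying Theorem 5 can be checked directly: replicating each pixel into a b × b block multiplies the ℓ2-norm of any perturbation by exactly b. The sketch below uses a random array as a stand-in for an MNIST perturbation.

```python
# Pixel replication (the "big MNIST" construction) scales the l_2 norm of a
# perturbation by exactly the expansion factor b.
import numpy as np

rng = np.random.default_rng(0)
delta = rng.normal(size=(28, 28)) * 0.1          # a perturbation of a 28x28 image
for b in (1, 2, 4):
    big_delta = np.kron(delta, np.ones((b, b)))  # 28b x 28b pixel-replicated version
    print(b, round(np.linalg.norm(big_delta) / np.linalg.norm(delta), 6))   # prints b
```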
We get a better picture of the fundamental limits of MNIST by considering classifiers that are hardened by adversarial training² (Figure 4b). These curves display several properties of the fundamental limits predicted by our theorems. As predicted by Theorem 5, the 112 × 112 classifier curve is twice as wide as the 56 × 56 curve, which in turn is twice as wide as the 28 × 28 curve. In addition, we see the kind of phase-transition behavior predicted by Theorem 2, in which the classifier suddenly changes from being highly robust to being highly susceptible as ϵ passes a critical threshold. For these reasons, it is reasonable to suspect that the adversarially trained classifiers in Figure 4b are operating near the fundamental limit predicted by Theorem 2.

Theorem 5 shows that increased dimensionality does not increase adversarial susceptibility in a fundamental way. But then why are high-dimensional classifiers so easy to fool? To answer this question, we look at the concentration bound Uc for object classes. The smallest possible value of Uc is 1, which only occurs when images are spread out with uniform, uncorrelated pixels. In contrast, adjacent pixels in MNIST (and especially big MNIST) are very highly correlated, and images are concentrated near simple, low-dimensional manifolds, resulting in highly concentrated image classes with large Uc. Theory predicts that such highly concentrated datasets can be relatively safe from adversarial examples.

¹ Only the fully-connected layer is modified to handle the difference in dimensionality between datasets.
² Adversarial examples for MNIST/CIFAR-10 were produced as in Madry et al. (2017) using 100-step/20-step PGD.

[Figure 4 panels: (a) MNIST at different resolutions, (b) MNIST with adversarial training, (c) CIFAR-10 vs. big MNIST. Each panel plots accuracy on adversarial examples (%) against ϵ, with curves for MNIST 28x28/56x56/112x112 classifiers (naturally trained with LS=0.1, and adversarially trained) and for naturally and adversarially trained CIFAR-10 classifiers.]
Figure 4: (a) Robustness of MNIST and big MNIST classifiers as a function of ϵ. Naturally trained classifiers are less robust with increased dimensionality. (b) With adversarial training, susceptibility curves behave as predicted by Theorems 2 and 5. (c) The susceptibility of CIFAR-10 is compared to big MNIST. Both datasets have similar dimension, but the higher complexity of CIFAR-10 results in far worse susceptibility. Perturbations are measured in the ℓ2-norm.

We can reduce Uc, and dramatically increase susceptibility, by choosing a more spread-out dataset, like CIFAR-10, in which adjacent pixels are less strongly correlated and images appear to concentrate near complex, higher-dimensional manifolds. We observe the effect of decreasing Uc by plotting the susceptibility of a 56 × 56 MNIST classifier against a classifier for CIFAR-10 (Figure 4, right). The former problem lives in 3136 dimensions, while the latter lives in 3072, and both have 10 classes. Despite the structural similarities between these problems, the decreased concentration of CIFAR-10 results in vastly more susceptibility to attacks, regardless of whether adversarial training is used. The theory above suggests that this increased susceptibility is caused at least in part by a shift in the fundamental limits for CIFAR-10, rather than by the weakness of the particular classifiers we chose. Informally, the concentration limit Uc can be interpreted as a measure of image complexity.
Image classes with smaller Uc are likely concentrated near high-dimensional complex manifolds, have more intra-class variation, and thus more apparent complexity. An informal interpretation of Theorem 2 is that high-complexity image classes are fundamentally more susceptible to adversarial examples, and Figure 4 suggests that complexity (rather than dimensionality) is largely responsible for the differences we observe in the effectiveness of adversarial training for different datasets.

9 SO... ARE ADVERSARIAL EXAMPLES INEVITABLE?

The question of whether adversarial examples are inevitable is an ill-posed one. Clearly, any classification problem has a fundamental limit on robustness to adversarial attacks that cannot be escaped by any classifier. However, we have seen that these limits depend not only on fundamental properties of the dataset, but also on the strength of the adversary and the metric used to measure perturbations. This paper provides a characterization of these limits and how they depend on properties of the data distribution. Unfortunately, it is impossible to know the exact properties of real-world image distributions or the resulting fundamental limits of adversarial training for specific datasets. However, the analysis and experiments in this paper suggest that, especially for complex image classes in high-dimensional spaces, these limits may be far worse than our intuition tells us.

REFERENCES

Milton Abramowitz and Irene A Stegun. Handbook of mathematical functions: with formulas, graphs, and mathematical tables, volume 55. Courier Corporation, 1965.

Anish Athalye and Ilya Sutskever. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397, 2017.

Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.

Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 387-402. Springer, 2013.

Sergey G Bobkov et al. An isoperimetric inequality on the discrete cube, and an elementary proof of the isoperimetric inequality in Gauss space. The Annals of Probability, 25(1):206-214, 1997.

Jacob Buckman, Aurko Roy, Colin Raffel, and Ian Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. OpenReview, 2018.

Nicholas Carlini and David Wagner. Defensive distillation is not robust to adversarial examples. arXiv preprint arXiv:1607.04311, 2016.

Nicholas Carlini and David Wagner. MagNet and "Efficient Defenses Against Adversarial Attacks" are not robust to adversarial examples. arXiv preprint arXiv:1711.08478, 2017a.

Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pp. 39-57. IEEE, 2017b.

Nilaksh Das, Madhuri Shanbhogue, Shang-Tse Chen, Fred Hohman, Siwei Li, Li Chen, Michael E Kounavis, and Duen Horng Chau. Shield: Fast, practical defense and vaccination for deep learning using JPEG compression. arXiv preprint arXiv:1802.06816, 2018.

Guneet S Dhillon, Kamyar Azizzadenesheli, Zachary C Lipton, Jeremy Bernstein, Jean Kossaifi, Aran Khanna, and Anima Anandkumar. Stochastic activation pruning for robust adversarial defense. arXiv preprint arXiv:1803.01442, 2018.
Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. Adversarial vulnerability for any classifier. arXiv preprint arXiv:1802.08686, 2018.

Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres. arXiv preprint arXiv:1801.02774, 2018.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117, 2017.

J Zico Kolter and Eric Wong. Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv preprint arXiv:1711.00851, 2017.

Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

Michel Ledoux. The Concentration of Measure Phenomenon. Number 89. American Mathematical Soc., 2001.

Paul Lévy and Franco Pellegrino. Problèmes concrets d'analyse fonctionnelle. Gauthier-Villars, Paris, 1951.

Xingjun Ma, Bo Li, Yisen Wang, Sarah M Erfani, Sudanthi Wijewickrema, Michael E Houle, Grant Schoenebeck, Dawn Song, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613, 2018.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Saeed Mahloujifar, Dimitrios I Diochnos, and Mohammad Mahmoody. The curse of concentration in robust learning: Evasion and poisoning attacks from concentration of measure. arXiv preprint arXiv:1809.03063, 2018.

C McDiarmid. On the method of bounded differences. London Mathematical Society Lecture Notes, 141:148-188, 1989.

Dongyu Meng and Hao Chen. MagNet: a two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 135-147. ACM, 2017.

Vitali D Milman and Gideon Schechtman. Asymptotic Theory of Finite Dimensional Normed Spaces. Springer-Verlag, Berlin, Heidelberg, 1986. ISBN 0-387-16769-2.

Robert Osserman et al. The isoperimetric inequality. Bulletin of the American Mathematical Society, 84(6):1182-1238, 1978.

Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016a.

Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, pp. 582-597. IEEE, 2016b.

Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018.

Antonio Ros. The isoperimetric problem. Global Theory of Minimal Surfaces, 2:175-209, 2001.

Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605, 2018.

Shiwei Shen, Guoqing Jin, Ke Gao, and Yongdong Zhang. APE-GAN: Adversarial perturbation elimination with GAN. ICLR Submission, available on OpenReview, 4, 2017.

Carl-Johann Simon-Gabriel, Yann Ollivier, Bernhard Schölkopf, Léon Bottou, and David Lopez-Paz.
Adversarial vulnerability of neural networks increases with input dimension. arXiv preprint arXiv:1802.01421, 2018.

Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. OpenReview, 2018.

Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766, 2017.

Jiawei Su, Danilo Vasconcellos Vargas, and Sakurai Kouichi. One pixel attack for fooling deep neural networks. arXiv preprint arXiv:1710.08864, 2017.

Vladimir Sudakov and Boris Tsirelson. Extremal properties of half-spaces for spherically invariant measures. J. Soviet Math., 41, 1974.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Michel Talagrand. Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l'Institut des Hautes Études Scientifiques, 81(1):73-205, 1995.

Michel Talagrand. A new look at independence. The Annals of Probability, pp. 1-34, 1996.

Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017a.

Florian Tramèr, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017b.

Roman Vershynin. High-Dimensional Probability. Cambridge University Press, 2017.

Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Yuille. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991, 2017.

Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155, 2017.

Valentina Zantedeschi, Maria-Irina Nicolae, and Ambrish Rawat. Efficient defenses against adversarial attacks. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 39-49. ACM, 2017.

A PROOF OF LEMMA 3

We now prove Lemma 3. To do this, we begin with a classical isoperimetric inequality for random Gaussian variables. Unlike the case of a cube, tight geometric isoperimetric inequalities exist in this case. We then prove results about the cube by creating a mapping between uniform random variables on the cube and random Gaussian vectors.

In the lemma below, we consider the standard Gaussian density in R^n given by

  p(x) = (2π)^{−n/2} exp(−‖x‖_2²/2)

and the corresponding Gaussian measure µ. We also define

  Φ(z) = (2π)^{−1/2} ∫_{−∞}^z e^{−t²/2} dt,

which is the cumulative density of a Gaussian curve. The following lemma was first proved in Sudakov & Tsirelson (1974), and an elementary proof was given in Bobkov et al. (1997).

Lemma 5 (Gaussian isoperimetric inequality). Of all sets with the same Gaussian measure, the set with ℓ2 ϵ-expansion of smallest measure is a half space. Furthermore, for any measurable set A ⊂ R^n and scalar constant a such that Φ(a) = µ[A],

  µ[A(ϵ, d2)] ≥ Φ(a + ϵ).

Using this result, we can now give a proof of Lemma 3. Let Φ also denote the map from R^n to the unit cube obtained by applying the scalar function Φ to each coordinate of a vector. This function Φ maps a random Gaussian vector z ∼ N(0, I) onto a random uniform vector in the unit cube. To see why, consider a measurable subset B ⊂ R^n.
If µ is the Gaussian measure on R^n and σ is the uniform measure on the cube, then

  σ[Φ(B)] = ∫ χ_{Φ(B)}(s) dσ(s) = ∫ χ_{Φ(B)}(Φ(z)) det(J_Φ(z)) dz = ∫ χ_B(z) dµ(z) = µ[B],

since the Jacobian determinant det(J_Φ(z)) = Π_i (2π)^{−1/2} e^{−z_i²/2} is exactly the Gaussian density. Since ∂Φ(z)_i/∂z_i ≤ 1/√(2π), we also have

  ‖Φ(z) − Φ(w)‖_p ≤ (1/√(2π)) ‖z − w‖_p

for any z, w ∈ R^n. From this, we see that for p* = min(p, 2),

  ‖Φ(z) − Φ(w)‖_p ≤ n^{1/p*−1/2} ‖Φ(z) − Φ(w)‖_2 ≤ (n^{1/p*}/√(2πn)) ‖z − w‖_2,    (10)

where we have used the identity ‖u‖_p ≤ n^{1/min(p,2)−1/2} ‖u‖_2. Now, consider any set A in the cube, and let B = Φ^{−1}(A). From equation 10, we see that

  Φ( B(√(2πn) n^{−1/p*} ϵ, d2) ) ⊂ A(ϵ, d_p).

It follows from equation 10 that

  σ[A(ϵ, d_p)] ≥ µ[ B(√(2πn) n^{−1/p*} ϵ, d2) ].

Applying Lemma 5, we see that

  σ[A(ϵ, d_p)] ≥ Φ(α + √(2π) n^{1/2−1/p*} ϵ),

where α = Φ^{−1}(σ[A]). To obtain the simplified formula in the theorem, we use the identity

  1 − Φ(x) = (2π)^{−1/2} ∫_x^∞ e^{−t²/2} dt < e^{−x²/2} / (√(2π) x),

which is valid for x > 0 and can be found in Abramowitz & Stegun (1965).

B PROOF OF LEMMA 4

Our proof emulates the method of Talagrand, with minor modifications that extend the result to other ℓp norms. We need the following standard inequality; a proof can be found in Talagrand (1995; 1996).

Lemma 6 (Talagrand). Consider a probability space Ω with measure µ. For g : Ω → [0, 1] and scalars t, α > 0, we have

  ( ∫_Ω min(e^t, g^{−α}) dµ ) · ( ∫_Ω g dµ )^α ≤ exp( t²(α + 1)/(8α) ).

Our proof of Lemma 4 follows the three-step process of Talagrand illustrated in Talagrand (1995). We begin by proving the bound

  ∫ e^{t f(x, A)} dx ≤ (1/σ^α[A]) exp( n t²(α + 1)/(8α) ),    (12)

where f(x, A) = min_{y∈A} Σ_i |x_i − y_i|^p = d_p^p(x, A) is a measure of distance from A to x, and α, t are arbitrary positive constants. Once this bound is established, a Markov bound can be used to obtain the final result. Finally, constants are tuned in order to optimize the tightness of the bound.

We start by proving the bound in equation 12 using induction on the dimension. The base case for the induction is n = 1, and we have

  ∫ e^{t f(x, A)} dx ≤ σ[A] + ∫_{A^c} e^{t f(x, A)} dx ≤ σ[A] + e^t ∫_{A^c} 1 dx = σ[A] + (1 − σ[A]) e^t ≤ (1/σ^α[A]) exp( t²(α + 1)/(8α) ).

We now prove the result for n dimensions using the inductive hypothesis. We can upper bound the integral by integrating over slices along one dimension. Let A ⊂ [0, 1]^n. Define A_ω = {z ∈ R^{n−1} | (ω, z) ∈ A} and B = {z ∈ R^{n−1} | (ω, z) ∈ A for some ω}.

Clearly, the distance from (ω, z) to A is at most the distance from z to A_ω, and so

  ∫ e^{t f(x, A)} dx ≤ ∫_ω ∫_{z∈[0,1]^{n−1}} e^{t f(z, A_ω)} dz dω ≤ ∫_ω (1/σ^α[A_ω]) exp( (n−1) t²(α + 1)/(8α) ) dω.

We also have that the distance from x to A is at most one unit greater than the distance from x to B. This gives us

  ∫ e^{t f(x, A)} dx ≤ ∫_{(ω,z)∈[0,1]^n} e^{t (f(x, B) + 1)} dx = e^t ∫_{(ω,z)∈[0,1]^n} e^{t f(x, B)} dx ≤ e^t (1/σ^α[B]) exp( (n−1) t²(α + 1)/(8α) ).

Combining these two bounds gives us

  ∫ e^{t f(x, A)} dx ≤ ∫_{ω∈[0,1]} min( e^t / σ^α[B], 1/σ^α[A_ω] ) exp( (n−1) t²(α + 1)/(8α) ) dω
                    = exp( (n−1) t²(α + 1)/(8α) ) (1/σ^α[B]) ∫_{ω∈[0,1]} min( e^t, σ^α[B]/σ^α[A_ω] ) dω.    (13)

Now, we apply Lemma 6 to equation 13 with g(ω) = σ[A_ω]/σ[B] to arrive at equation 12.

The second step of the proof is to produce a Markov inequality from equation 12. Using the bound in equation 12 together with Markov's inequality, we have

  1 − σ[A(ϵ, d_p)] = σ{x | f(x, A) > ϵ^p} ≤ ( ∫ e^{t f(x, A)} dx ) / e^{t ϵ^p} ≤ exp( n t²(α + 1)/(8α) ) / ( σ^α[A] e^{t ϵ^p} ).    (14)

The third step is to optimize the bound by choosing constants. We minimize the right-hand side by choosing t = 4αϵ^p / (n(α + 1)) to get

  1 − σ[A(ϵ, d_p)] ≤ exp( −2αϵ^{2p} / (n(α + 1)) ) / σ^α[A].    (15)

Now, we can simply choose α = 1 to get the simple bound

  1 − σ[A(ϵ, d_p)] ≤ exp( −ϵ^{2p}/n ) / σ[A],    (16)

or we can choose the optimal value α = √( 2ϵ^{2p} / (n log(1/σ[A])) ) − 1, which optimizes the bound in the case ϵ^{2p} ≥ (n/2) log(1/σ[A]). We arrive at

  1 − σ[A(ϵ, d_p)] ≤ exp( −( ϵ^p √(2/n) − √(log(1/σ[A])) )² ).    (17)

This latter bound is stronger than we need to prove Lemma 4, but it will come in handy later to prove Theorem 4.
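The change of variables at the heart of the Lemma 3 proof is easy to check numerically: applying the Gaussian cumulative distribution function coordinatewise sends a standard Gaussian vector to a point that is uniform in the cube. The dimension, sample size, and the statistics printed below are arbitrary choices.

```python
# Monte Carlo check that the coordinatewise Gaussian CDF maps N(0, I) to the uniform
# distribution on the cube [0, 1]^n.
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
z = rng.normal(size=(100_000, 5))                          # standard Gaussian vectors in R^5
u = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))         # Phi applied to each coordinate

print(u.min() > 0.0, u.max() < 1.0)                        # all values land strictly inside the cube
print(np.allclose(u.mean(axis=0), 0.5, atol=0.01))         # mean of U[0, 1] is 1/2
print(np.allclose(u.var(axis=0), 1.0 / 12.0, atol=0.01))   # variance of U[0, 1] is 1/12
```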
C PROOF OF THEOREMS 2 AND 3

We combine the proofs of these results since they are nearly identical, and they closely follow the argument of Theorem 1. Choose a class c with fc ≤ 1/2 and let R = {x | C(x) = c} denote the subset of the cube lying in class c according to the classifier C. Let R̄ be its complement, whose ℓp-expansion is denoted R̄(ϵ, d_p). Because R̄ covers at least half the cube, we can invoke Lemma 3 (for Theorem 2) or Lemma 4 (for Theorem 3). We have that

  vol[R̄(ϵ, d_p)] ≥ 1 − δ,  where δ = exp(−π n^{1−2/p*} ϵ²) / (2π n^{1/2−1/p*} ϵ) for Theorem 2, and δ = 2 exp(−ϵ²/n) for Theorem 3.    (18)

The complement of the set R̄(ϵ, d_p) contains all points that are correctly classified and safe from adversarial perturbations. This region has volume at most δ, and the probability of a sample from the class distribution ρc lying in this region is at most Uc δ. We then subtract this from 1 to obtain the mass of the class distribution lying in the unsafe region.

D PROOF OF THEOREM 4

Let A denote the support of ρc, and suppose that this support has measure vol[A] = η. We want to show that, for large enough ϵ, the expansion A(ϵ, d_p) is larger than half the cube. Since class c occupies less than half the cube, this would imply that A(ϵ, d_p) overlaps with other classes, and so there must be data points in A with ϵ-adversarial examples.

We start with the case p > 0, where we bound A(ϵ, d_p) using equation 2 of Lemma 3. To do this, we need to approximate Φ^{−1}(η). This can be done using the inequality Φ(α) ≥ ¼ exp(−α²), which holds for α < 0. Rearranging, we obtain

  α ≥ −√( log( 1/(4Φ(α)²) ) ).    (19)

Now, if α = Φ^{−1}(η), then Φ(α) = η, and equation 19 gives us α ≥ −√( log(1/(4η²)) ). Plugging this into equation 2 of Lemma 3, we get

  vol[A(ϵ, d_p)] ≥ Φ( α + √(2π) n^{1/2−1/p*} ϵ ) ≥ Φ( √(2π) n^{1/2−1/p*} ϵ − √( log(1/(4η²)) ) ).

The quantity on the left will be greater than 1/2, thus guaranteeing adversarial examples, if

  √(2π) n^{1/2−1/p*} ϵ > √( log(1/(4η²)) ).

This can be re-arranged to obtain the desired result.

In the case p = 0, we need to use equation 17 from the proof of Lemma 4 in Appendix B, which we restate here:

  vol[A(ϵ, d_0)] ≥ 1 − exp( −( ϵ √(2/n) − √(log(1/η)) )² ).

This bound is valid, and produces a non-vacuous guarantee of adversarial examples, if exp( −( ϵ √(2/n) − √(log(1/η)) )² ) < 1/2, which holds if ϵ ≥ √(n log 2 / 2) + √(n log(1/η) / 2).

E PROOF OF THEOREM 5

Assume that any MNIST classifier can be fooled by perturbations of size at most ϵ with probability at least p. To begin, we put a bound on the susceptibility of any b-MNIST classifier (for b ≥ 1) under this assumption. We can classify MNIST images by upsampling them to resolution 28b × 28b and feeding them into a high-resolution back-end classifier. After upsampling, an MNIST image with perturbation of norm ϵ becomes a 28b × 28b image with perturbation of norm bϵ. The classifier we have constructed takes low-resolution images as inputs, and so by assumption it is fooled with probability at least p. However, the low-resolution classifier is fooled only when the high-resolution back-end classifier is fooled, and so the high-resolution classifier is fooled with probability at least p as well. Note that we can build this two-scale classifier using any high-resolution classifier as a back-end, and so this bound holds uniformly over all high-resolution classifiers.

Likewise, suppose we classify b-MNIST images (for integer b ≥ 1) by downsampling them to the original 28 × 28 resolution (by averaging pixel blocks) and feeding them into a back-end low-resolution classifier. After downsampling, a 28b × 28b image with perturbation of norm bϵ becomes a 28 × 28 image with perturbation of norm at most ϵ.
Whenever the high-resolution classifier is fooled, it is only because the back-end classifier is fooled by a perturbation of size at most ϵ, and this happens with probability at least p.
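The norm bookkeeping in this downsampling step can be checked numerically: averaging b × b pixel blocks shrinks the ℓ2-norm of a perturbation by at least a factor of b (a consequence of the Cauchy-Schwarz inequality), so a perturbation of norm bϵ on the large image induces a perturbation of norm at most ϵ on the downsampled image. The random array below is a stand-in for a real attack perturbation, and the block size is an arbitrary choice.

```python
# Block-averaging a 28b x 28b perturbation reduces its l_2 norm by at least a factor b.
import numpy as np

rng = np.random.default_rng(0)
b = 4
big_delta = rng.normal(size=(28 * b, 28 * b)) * 0.1           # perturbation of a big-MNIST image
small = big_delta.reshape(28, b, 28, b).mean(axis=(1, 3))     # average each b x b block
print(np.linalg.norm(small) <= np.linalg.norm(big_delta) / b + 1e-12)   # True
```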