# With Friends Like These, Who Needs Adversaries?

Saumya Jetley¹, Nicholas A. Lord¹,², Philip H.S. Torr¹,²
¹Department of Engineering Science, University of Oxford
²Oxford Research Group, Five AI Ltd.
{sjetley, nicklord, phst}@robots.ox.ac.uk

Abstract: The vulnerability of deep image classification networks to adversarial attack is now well known, but less well understood. Via a novel experimental analysis, we illustrate some facts about deep convolutional networks for image classification that shed new light on their behaviour and how it connects to the problem of adversaries. In short, the celebrated performance of these networks and their vulnerability to adversarial attack are simply two sides of the same coin: the input image-space directions along which the networks are most vulnerable to attack are the same directions which they use to achieve their classification performance in the first place. We develop this result in two main steps. The first uncovers the fact that classes tend to be associated with specific image-space directions. This is shown by an examination of the class-score outputs of nets as functions of 1D movements along these directions. This provides a novel perspective on the existence of universal adversarial perturbations. The second is a clear demonstration of the tight coupling between classification performance and vulnerability to adversarial attack within the spaces spanned by these directions. Thus, our analysis resolves the apparent contradiction between accuracy and vulnerability. It provides a new perspective on much of the prior art and reveals profound implications for efforts to construct neural nets that are both accurate and robust to adversarial attack.¹

1 Introduction

Those studying deep networks find themselves forced to confront an apparent paradox. On the one hand, there is the demonstrated success of networks in learning class distinctions on training sets that seem to generalise well to unseen test data. On the other, there is the vulnerability of the very same networks to adversarial perturbations that produce dramatic changes in class predictions despite being counter-intuitive or even imperceptible to humans. A common understanding of the issue can be stated as follows: while deep networks have proven their ability to distinguish between their target classes so as to generalise over unseen natural variations, they curiously possess an Achilles heel which must be defended. In fact, efforts to formulate attacks and counteracting defences of networks have led to a dedicated competition [1] and a body of literature already too vast to summarise in total.

In the current work we attempt to demystify this phenomenon at a fundamental level. We base our work on the geometric decision-boundary analysis of [2], which we reinterpret and extend into a framework that we believe is simpler and more illuminating with regards to the aforementioned paradoxical behaviour of deep convolutional networks (DCNs) for image classification. Through a fairly straightforward set of experiments and explanations, we clarify what it is that adversarial examples represent, and indeed, what it is that modern DCNs do and do not currently do. In doing so, we tie together work which has focused on adversaries per se with other work which has sought to characterise the feature spaces learned by these networks.

S. Jetley and N.A. Lord have contributed equally and assert joint first authorship.
¹Source code for replicating all experiments is provided at https://github.com/torrvision/whoneedsadversaries.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

[Figure 1 plots: the 'frog' class score against the perturbation scaling factor s (range −1500 to 1500) for CIFAR10-NiN, for a positively associated direction ('CIFAR10-NIN, Pos.', left) and a negatively associated direction ('CIFAR10-NIN, Neg.', right).]

Figure 1: Plots of the 'frog' class score F_frog(s | i_n, d_j, θ) for the Network-in-Network [3] architecture trained on CIFAR10, associated with two specific image-space directions d_1 and d_2 respectively. These directions are visualised as 2D images in the row below; the method of estimating them is explained in Sec. 3. Each plot corresponds to a randomly selected CIFAR10 test image i_n. Adding or subtracting components along d_1 causes the network to change its prediction to 'frog': as can be seen, a deer with a mild diamond striping added to it gets classified as a frog. This happens with little regard for the choice of input image i_n itself. Likewise, perturbations along d_2 change any frog to a non-frog class: notice the predicted labels for the sample images along the red curve in the second plot. These class-transition phenomena are predicted by the framework developed in this paper. While simplistic functions along directions d_1 and d_2 are used by the network to accomplish the task of classification, perturbations along the very same directions constitute adversarial attacks.

Let ǐ represent a vectorised input image and ī the average vector-image over a given dataset. Then, the mean-normalised version of the dataset is denoted by I = {i_1, i_2, ..., i_N}, where the nth image i_n = ǐ_n − ī. We define the perturbation of the image i_n in the direction d_j as i_n ← i_n + s·d̂_j, where s is the perturbation scaling factor and d̂_j is the unit-norm vector in the direction d_j. The image is fed through a network parameterised by θ, and the output score for a specific class c (the logit, i.e. the output of the layer just before the softmax operation) is given by F_c(i | θ). This class-score function can be rewritten as F_c(i_n + s·d̂_j | θ), which we equivalently denote by F_c(s | i_n, d_j, θ). Our work examines the nature of F_c as a function of movement s in specific image-space directions d_j, starting from randomly sampled natural images i_n, for a variety of classification DCNs. With this novel analysis, we uncover three noteworthy observations about these functions that relate directly to the phenomenon of adversarial vulnerability in these nets, all of which are on display in Fig. 1. We now discuss these observations in more detail. Before we begin, we note that these directions d_j are obtained via the method explained in Sec. 3 and by design exhibit either positive or negative association with a specific class. In Fig. 1 we study two such directions for the 'frog' class: similar directions exist for all other classes.

Firstly, notice that the score of the corresponding class c ('frog', in this case) as a function of s is often approximately symmetrical about some point s_0, i.e. F_c(s_0 + s | i_n, d_j, θ) ≈ F_c(s_0 − s | i_n, d_j, θ) for all s, and monotonic in both half-lines. This means that simply increasing the magnitude of correlation between the input image and a single direction causes the net to believe that more (or less) of the class c is present. In other words, the image-space direction sends all images either towards or away from the class c. In the former scenario, the direction represents a class-specific universal adversarial perturbation (UAP).
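Concretely, the 1D sweeps underlying Fig. 1 and the observations in this section amount to a few lines of code. The following is a minimal sketch (ours, not the released implementation), assuming a PyTorch classifier whose forward pass returns logits; `model`, `image` and `direction` are placeholder names.

```python
import torch

def class_score_sweep(model, image, direction, class_idx, scales):
    """Trace F_c(s | i_n, d_j, theta): the logit of class `class_idx` as the
    mean-normalised image i_n is perturbed along the unit-norm direction d_j.

    image:     tensor of shape (C, H, W), already mean-normalised
    direction: tensor of the same shape (normalised to unit l2-norm below)
    scales:    iterable of perturbation scaling factors s
    """
    d_hat = direction / direction.norm()            # unit-norm d_j
    model.eval()
    scores = []
    with torch.no_grad():
        for s in scales:
            perturbed = image + s * d_hat           # i_n + s * d_hat
            logits = model(perturbed.unsqueeze(0))  # assumes the forward pass returns logits
            scores.append(logits[0, class_idx].item())
    return scores

# e.g. sweep the range shown in Fig. 1 (hypothetical `net`, `img`, `d_j`):
# curve = class_score_sweep(net, img, d_j, class_idx=6,
#                           scales=torch.linspace(-1500, 1500, 61))
```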
Second, let i_d = i·d̂ (the scalar component of i along d̂), and let i_d⊥ be the projection of i onto the space normal to d̂, such that i_d⊥ = i − i_d·d̂. Then, our results illustrate that there exists a basis of image space containing d̂ such that the class-score function is approximately additively separable, i.e. F_c(i | θ) = F_c([i_d, i_d⊥] | θ) ≈ G(i_d) + H(i_d⊥) for some functions G and H. This means that the directions under study can be used to alter the net's predictions almost independently of each other. However, despite these facts, their 2D visualisation reveals low-level structures that are devoid of a clear semantic link to the associated classes, as shown in Fig. 1. Thus, we demonstrate that the learned functions encode a more simplistic notion of class identity than DCNs are commonly assumed to represent, albeit one that generalises to the test distribution to an extent. Unsurprisingly, this does not align with the way in which the human visual system makes use of these data dimensions: adversarial vulnerability is simply the name given to this disparity and the phenomena derived from it, with the universal adversarial perturbations of [4] being a particularly direct example of this fact.

Finally, we show that nets' classification performance and adversarial vulnerability are inextricably linked by the way they make use of the above directions, on a variety of architectures. Consequently, efforts to improve robustness by suppressing nets' responses to components in these directions (e.g. [5]) cannot simultaneously retain full classification accuracy. The features and functions thereof that DCNs currently rely on to solve the classification problem are, in a sense, their own worst adversaries.

2 Related Work

2.1 Fundamental developments in attack methods

Szegedy et al. coined the term 'adversarial example' in [7], demonstrating the use of box-constrained L-BFGS to estimate a minimal ℓ2-norm additive perturbation to an input image to cause its label to change to a target class while keeping the resulting image within intensity bounds. Strikingly, they locate a small-norm (imperceptible) perturbation at every point, for every network tested. Further, the adversaries thus generated are able to fool nets trained differently to one another, even when trained with different subsets of the data. Goodfellow et al. [8] subsequently proposed the fast gradient sign method (FGSM) to demonstrate the effectiveness of the local-linearity assumption in producing the same result, calculating the gradient of the cost function and perturbing with a fixed-size step in the direction of its sign (optimal under the linearity assumption and an ℓ∞-norm constraint). The DeepFool method of Moosavi-Dezfooli et al. [9] retains the first-order framework of FGSM, but tailors itself precisely to the goal of finding the perturbation of minimum norm that changes the class label of a given natural image to any label other than its own. Through iterative attempts to cross the nearest (linear) decision boundary by a tiny margin, this method records successful perturbations with norms that are even smaller than those of [8]. In [4], Moosavi-Dezfooli & Fawzi et al. propose an iterative aggregation of DeepFool perturbations that produces universal adversarial perturbations: single images which function as adversaries over a large fraction of an entire dataset for a targeted net.
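For reference, the FGSM update just described reduces to a single gradient-sign step. The sketch below is our illustration (not code from [8]), assuming a PyTorch classifier trained with cross-entropy; `model`, `image`, `label` and `epsilon` are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm(model, image, label, epsilon):
    """Single-step fast gradient sign attack in the spirit of [8]: take a
    fixed-size step in the direction of the sign of the loss gradient with
    respect to the input (intensity clipping omitted for brevity)."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image.unsqueeze(0)), label.view(1))
    grad, = torch.autograd.grad(loss, image)
    return (image + epsilon * grad.sign()).detach()
```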
While these perturbations are typically much larger than individual DeepFools, they do not correspond to human perception, and indicate that there are fixed image-space directions along which nets are vulnerable to deception independently of the image-space locations at which they are applied. They also demonstrate some generalisation over network architectures. Sabour & Cao et al. [10] pose an interesting variant of the problem: instead of 'label adversaries', they target 'feature adversaries' which minimise the distance from a particular guide image in a selected network feature space, subject to a constraint on the ℓ∞-norm of the image-space distance from a source image. Despite this constraint, the adversarial image mimics the guide very closely: not only is it nearly always assigned to the guide's class, but it appears to be an inlier with respect to the guide-class distribution in the chosen feature space. Finally, while adversaries are conceived of as small perturbations applied to natural images such that the resulting images are still recognisable to humans, the 'fooling images' of Nguyen et al. [11] are completely unrecognisable to humans and yet confidently predicted by deep networks to be of particular classes. Such images are easily obtained by both evolutionary algorithms and gradient ascent, under direct encoding of pixel intensities (appearing to consist mostly of noise) and under CPPN [12]-regularised encoding (appearing as abstract mid-level patterns).

2.2 Analysis of adversarial vulnerability and proposed defences

In [13], Wang et al. propose a nomenclature and theoretical framework with which to discuss the problem of adversarial vulnerability in the abstract, agnostic of any actual net or attack thereof. They posit an oracle relative to whose judgement robustness and accuracy must be assessed, and illustrate that a classifier can only be both accurate and robust (invulnerable to attack) relative to its oracle if it learns to use exactly the same feature space that the oracle does. Otherwise, a network is vulnerable to adversarial attack in precisely the directions in which its feature space departs from that of the oracle. Under the assumption that a net's feature space contains some spurious directions, Gao et al. [5] propose a subtractive scheme of suppressing the neuronal activations (i.e. feature responses) which change significantly between the natural and adversarial inputs. Notably, the increase in robustness is accompanied by a loss of performance accuracy. An alternative to network feature suppression is the compression of input image data, explored in e.g. [14, 15, 16].

Goodfellow et al. [8] hypothesise that the high dimensionality and excessive linearity of deep networks explain their vulnerability. Tanay and Griffin [17] begin by taking issue with the above via illustrative toy problems. They then advance an explanation based on the angle of intersection of the separating boundary with the data manifold, which rests on overfitting and calls for effective regularisation, which they note is neither solved nor known to be solvable for deep nets. A variety of training-based methods [8, 18, 19, 20] are proposed to address the premise of the preceding analyses. Hardening methods [8, 18] investigate the use of adversarial examples to train more robust deep networks. Detection-based methods [19, 20] view adversaries as outliers to the training-data distribution and train detectors to identify them as such in the intermediate feature spaces of nets.
Notably, these methods [19, 20] have not been evaluated on the feature adversaries of Sabour & Cao et al. [10]. Further, data-augmentation schemes such as that of Zhang et al. [21], wherein convex combinations of input images are mapped to convex combinations of their labels, attempt to enable the nets to learn smoother decision boundaries. While their approach [21] offers improved resistance to single-step gradient-sign attacks, it is no more robust to iterative attacks of the same type.

Over the course of the line of work in [2], [22], [23], and [24], the authors build up an image-space analysis of the geometry of deep networks' decision boundaries, and its connection with adversarial vulnerability. In [23], they note that the DeepFool perturbations of [9] tend to evince relatively high components in the subspace spanned by the directions in which the decision boundary has a high curvature. Also, the sign of the mean curvature of the decision boundary in the vicinity of a DeepFooled image is typically reversed with respect to that of the corresponding natural image, which provides a simple scheme to identify and undo the attack. They conclude that a majority of image-space directions correspond to near-flatness of the decision boundary and are insensitive to attack, but along the remaining directions, those of significant curvature, the network is indeed vulnerable. Further, the directions in question are observed to be shared over sample images. They illustrate in [2] why a hypothetical network which possessed this property would theoretically be predicted to be vulnerable to universal adversaries, and note that the analysis suggests a direct construction method for such adversaries as an alternative to the original randomised iterative approach of [4]: they can be constructed as random vectors in the subspace of shared high-curvature dimensions.

3 Curvature analysis of decision boundaries

The analysis begins as in [2], with the extraction of the principal directions and principal curvatures of the classifier's image-space class decision boundaries. Put simply, a principal direction vector and its associated principal curvature tell you how much a surface curves as you move along it in a particular direction, from a particular point. Now, it takes many decision boundaries to characterise the classification behaviour of a multiclass net: C(C−1)/2 for a C-class classifier. However, in order to understand the boundary properties that are useful for discriminating a given class from all others, it should suffice to analyse only the C one-vs.-all decision boundaries. Thus, for each class c, the method proceeds by locating samples very near to the decision boundary (F_c − F_ĉ) = 0 between c and the union of remaining classes ĉ ≠ c. In practice, for each sample, this corresponds to the decision boundary between c and the closest neighbouring class, which is arrived at by perturbing the sample from the latter ('source') to the former ('target'). Then, the geometry of the decision boundary is estimated as outlined in Alg. 1 below (for more discussion about the implementation and associated concepts, refer to the supplementary material), closely following the approach of [2]:

Algorithm 1: Computes mean principal directions and principal curvatures for a net's image-space decision surface.
    Input: network class-score function F, dataset I = {i_1, i_2, ..., i_N}, target class label c
    Output: principal-curvature basis matrix V_b and corresponding principal-curvature-score vector v_s
    procedure PRINCIPALCURVATURES(F, I, c)
        H ← null
        for each sample i_n ∈ I s.t. argmax_k(F_k(i_n)) ≠ c do
            ĉ ← argmax_k(F_k(i_n))                ▷ network predicts i_n to be of class ĉ
            H_cĉ: define as the Hessian of the function (F_c − F_ĉ)    ▷ subscripts select class scores
            ĩ_n ← DEEPFOOL(i_n, c)                ▷ approximate nearest boundary point to i_n
            H ← H + H_cĉ(ĩ_n)                     ▷ accumulate Hessian at sample boundary point
        H ← H / |I|                               ▷ normalise mean Hessian by the number of samples
        (V_b, v_s) ← EIGS(H)                      ▷ compute eigenvectors and eigenvalues of the mean Hessian
        return (V_b, v_s)
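As one concrete reading of Alg. 1, the sketch below (ours, not the released implementation) estimates the mean Hessian of (F_c − F_ĉ) by central finite differences of autograd gradients, which is the natural choice for piecewise-linear nets whose exact second derivatives vanish almost everywhere. For tractability it restricts the estimate to the span of a supplied matrix of probe directions; `deepfool_to_class` (returning an approximate nearest boundary point), `probe_dirs` and the step `h` are our placeholder assumptions, and Alg. 1 itself is stated over the full image space.

```python
import torch

def grad_of_score_diff(model, x, c, c_hat):
    """Gradient of (F_c - F_chat) with respect to the (flattened) input."""
    x = x.clone().detach().requires_grad_(True)
    out = model(x.unsqueeze(0))[0]
    g, = torch.autograd.grad(out[c] - out[c_hat], x)
    return g.reshape(-1)

def principal_curvature_directions(model, images, target_c, deepfool_to_class,
                                   probe_dirs, h=1.0):
    """Sketch of Alg. 1. The mean Hessian of (F_c - F_chat) is estimated only
    within the span of `probe_dirs` (columns, each of length C*H*W), by central
    finite differences of autograd gradients at each approximate boundary
    point; eigendecomposition then yields directions and curvature scores."""
    n, m = probe_dirs.shape
    H_proj, count = torch.zeros(m, m), 0
    model.eval()
    for img in images:                                        # img: (C, H, W)
        c_hat = model(img.unsqueeze(0))[0].argmax().item()
        if c_hat == target_c:                                 # keep samples not already of class c
            continue
        x_b = deepfool_to_class(model, img, target_c)         # approx. nearest boundary point
        cols = []
        for j in range(m):
            d = probe_dirs[:, j].view_as(img)
            g_plus = grad_of_score_diff(model, x_b + h * d, target_c, c_hat)
            g_minus = grad_of_score_diff(model, x_b - h * d, target_c, c_hat)
            cols.append(probe_dirs.t() @ ((g_plus - g_minus) / (2 * h)))
        H_proj += torch.stack(cols, dim=1)                    # accumulate per-sample Hessian estimate
        count += 1
    H_proj = 0.5 * (H_proj + H_proj.t()) / max(count, 1)      # symmetrised mean
    curvature_scores, coeffs = torch.linalg.eigh(H_proj)      # eigenvalues in ascending order
    return probe_dirs @ coeffs, curvature_scores              # image-space directions V_b, scores v_s
```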
The authors of [2] advance a hypothesis connecting positively curved directions with the universal adversarial perturbations of [4]. Essentially, they demonstrate that if the normal section of a net's decision surface along a given direction can be locally bounded on the outside by a circular arc of a particular positive curvature in the vicinity of a sample image point, then geometry accordingly dictates an upper bound on the distance between that point and the boundary in that direction. If such directions and bounds turn out to be largely common across sample image points (which they do), then the existence of universal adversaries follows directly, with higher curvature implying lower-norm adversaries. This argument is depicted visually in the supplementary material.
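To make the geometry explicit (an idealised calculation of ours under the stated assumptions, not an equation reproduced from [2]): in a 2D normal section, let B be the nearest boundary point to a sample x, at distance r along the boundary normal, and let the bounding arc of radius R = 1/κ curve around x, so that its centre lies at distance R − r from x (taking r < R). Moving from x along a unit direction u orthogonal to the normal, the arc is first reached at

```latex
\[
  t^{2} + (R - r)^{2} = R^{2}
  \quad\Longrightarrow\quad
  t = \sqrt{2Rr - r^{2}} = \sqrt{\tfrac{2r}{\kappa} - r^{2}} \;\le\; \sqrt{\tfrac{2r}{\kappa}} .
\]
```

Since the arc bounds the boundary on the outside, a perturbation of norm t along u is guaranteed to cross the boundary; a larger curvature κ therefore yields a smaller bound, and if the same direction and bound hold across sample points, a single perturbation of that norm acts universally.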
It is from this point that we move beyond the prior art and begin an iterative loop of further analysis, experimentation, and demonstration, as follows.

4 Experiments and Analysis

Provided only that the second-order boundary approximation holds up well over a sufficiently wide perturbation range and variety of images, the model implies that the distance of such adversaries from the decision boundary should increase as a function of their norm. Also, the attack along any positively curved direction should in that case be associated with the corresponding target class: the class c in the call to Alg. 1. And while positively curved directions may be of primary interest in [2], the extension of the above geometric argument to the negative-curvature case points to an important corollary: as sufficient steps along positive-curvature directions should perturb increasingly into class c, so should steps along negative-curvature directions perturb increasingly away from class c. Finally, the plethora of approximately zero-curvature (flat) directions identified in [23, 2] should have negligible effect on class identity.

[Figure 2 plots: for each of 1. MNIST-LeNet (target class 9), 2. CIFAR10-NiN (target class 7), 3. CIFAR10-AlexNet (target class 5) and 4. CIFAR100-VGG (target class 2), the target-class and highest non-target scores, the sample proportion, and the median softmax score are plotted against the scaling factor of the perturbation along the most positively curved direction.]

Figure 2: Selected class scores plotted as functions of the scaling factor s of the perturbation along the most positively curved direction per net. The 'Median class score' plot compares the score of a randomly selected target class with the supremum of the scores for the non-target classes. Each curve represents the median of the class scores over the associated dataset, bracketed below by the 30th-percentile score and above by the 70th. The 'Transition into target class' plot depicts the fraction of the dataset not originally of the target class, but which is transitioned into the target class by the perturbation. Alongside, we graph that population's median softmax target-class score. The black dashed line represents the fraction of the population originally of the target class that remains in the target class under the perturbation. The image grid on the right illustrates the 2D visualisations of the two most-positively curved directions for randomly selected target classes: the columns correspond, from left to right, with the four net-dataset pairs under study. To observe class scores as functions of the norms of the perturbations along the most negatively curved and flat directions, refer to the supplement.

4.1 Class identity as a function of the component in specific image-space directions

To test how well the above conjectures hold in practice, we graph statistics of the target and non-target class scores over the dataset as a function of the magnitude of the perturbation applied in directions identified as above. The results are depicted in Fig. 2, in which the predicted phenomena are readily evident. Along the selected positive-curvature directions, as the perturbation magnitude increases (with either sign), the population's target-class score approaches and then surpasses the highest non-target class score. The monotonicity of this effect is laid bare by graphing the fraction of non-target samples perturbed into the target class, alongside the median target-class softmax score. Note, again, that the link between the directions in question and the target class identity is established a priori by Alg. 1. We continue in the supplementary material and show that, as predicted, the same phenomenon is evident in reverse when using negative-curvature directions instead. All that changes is that it is the population's non-target class scores that overtake its target-class score with increasing perturbation norm, with natural samples of the target class accordingly being perturbed out of it. We also illustrate the point that flatness of the decision boundary manifests as flatness of both target and non-target class scores: over a wide range of magnitudes, these directions do not influence the network in any way.

While Fig. 2 illustrates these effects at the level of the population, Fig. 1 shows a disaggregation into individual sample images, with one response curve per sample from a large set. The population-level trends remain evident, but another fact becomes apparent: empirically, the shapes of the curves change very little between most samples. They shift vertically to reflect the class-score contribution of the orthonormal components, but they themselves do not otherwise much depend on those components. That is to say that at least some key components are approximately additively separable from one another. This fact connects directly to the fact that such directions are shared across samples in the first place, and thus identifiable by Alg. 1.

A more intuitive picture of what the networks are actually doing begins to emerge: they are identifying the high-curvature image-space directions as features associated with respective class identities, with the curvature magnitude representing the sensitivity of class identity to the presence of that feature. But if this is true, it suggests that what we have thus identified are actually the directions which the net relies on generally in predicting the classes of natural images, with the curvatures-cum-sensitivities representing their relative weightings. Accordingly, it should be possible to disregard the flat directions of near-zero curvature without any noticeable change in the network's class predictions.
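The population-level statistic of Fig. 2 can be sketched in the same style as the earlier per-image sweep (again our illustration with placeholder names, assuming a PyTorch classifier returning logits and a batch of mean-normalised images):

```python
import torch

def transition_into_target(model, images, direction, target_c, scales):
    """For each scaling factor s: the fraction of samples not originally
    predicted as `target_c` that the perturbation pushes into `target_c`,
    and the median softmax target-class score over that population
    (cf. the 'Transition into target class' panels of Fig. 2)."""
    d_hat = direction / direction.norm()
    model.eval()
    with torch.no_grad():
        base_pred = model(images).argmax(dim=1)       # images: (N, C, H, W)
        non_target = images[base_pred != target_c]    # population not of the target class
        fractions, med_scores = [], []
        for s in scales:
            logits = model(non_target + s * d_hat)
            fractions.append((logits.argmax(dim=1) == target_c).float().mean().item())
            med_scores.append(logits.softmax(dim=1)[:, target_c].median().item())
    return fractions, med_scores
```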
4.2 Network classification performance versus effective data dimensionality

To confirm the above hypothesis regarding the relative importance of different image-space directions for classification, we plot the training and test accuracies of a sample of nets as a function of the subspace onto which their input images are projected. The input subspace is parametrised by a dimensionality parameter d, which controls the number of basis vectors selected per class. We use four variants of selection: the d most positively curved directions per class (yielding the subspace S_pos); the d most negatively curved directions per class (yielding the subspace S_neg); the union of the previous two (subspace S_neg∪pos); and the d least curved (flattest) directions per class (subspace S_flat). The subspace S so obtained is represented by the orthonormalised basis matrix Q_d (obtained by QR decomposition of the aggregated directions), and each input image i is then projected onto S as i_d = Q_d Q_d^T i. (The mean training-set orthogonal component (I − Q_d Q_d^T) ī can be added back, but is approximately 0 in practice for data normalised by mean subtraction, as is the case here.) Accuracies on {i_d} as a function of d are shown in the top row of Fig. 3.

The outcome is striking: it is evident that in many cases, classification decisions have effectively already been made based on a relatively small number of features, corresponding to the most curved directions. The sensitivity of the nets along these directions, then, is clearly learned purposefully from the training data, and does largely generalise in testing, as seen. Note also that at this level of analysis, it essentially does not matter whether positively or negatively curved directions are chosen. Another important point emerges here. Since it is the high-curvature directions that are largely responsible for determining the nets' classification decisions, the nets should be vulnerable to adversarial attack along precisely these directions.

4.3 Link between classification and adversarial directions

It has already been noted in [23] that adversarial attack vectors evince high components in subspaces spanned by high-curvature directions. We expand the analysis by repeating the procedure of Sec. 4.2 for various attack methods, to determine whether existing attacks are indeed exploiting the directions in accordance with the classifier's reliance on them. Results are displayed in the bottom row of Fig. 3, and should be compared against the row above. The graphs in these figures illustrate the direct relationship between the fraction of adversarial norm in given subspaces and the corresponding usefulness of those subspaces for classification. The inclusion of the saliency images of [25] alongside the attack methods makes explicit the fact that adversaries are themselves an exposure of the net's notion of saliency.
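The projection machinery shared by Secs. 4.2 and 4.3 can be sketched as follows (our illustration with placeholder names, assuming the retained per-class directions have been stacked as the rows of a matrix): orthonormalise the aggregated directions by QR decomposition, project flattened images or perturbations as i_d = Q_d Q_d^T i, and, for the bottom row of Fig. 3, report the retained fraction of each perturbation's ℓ2-norm.

```python
import torch

def subspace_basis(directions):
    """Orthonormal basis Q_d for the span of the aggregated per-class
    directions (rows of `directions`, each of length C*H*W)."""
    Q, _ = torch.linalg.qr(directions.t())     # columns of Q span the subspace
    return Q

def project(Q, vectors):
    """Project flattened vectors (rows) onto the subspace: v_d = Q_d Q_d^T v."""
    return (vectors @ Q) @ Q.t()

def norm_fraction(Q, perturbations):
    """Fraction of each perturbation's l2-norm retained in the subspace
    (cf. the bottom row of Fig. 3)."""
    return project(Q, perturbations).norm(dim=1) / perturbations.norm(dim=1)
```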
By now, two results hint at a simpler and more direct way of identifying bases of classification/adversarial directions. First, a close inspection of the class-score curves sampled and displayed in Fig. 1 reveals a direct connection between the curvature of a direction near the origin and its derivative magnitude over a fairly large interval around it. Second, this observation is made clearer in Fig. 3, where it can be seen that the directions obtained by boundary-curvature analysis in Alg. 1 correspond to the directions exploited by various first-order methods. Thus, we hypothesise that to identify such a basis, one need actually only perform SVD on a matrix of stacked class-score gradients. (In fact, this analysis is begun in [4], but only the singular values are examined.) Here, we implement this using a collection of DeepFool perturbations to provide the required gradient information, and repeat the analysis of Sec. 4.2, using singular values to order the vectors. The results, in Fig. 4, neatly replicate the previously seen classification-accuracy trends for high-to-low and low-to-high curvature traversal of image-space directions. Henceforth, we use these directions directly, simplifying analysis and allowing us to analyse ImageNet networks.

[Figure 3 plots: top row, train and test accuracy against d for the subspaces S_neg, S_pos, S_neg∪pos and S_flat on MNIST-LeNet, CIFAR10-NiN, CIFAR10-AlexNet and CIFAR100-VGG; bottom row, mean projected DeepFool, FGSM, UAP and saliency-map norms in S_neg and S_flat on MNIST-LeNet, CIFAR10-NiN, CIFAR10-AlexNet and CIFAR100-NiN (as labelled).]

Figure 3: Top row: Training and test classification accuracies for various DCNs on image sets projected onto the subspaces described in Sec. 4.2, as a function of their dimensionality parameter d (from 0 until the input space is fully spanned). The principal directions defining the subspaces are obtained by applying Alg. 1 once for each possible choice of target class c and retaining d directions per class. Note the relationship between the ordering of curvature magnitudes and classification accuracy by comparing the S_flat curves to the others. Bottom row: Mean ℓ2-norms of various adversarial perturbations (DeepFool [9], FGSM [8] and UAP [4]) and saliency maps [25] when projected onto the same subspaces as above, as a fraction of their original norms.

[Figure 4 plots: test accuracy against d for the subspaces S_hi and S_lo on MNIST-LeNet, CIFAR10-NiN, CIFAR10-AlexNet, CIFAR100-VGG and IMAGENET-AlexNet; the ImageNet panel shows curves for downsampling sizes 100 and 120, with the values 49.37 and 51.31 marked.]

Figure 4: Classification accuracies on image sets projected onto subspaces of the spans of their corresponding DeepFool perturbations. For each net-dataset pair, DeepFool perturbations are computed over the image set and assembled into a matrix that is decomposed into its SVD. The singular vectors are ordered as per their singular values: S_hi represents the high-to-low ordering, S_lo the low-to-high, and d the number of vectors retained. Compare this figure to Fig. 3 (while noticing how d now counts the total number of directions). For the ImageNet experiments, owing to memory constraints, the SVD is performed on downsampled DeepFools of size 100×100×3 and 120×120×3, respectively. The resulting singular vectors span the entire effective classification space of correspondingly downsampled images. This is evinced by the fact that the classification accuracy of images projected onto the singular vectors' subspace saturates to the same performance as that yielded when the net is tested directly on the downsampled images.
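The SVD shortcut just described reduces, in sketch form, to a few lines (our illustration, assuming the DeepFool perturbations have been computed elsewhere and flattened into the rows of a matrix; names are placeholders):

```python
import torch

def deepfool_svd_basis(perturbations):
    """SVD of a stack of flattened DeepFool perturbations (one per row).
    The rows of Vh are orthonormal and ordered high-to-low by singular value,
    so leading rows give S_hi and trailing rows give S_lo (cf. Fig. 4)."""
    _, svals, Vh = torch.linalg.svd(perturbations, full_matrices=False)
    return svals, Vh

# e.g. a d-dimensional S_hi basis: Q_hi = Vh[:d].t()
#      and S_lo:                   Q_lo = Vh[-d:].t()
# (either can be passed to `project` above in place of a QR-derived Q_d)
```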
[Figure 5 plots: mean ℓ2-norms against d, with legend entries 'Confined DeepFool' and 'Projected Image', for panels including MNIST-LeNet, CIFAR100-AlexNet and IMAGENET-AlexNet.]

Figure 5: Blue curves depict the mean ℓ2-norms of 'confined DeepFool' perturbations: those that are calculated under strict confinement to the respective subspaces of Fig. 4, also detailed in Sec. 4.3. Note the differences in scale of the y-axes of the different plots. For MNIST and CIFAR, we also plot (in red) the mean norms of the projections of the input images onto those subspaces: observe the inverse relationship between the two curves. The columns on the right visualise, from top to bottom, sample images at the indicated points on the curves in the CIFAR100-AlexNet plots, from left to right: blue-bordered images represent confined DeepFool perturbations (rescaled for display), with their red-bordered counterparts displaying the projection of the corresponding sample CIFAR image onto the same subspace. Observe that when the human-recognisable object appearance is captured in any given subspace, the corresponding DeepFool perturbation becomes maximally effective (i.e. small-norm). Likewise, when the projected image is not readily recognisable to a human, the DeepFool perturbation is large. The feature space per se does not account for adversariality: the issue is in the net's response to the features.

While Fig. 3 displays the magnitudes of components of pre-computed adversarial perturbations in different subspaces, we also design a variation on the analysis to illustrate how effective an efficient attack method (DeepFool) is when confined to the respective subspaces. This is implemented by simply projecting the gradient vectors used in solving DeepFool's linearised problem onto each subspace before otherwise solving the problem as usual. The results, displayed in Fig. 5, thus represent DeepFool's earnest attempts to attack the network as efficiently as possible within each given subspace. It is evident that the attack must exploit genuine classification directions in order to achieve low norm.

| d_low | E_n{ℓ2-norm(i_n)} | E_n{ℓ2-norm(δi_n)} | Accuracy (%) | Fooling rate (%), f = 1 | f = 2 | f = 3 | f = 4 | f = 5 | f = 10 |
|-------|-------------------|--------------------|--------------|-------------------------|-------|-------|-------|-------|--------|
| 227   | 26798.72          | 63.96              | 57.75        | 100.00                  | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| 200   | 26515.20          | 53.19              | 55.80        | 32.75                   | 77.25 | 88.95 | 92.20 | 94.35 | 97.65  |
| 150   | 26327.03          | 46.86              | 53.50        | 35.55                   | 58.35 | 77.90 | 85.95 | 89.25 | 95.65  |
| 120   | 26159.98          | 41.92              | 51.75        | 36.15                   | 49.80 | 66.20 | 76.90 | 82.95 | 92.90  |
| 100   | 26008.02          | 37.98              | 48.10        | 41.65                   | 49.25 | 59.95 | 68.05 | 74.80 | 88.30  |

Table 1: The images i_n used to train AlexNet operate at the scale of d_orig = 227 (pixels on a side). In the pre-processing step, these images are downsized to d_low, before being upsampled back to the original scale. The reconstructed DeepFool perturbations δi_n lose some of their effectiveness, as seen in the fooling-rate column for f = 1. When the effect of downsampling is countered by increasing the value of the ℓ2-norms of these perturbations (using higher values of f), their efficacy is steadily restored. Note that the mean norms of images and perturbations are estimated in the upscaled space, as are the classification accuracies. The accuracy values for d_low ∈ {100, 120} should be compared to those at convergence in Fig. 4. Any difference in the performance scores is strictly due to the random selection of the subset of 2000 test images used for evaluation.
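One way to realise the subspace-confined attack of Fig. 5 is sketched below: a single linearised DeepFool-style step [9] in which every class-score gradient is projected onto the chosen subspace (columns of Q) before the minimal boundary-crossing step is computed. This is our illustrative reading of the procedure described above, not the released implementation; `overshoot` and the other names are placeholders.

```python
import torch

def confined_deepfool_step(model, x, Q, num_classes, overshoot=0.02):
    """One linearised DeepFool-style step [9] with every class-score gradient
    projected onto the subspace spanned by the columns of Q, so that the
    resulting perturbation is confined to that subspace (cf. Fig. 5). The
    full attack would iterate such steps until the predicted label changes."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x.unsqueeze(0))[0]
    k0 = logits.argmax().item()                      # current predicted class

    grads = []
    for k in range(num_classes):                     # input-gradient of every class score
        g, = torch.autograd.grad(logits[k], x, retain_graph=True)
        grads.append(g.reshape(-1))

    best_ratio, best_r = None, None
    for k in range(num_classes):
        if k == k0:
            continue
        w = grads[k] - grads[k0]                     # boundary normal under linearisation
        w = Q @ (Q.t() @ w)                          # confine the direction to the subspace
        if w.norm() < 1e-12:
            continue
        f = abs((logits[k] - logits[k0]).item())     # score gap to class k
        ratio = f / w.norm().item()                  # distance to the linearised boundary
        if best_ratio is None or ratio < best_ratio:
            best_ratio = ratio
            best_r = (f / w.norm() ** 2) * w         # minimal confined step to that boundary
    if best_r is None:                               # no usable direction in the subspace
        return x.detach()
    return (x + (1 + overshoot) * best_r.view_as(x)).detach()
```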
4.4 On image compression and robustness to adversarial attack

The above observations have made it clear that the most effective directions of adversarial attack are also the directions that contribute the most to the DCNs' classification performance. Hence, any attempt to mitigate adversarial vulnerability by discarding these directions, either by compressing the input data [14, 15, 16] or by suppressing specific components of image representations at intermediate network layers [5], must effect a loss in classification accuracy. Further, our framework anticipates the fact that the nets must remain just as vulnerable to attack along the remaining directions that continue to determine classification decisions, given that the corresponding class-score functions, which possess the properties discussed earlier, remain unchanged.

We use image downsampling as an example data-compression technique to illustrate this effect on ImageNet. We proceed by inserting a pre-processing unit between the DCN and its input at test time. This unit downsamples the input image i_n to a lower size d_low before upsampling it back to the original input size d_orig. The resizing (by bicubic interpolation) serves to reduce the effective dimensionality of the input data. For a randomly selected set of 2000 ImageNet [26] test images, we observe the change in classification accuracy over different values of d_low, shown in column 4 of Table 1. The fooling rates (measured as the percentage of samples from the dataset that undergo a change in their predicted label) for the downsampled versions of these natural images' adversarial counterparts, produced by applying DeepFool to the original network (without the resampling unit), follow in column 5 of the table. At first glance, it appears that the downsampling-based pre-processing unit has afforded an increase in network robustness at a moderate cost in accuracy. Results pertaining to this tradeoff have been widely reported [14, 15, 5].
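The pre-processing unit amounts to a resize-down/resize-up operation. A minimal sketch using bicubic interpolation follows (our placeholder names; the exact resizing routine used in the experiments is not specified here beyond the bicubic interpolation stated above):

```python
import torch
import torch.nn.functional as F

def resampling_unit(x, d_low, d_orig=227):
    """Test-time pre-processing of Sec. 4.4: downsample the input batch to
    d_low pixels per side and upsample back to the network's native input
    size, reducing the effective dimensionality of the data. Inputs are
    assumed to be of shape (N, 3, 227, 227)."""
    x = F.interpolate(x, size=(d_low, d_low), mode='bicubic', align_corners=False)
    return F.interpolate(x, size=(d_orig, d_orig), mode='bicubic', align_corners=False)

# defended forward pass (hypothetical `net`): logits = net(resampling_unit(batch, d_low=120))
```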
Here, we take the analysis a step further. To start, we note the fact that the methodology just described represents a transfer attack from the original net to the net as modified by the inclusion of the resampling unit. As DeepFool perturbations δi_n are not designed to transfer in this manner, we first augment them by simply increasing their ℓ2-norm by a scalar factor f. We adjust f from unity up to a point at which the mean DeepFool perturbation norm is still a couple of orders of magnitude smaller than the mean image norm, such that the perturbations remain largely imperceptible. The corresponding fooling rates grow steadily with respect to f, as is observable in Table 1. Hence, although the original full-resolution perturbations may be suboptimal attacks on the resampling variants of the network (as some components are effectively lost to projection onto the compressed space), sufficient rescaling restores their effectiveness.

On the other hand, the modified net continues to be equally vulnerable along the remaining effective classification directions, and can easily be attacked directly. To go about this, we simply take the SVD of the stack of downsampled DeepFool perturbations, for d_low values of 100 and 120 (owing to memory constraints). The resulting singular vectors span the entire space of classification/adversarial directions of the corresponding resampling network, as can be seen from the accuracy values in the rightmost subplot of Fig. 4. More crucially, lower-norm DeepFools can be obtained by restricting the attack's iterative linear optimisation procedure to the space spanned by these compressed perturbations, exactly as described in Sec. 4.3 and displayed in Fig. 5. This subspace-confined optimisation is analogous to designing a white-box DeepFool attack for the new network architecture inclusive of the resampling unit, instead of the original network, and is as effective as before. Note that this observation is consistent with the results reported in [16], where the strength of the examined gradient-based attack methods increases progressively as the targeted model better approximates the defending model.

5 Conclusion

In this work, we expose a collection of directions along which a given net's class-score output functions exhibit striking similarity across sample images. These functions are nonlinear, but are de facto of a relatively constrained form: roughly axis-symmetric (though not necessarily so for MNIST, because of its constraints: see supplementary material) and typically monotonic over large ranges. We illustrate a close relationship between these directions and class identity: many such directions effectively encode the extent to which the net believes that a particular target class is or is not present. Thus, as it stands, the predictive power and adversarial vulnerability of the studied nets are intertwined owing to the fact that they base their classification decisions on rather simplistic responses to components of the input images in specific directions, irrespective of whether the source of those components is natural or adversarial. Clearly, any gain in robustness obtained by suppressing the net's response to these components must come at the cost of a corresponding loss of accuracy. We demonstrate this experimentally. We also note that these robustness gains may be lower than they appear, as the network actually remains vulnerable to a properly designed attack along the remaining directions it continues to use. A discussion including some nuanced observations and connections to existing work that follow from our study can be found in the supplementary material.

To conclude, we believe that for any scheme to be truly effective against the problem of adversarial vulnerability, it must lead to a fundamentally more insightful (and likely complicated) use of features than presently occurs. Until then, those features will continue to be the nets' own worst adversaries.

Acknowledgements. This work was supported by the ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1 and EPSRC/MURI grant EP/N019474/1. We would also like to acknowledge the Royal Academy of Engineering and Five AI, and extend our thanks to Seyed-Mohsen Moosavi-Dezfooli for providing his research code for curvature analysis of decision boundaries of DCNs.

References

[1] NIPS: 2017 competition on adversarial attacks and defenses. https://www.kaggle.com/nips-2017-adversarial-learning-competition (2017). Accessed: 2018-03-12.
[2] Moosavi-Dezfooli*, S.M., Fawzi*, A., Fawzi, O., Frossard, P., Soatto, S.: Robustness of classifiers to universal perturbations: A geometric perspective. In: International Conference on Learning Representations. (2018)
[3] Lin, M., Chen, Q., Yan, S.: Network in network.
In: International Conference on Learning Representations. (2013)
[4] Moosavi-Dezfooli*, S.M., Fawzi*, A., Fawzi, O., Frossard, P.: Universal adversarial perturbations. In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, IEEE (2017) 86–94
[5] Gao, J., Wang, B., Lin, Z., Xu, W., Qi, Y.: DeepCloak: Masking deep neural network models for robustness against adversarial samples. In: International Conference on Learning Representations. (2017)
[6] Zhao, Q., Griffin, L.D.: Suppressing the unusual: towards robust CNNs using symmetric activation functions. CoRR abs/1603.05145 (2016)
[7] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. In: International Conference on Learning Representations. (2014)
[8] Goodfellow, I., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: International Conference on Learning Representations. (2015)
[9] Moosavi-Dezfooli, S.M., Fawzi, A., Frossard, P.: DeepFool: a simple and accurate method to fool deep neural networks. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Number EPFL-CONF-218057 (2016)
[10] Sabour*, S., Cao*, Y., Faghri, F., Fleet, D.J.: Adversarial manipulation of deep representations. In: International Conference on Learning Representations. (2016)
[11] Nguyen, A., Yosinski, J., Clune, J.: Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 427–436
[12] Stanley, K.O.: Compositional pattern producing networks: A novel abstraction of development. Genetic Programming and Evolvable Machines 8(2) (2007) 131–162
[13] Wang, B., Gao, J., Qi, Y.: A theoretical framework for robustness of (deep) classifiers under adversarial noise. arXiv preprint arXiv:1612.00334 (2016)
[14] Maharaj, A.V.: Improving the adversarial robustness of convnets by reduction of input dimensionality (2015)
[15] Das, N., Shanbhogue, M., Chen, S., Hohman, F., Chen, L., Kounavis, M.E., Chau, D.H.: Keeping the bad guys out: Protecting and vaccinating deep learning with JPEG compression. CoRR abs/1705.02900 (2017)
[16] Xie, C., Wang, J., Zhang, Z., Ren, Z., Yuille, A.L.: Mitigating adversarial effects through randomization. CoRR abs/1711.01991 (2017)
[17] Tanay, T., Griffin, L.: A boundary tilting perspective on the phenomenon of adversarial examples. arXiv preprint arXiv:1608.07690 (2016)
[18] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: International Conference on Learning Representations. (2018)
[19] Lu, J., Issaranon, T., Forsyth, D.: SafetyNet: Detecting and rejecting adversarial examples robustly. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 446–454
[20] Metzen, J.H., Genewein, T., Fischer, V., Bischoff, B.: On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267 (2017)
[21] Zhang, H., Cisse, M., Dauphin, Y., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: International Conference on Learning Representations. (2018)
[22] Fawzi*, A., Moosavi-Dezfooli*, S.M., Frossard, P.: Robustness of classifiers: from adversarial to random noise. In: Advances in Neural Information Processing Systems. (2016) 1632–1640
[23] Fawzi*, A., Moosavi-Dezfooli*, S.M., Frossard, P., Soatto, S.: Classification regions of deep neural networks.
arXiv preprint arXiv:1705.09552 (2017)
[24] Fawzi, A., Moosavi-Dezfooli, S.M., Frossard, P.: The robustness of deep networks: A geometrical perspective. IEEE Signal Processing Magazine 34(6) (2017) 50–62
[25] Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
[26] Russakovsky*, O., Deng*, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3) (2015) 211–252
[27] Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Volume 1., IEEE (2005) 886–893
[28] do Carmo, M.: Differential Geometry of Curves and Surfaces. Prentice-Hall (1976)