# Biologically Inspired Mechanisms for Adversarial Robustness

Manish Reddy Vuyyuru, Institute for Applied Computational Science, Harvard University, Cambridge, MA 02139, mvuyyuru@g.harvard.edu

Andrzej Banburski, Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139, kappa666@mit.edu

Nishka Pant, Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139, npant@mit.edu

Tomaso Poggio, Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139, tp@ai.mit.edu

A.B. and N.P. contributed equally to this work.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

## Abstract

A convolutional neural network strongly robust to adversarial perturbations at reasonable computational and performance cost has not yet been demonstrated. The primate visual ventral stream seems to be robust to small perturbations in visual stimuli, but the underlying mechanisms that give rise to this robust perception are not understood. In this work, we investigate the role of two biologically plausible mechanisms in adversarial robustness. We demonstrate that the non-uniform sampling performed by the primate retina and the presence of multiple receptive fields with a range of receptive field sizes at each eccentricity improve the robustness of neural networks to small adversarial perturbations. We verify that these two mechanisms do not suffer from gradient obfuscation and study their contribution to adversarial robustness through ablation studies.

## 1 Introduction

While modern convolutional neural networks (CNNs) have demonstrated remarkable performance in visual recognition, they still lag severely behind human vision on several tasks. Notably, despite intense effort in recent years, these models remain vulnerable to adversarial perturbations [1], even on standard datasets (e.g. ImageNet). In contrast, human vision appears to be particularly robust to small perturbations in visual stimuli. Simultaneously, an increasing body of computational and experimental work demonstrates the suitability of these artificial neural networks (ANNs) as models of the primate brain's ventral visual stream [2, 3, 4, 5]. This suggests an opportunity to investigate whether aspects of biological vision that are not yet incorporated into current computer vision models improve adversarial robustness. If the approach is successful, it will improve the robustness of engineered models and provide strong support for ANNs being good models of primate vision. If the proposed approach does not find any way to modify the existing models to reach human-level robustness, it could demonstrate that the broad family of feedforward deep networks should be rejected as models of primate vision.

Engineering Significance. Robust real-world adversarial examples that fool models across a wide range of views, angles and lighting conditions have been demonstrated [6]. CNNs are increasingly being considered for use in safety-critical applications (e.g. autonomous vehicles), so the dramatic sensitivity of state-of-the-art models to tiny perturbations in otherwise benign inputs has clear security implications. Proposed methods for improving adversarial robustness have generally approached the problem from either a prevention or a detection perspective.
Some of the proposed mechanisms include defensive distillation [7], feature squeezing [8], defensive randomization [9], and adversarial training [10]; several have drawn inspiration from biological vision [11, 12]. Faithfully evaluating adversarial robustness has proven to be non-trivial [13, 14, 15], with many proposed methods eventually being shown to be ineffective at making models robust [16, 13, 17, 18]. Adversarial training remains the most promising approach for increasing the robustness of models. However, adversarial training comes with several downsides, ranging from significantly increased computational cost, to preferential robustness against the adversarial attacks that the model was trained on [17], to decreased standard accuracy [19] and ineffectiveness at improving robustness on examples in low-density regions of the training data distribution [20]. Mechanisms that help bring models closer to human-level robustness and alleviate this tension between performance and security would have broad implications for the application of these networks.

The learning of feature representations that are particularly well suited to visual recognition tasks is a key characteristic of CNNs that has led to their widespread use. However, the features learned by current models that perform well on standard datasets can differ significantly from features that are meaningful to humans [21]. Recent studies have demonstrated that the quality of these learned representations is improved in adversarially robust models. In particular, many aspects of feature visualization and manipulation were better aligned with our notion of visual perception in robust models [22, 23]. This suggests that adversarial robustness holds intrinsic value, beyond its security implications and potentially even at the cost of standard accuracy, as a prior for helping models learn representations that are more human-meaningful and interpretable.

Biological Significance. Recent studies have argued that ANNs are suitable models of biological vision due to the similarity between internal representations in CNNs and in the primate brain [2, 4, 3]. In this paper, we explore two biological mechanisms that are not captured in current deep learning models of vision.

The uneven distribution of cones in the primate retina results in non-uniform spatial sampling of visual stimuli. The density of sampling is highest at a fixation point on an image and decreases with distance from the fixation point. In contrast, standard CNNs accept images sampled on a uniform square grid. Previous studies have demonstrated that incorporating the non-uniform sampling performed by the primate retina into standard networks improves predictivity of neural sites in the primate V4 cortical area, allows for better neural population control via controller images [3], and helps in generating adversarial examples that impact the accuracy of time-limited humans [24]. The first mechanism that we investigate incorporates information across multiple retinal fixation points.

The receptive field size along the primate visual stream increases with eccentricity [25, 26, 27]. From a sampling perspective, this translates to several scale-space image fragments centered on a fixation point in an image. Previous studies have argued for and demonstrated the computational role of pooling over the scale-space fragments in invariant visual recognition [28].
This second, multiple-scales mechanism allows translation, scale and clutter invariance to be incorporated into neural networks [29, 30, 31], with much lower sample complexity than standard data augmentation methods. We investigate incorporating information across multiple "cortical fixations".

Contributions. We demonstrate that two mechanisms inherent to primate vision (the "proposed mechanisms") consistently improve the adversarial robustness of neural networks to small adversarial perturbations across a range of PGD variants, hyperparameters and adversarial criteria (by about 0% to 30% for ϵ ≤ 0.02). One of the mechanisms (retinal fixations) improves robustness at almost no cost to standard performance (+1.62% on ImageNet). Through ablation studies, we also identified the key features of each mechanism that contribute to robustness. Our results suggest that biologically inspired mechanisms are promising candidates for improving the robustness of standard neural networks. More broadly, our results also provide support for the hypothesis that ANNs are good models of biological vision.

Figure 1: Left Top: Distributions of sampling points for 5 different retinal fixations. Red dots represent pixels that would be sampled from the original standard image to form the retina-sampled image. Left Bottom: Effect of retinal sampling on an image of a flat checkerboard. Images presented were re-sampled at 5 different fixation points. Right Top: Shown in red, the centering of scale-space fragments on 5 different cortical fixations on the image. Right Bottom: The resulting 4 scale-space fragments from 1 fixation point. For a single fixation point, the 4 scale-space fragments result from crops of varying sizes but are all Gaussian-downsampled to the size of the smallest.

## 2 Methods

Datasets. The experiments were spread across 4 datasets. CIFAR10 is a small, standard dataset [32] and was used to benchmark results against other published results on adversarial robustness. The proposed mechanisms excel with images of much higher resolution. The majority of the experiments in this work were thus carried out on ImageNet10, and some experiments were repeated on ImageNet100 and ImageNet to study the scalability of the proposed mechanisms. The ImageNet dataset [33] offers high-resolution images split into 1000 classes that span a range of breadth (e.g. airliner, strawberry, etc.) and depth (e.g. mud turtle, leatherback turtle, etc.). 10 classes were hand-picked to construct the ImageNet10 dataset. The classes were chosen to be visually distinct and of natural objects (e.g. pandas, snakes, etc.). To construct the ImageNet100 dataset, 100 classes were randomly chosen. Images in the ImageNet dataset are of varying dimensions. To standardize, all images in the ImageNet10, ImageNet100 and ImageNet datasets were formed only from the central 320x320 regions of the original images. For the full ImageNet dataset, in the interest of keeping the compute time for experiments reasonable, models were trained on the full training set but robustness evaluations were carried out on a test set downsampled to 5 images per class (totalling 5000 images for the 1000 classes). See Section 1 in the supplementary material for the preprocessing steps and the list of ImageNet classes comprising the ImageNet10 and ImageNet100 datasets.
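As a concrete illustration of the image standardization step described above, the snippet below sketches the central 320x320 crop with torchvision; the choice of torchvision and the ordering of transforms are our assumptions, and the remaining preprocessing details and class lists are given only in the paper's supplementary material.

```python
import torchvision.transforms as T

# Minimal sketch of the standardization described above: keep only the central
# 320x320 region of each ImageNet image. The full preprocessing pipeline and the
# ImageNet10/ImageNet100 class lists are in the supplementary material; using
# torchvision here is our assumption, not something the paper specifies.
standardize_320 = T.Compose([
    T.CenterCrop(320),  # central 320x320 region of the original image
    T.ToTensor(),       # HWC uint8 image -> CHW float tensor in [0, 1]
])
```

Such a transform could, for example, be passed to torchvision.datasets.ImageFolder when assembling the ImageNet10 and ImageNet100 subsets.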
Biologically Inspired Mechanisms. The first mechanism is the non-uniform spatial sampling of visual stimuli by the photoreceptors. In the primate retina, the density of cones is maximal at the center of the fovea and decreases with eccentricity. The maximum density corresponds to a sampling distance of about 27 seconds of arc between adjacent cones, which corresponds almost exactly to the Shannon sampling limit imposed by the diffraction-limited optics with a cutoff around 60 cycles/degree. The implementation for retinal sampling was adapted from [3], where it was tuned to mimic the exponentially decreasing density of cones with eccentricity. In [3], g is defined as a function that maps points from the re-sampled image (r′) to the original image (r). As part of the transformation, the pixel coordinates in the image grid (x, y) were mapped to polar coordinates (r, θ). To work with arbitrary fixation points, we re-centered this transform at a fixation point instead of at the Cartesian origin. We present here a visualization of the distribution of the sampling points (Figure 1: Top Left) and example retina-sampled images at 5 different fixation points (Figure 1: Bottom Left). See Section 2.1 in the supplementary material for additional details.
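To make the re-centered polar transform concrete, here is a minimal NumPy sketch written under our own assumptions: the exponential radius mapping and the alpha shape parameter stand in for the actual g(·) of [3], which is tuned to cone-density measurements (Section 2.1 of the supplementary material), and the function name and defaults are ours.

```python
import numpy as np

def retina_sample(image, fixation, out_size=320, alpha=2.5):
    """Qualitative sketch of fixation-centered, retina-like non-uniform sampling.

    `image` is an (H, W, C) array and `fixation` an (x, y) pixel location. Each
    output pixel is expressed in polar coordinates about the center of the output
    grid, and its radius is mapped through an exponential-like function so that
    sampling is dense near the fixation point and increasingly sparse with
    eccentricity. The mapping g used in the paper (adapted from [3]) is fit to
    cone-density data; `alpha` here is only a made-up shape parameter.
    """
    h, w, _ = image.shape
    fx, fy = fixation

    # Regular output grid, expressed relative to its own center, normalized to [-1, 1].
    ys, xs = np.mgrid[0:out_size, 0:out_size]
    u = (xs - out_size / 2) / (out_size / 2)
    v = (ys - out_size / 2) / (out_size / 2)
    r_out = np.sqrt(u ** 2 + v ** 2)
    theta = np.arctan2(v, u)

    # Radius in the resampled image -> radius in the original image. The spacing
    # between sampled pixels grows with eccentricity, mimicking falling cone density.
    max_r = max(h, w)
    r_in = max_r * (np.exp(alpha * r_out) - 1.0) / (np.exp(alpha) - 1.0)

    # Convert back to Cartesian source coordinates centered on the fixation point
    # and gather with nearest-neighbor indexing.
    src_x = np.clip(fx + r_in * np.cos(theta), 0, w - 1).astype(int)
    src_y = np.clip(fy + r_in * np.sin(theta), 0, h - 1).astype(int)
    return image[src_y, src_x]
```

Re-sampling the same image at the 5 fixation points used in the paper would then simply mean calling this function with 5 different (x, y) arguments.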
The size of receptive fields also varies with eccentricity, presumably to avoid aliasing [25, 26, 27]. The second mechanism is that V1 neurons show a range of scales at each eccentricity [28]. The main computational reason for this non-uniform sampling and for the existence of a set of spatial scales is to enable processing of images with translation invariance (small, but growing with larger receptive fields) and a large range of scale invariance. We assume here the estimate by [34] of 5 frequency channels, with foveal receptive field diameters of 2s = 1′20″, 3.1′, 6.2′, 11.7′ and 21′, covering a range of roughly 1 to 20. Following previous studies [28, 29, 30], we implemented the estimates for the set of scale-space fragments in V1 (i.e. the frequency channels estimated by [34]) by taking multiple crops that are progressively larger and Gaussian-downsampling them to the dimensions of the smallest crop. We employed 4 scales of dimensions 40x40, 80x80, 160x160 and 240x240 that were all downsampled to 40x40 for ImageNet. For CIFAR10, we only employed 2 scales (15x15 and 30x30, both downsampled to 15x15) because of the significantly lower resolution of images in the dataset (32x32 in CIFAR10 vs 320x320 for ImageNet). We present here a visualization of the centering of the scale-space fragments at various fixation points in an image (Figure 1: Top Right) and the resulting scales from one of the fixation points (Figure 1: Bottom Right). See Section 2.2 in the supplementary material for additional details.

Figure 2: Model A (baseline): standard CNN. Model B (baseline): "coarse fixations". Model C (effect): "retinal fixations". Model D (effect): "cortical fixations". Red marks on images in the second row indicate the approximate action of the mechanisms centered on a fixation point. 2 fixation points are visualized for Models B and C, and 1 fixation point for Model D.

When making a prediction for a given fixation point, models incorporating cortical sampling have to incorporate information across the different scales. Each scale is fed to a separate CNN which does not end with a dense layer. We concatenate the final latent vector from each branch, and this concatenated representation is then passed through a single dense layer for classification. During training, we used an auxiliary loss to ensure that each branch was also predictive of the label on its own. See Sections 5.11-5.13 in the supplementary material for results from some preliminary experiments with alternate concatenation mechanisms (e.g. max or average pooling across scales).
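The following PyTorch sketch illustrates, under our own assumptions, the scale-space fragment extraction and the branched architecture with an auxiliary loss described above. The paper's branches are modified ResNets and its Gaussian downsampling follows the supplementary material; the tiny branch CNN, blur parameters, auxiliary-loss weight and all names below are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms.functional as TF

SCALES = (40, 80, 160, 240)  # ImageNet crop sizes; every fragment ends up 40x40

def scale_space_fragments(image, fixation, scales=SCALES):
    """Crop progressively larger regions around `fixation` (an (x, y) pixel in a
    (C, H, W) tensor) and Gaussian-downsample each to the smallest size. The blur
    kernel/sigma below are our guess, not the paper's exact choice."""
    cx, cy = fixation
    _, h, w = image.shape
    fragments = []
    for s in scales:
        # Clamp the crop so it stays inside the image for fixations near the border.
        x0 = min(max(cx - s // 2, 0), w - s)
        y0 = min(max(cy - s // 2, 0), h - s)
        crop = image[:, y0:y0 + s, x0:x0 + s]
        if s > scales[0]:
            sigma = s / scales[0]
            crop = TF.gaussian_blur(crop, kernel_size=2 * int(2 * sigma) + 1, sigma=sigma)
            crop = F.interpolate(crop.unsqueeze(0), size=scales[0],
                                 mode="bilinear", align_corners=False).squeeze(0)
        fragments.append(crop)
    return fragments  # list of (C, 40, 40) tensors, one per scale


class CorticalFixationNet(nn.Module):
    """One CNN branch per scale (no final dense layer); latent vectors are
    concatenated for the main classifier, and per-branch auxiliary heads provide
    the auxiliary training loss."""

    def __init__(self, num_classes=10, num_scales=len(SCALES), width=64):
        super().__init__()
        def make_branch():
            return nn.Sequential(
                nn.Conv2d(3, width, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
        self.branches = nn.ModuleList([make_branch() for _ in range(num_scales)])
        self.aux_heads = nn.ModuleList([nn.Linear(width, num_classes) for _ in range(num_scales)])
        self.classifier = nn.Linear(width * num_scales, num_classes)

    def forward(self, fragments):  # fragments: list of (B, C, 40, 40) tensors
        latents = [branch(x) for branch, x in zip(self.branches, fragments)]
        main_logits = self.classifier(torch.cat(latents, dim=1))
        aux_logits = [head(z) for head, z in zip(self.aux_heads, latents)]
        return main_logits, aux_logits


def cortical_loss(main_logits, aux_logits, labels, aux_weight=0.3):
    # Auxiliary terms encourage each branch to predict the label on its own
    # (the weight 0.3 is our placeholder, not a value reported in the paper).
    loss = F.cross_entropy(main_logits, labels)
    for logits in aux_logits:
        loss = loss + aux_weight * F.cross_entropy(logits, labels)
    return loss
```

Evaluation-time averaging of logits over the 5 pre-determined fixation points, described in the Models and Training paragraphs below, would sit on top of this forward pass.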
Models. Models for the CIFAR10 and ImageNet datasets were based on the standard CIFAR ResNet-20 and ImageNet ResNet-18 architectures [35]. We used a standard model and a coarse fixations model as baselines (Model A and Model B in Figure 2). The coarse fixations model crudely approximates the effect of fixations by applying a standard network to different image regions. This is effectively the same as standard 10-crop testing in [35] without image flipping. The retinal fixations model applies a standard network to an image that is non-uniformly sampled as previously described (Model C in Figure 2). In the cortical fixations model, the standard ResNet architecture is split over multiple branches that process the scales at each eccentricity independently. The number of filters for each branch was chosen such that a forward pass through all the branches costs about the same as a forward pass through a standard ResNet. The striding in early layers of the ResNet architecture was adjusted to account for the small dimensions of each scale-space fragment. We used an auxiliary loss during training to encourage information across all the scales to contribute to the model prediction (Model D in Figure 2). At evaluation time, information was incorporated across fixations by averaging the model logits from 5 fixation points for each mechanism (top left, top right, bottom left, bottom right, middle).

Training. The models for coarse, retinal and cortical fixations depend on a fixation point on the image to center the mechanism on. During training, a fixation point is randomly sampled from the valid set of fixations for the mechanism. This is in contrast to the behavior at evaluation time, when information across fixations is incorporated by averaging model logits over 5 pre-determined fixation points for each mechanism. See Section 3 in the supplementary material for the valid and chosen fixations for each mechanism. Image augmentation was standardized across all models and datasets (random crops and random left/right flips). No color augmentation was used. Models for CIFAR10 and ImageNet10 were both trained with an ADAM optimizer with β1 = 0.9, β2 = 0.999 and an initial learning rate of 0.001. CIFAR10 models were trained for 200 epochs with a batch size of 180 and a fixed learning schedule (decay from the initial rate by 0.1, 0.01, 0.001, 0.0005 at epochs 80, 120, 160, 180). Models for ImageNet10 were trained for 400 epochs with a batch size of 64 and a fixed learning schedule (decay from the initial rate by 0.1, 0.01, 0.001, 0.0005 at epochs 160, 240, 320, 360). Models for ImageNet100 and ImageNet used an SGD optimizer with weight decay of 0.0001, momentum 0.9, an initial learning rate of 0.1, and a batch size of 256. Models for ImageNet100 were trained for 130 epochs with a fixed learning schedule (decay from the initial rate by 0.1, 0.01, 0.001, 0.0005 at epochs 30, 70, 90, 120). Models for ImageNet were trained for 90 epochs with a fixed learning schedule (decay from the initial rate by 0.1, 0.01, 0.001 at epochs 30, 60, 80).

Adversarial Robustness. We followed previously proposed guidelines [13, 14, 15] when evaluating robustness. The adversarial attacks used were as implemented in the Python package Foolbox [36]. Projected Gradient Descent (PGD) [37] was motivated by previous works as a universal first-order adversary that provides a suitable security guarantee against first-order attacks [10, 38]. In this work, the adversarial robustness of the proposed mechanisms was investigated primarily with PGD. PGD (L∞ variant) prescribes the generation of adversarial examples with the following iterative scheme:

$$x_{\text{adv},i} = \mathrm{CLIP}_{x,\epsilon}\left(x_{\text{adv},i-1} + \lambda \cdot \mathrm{SIGN}\left(\nabla_x L(\cdot)\right)\right), \qquad x_{\text{adv},0} = x_{\text{original}}$$

where i is the iteration count, CLIP is an operation that clips x back to the permissible set, λ is the step size and ∇x L(·) is the gradient of the relevant loss function for the attack. The gradient was always fully propagated through the proposed mechanisms.

When setting a maximum perturbation size ϵ with PGD, we need to choose a distance metric. Various Lp norms of the distance from the adversarial image to the original image are usually employed, ‖x_adv − x_original‖_p, where typically p = 2 or p = ∞ [16]. The usual distance metrics are not necessarily well aligned with human-perceived similarity; there is ongoing work in that area that would be interesting to consider in the future [39]. We mostly used L∞ PGD and L2 PGD in our experiments, but also checked robustness to L1 PGD and the fast gradient sign method (FGSM) [36].

We evaluated the robustness of the models by studying how the accuracy varied as ϵ, the maximum perturbation size, was increased. We calculate the accuracy as 1 − ((num. naturally misclassified + num. successful adversarial examples) / num. images in test set). Whether an image is naturally misclassified was always determined by whether the true class was the most likely class predicted by the model, regardless of the adversarial criteria. Adversarial attacks were then only run against images that were not naturally misclassified. In most of the experiments, we set the step size λ to ϵ/3 and ran 5 iterations. This allows PGD to reach the edge of the permissible set and explore the boundary while keeping the compute time for the experiments reasonable. We also conducted some experiments with 20 to 1000 iterations, step sizes λ of ϵ/20 to ϵ/3, setting λ dynamically with an ADAM optimizer, etc. See Sections 4 and 5.1 in the supplementary material for additional details.
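For reference, below is a from-scratch PyTorch sketch of the L∞ update rule and the accuracy computation described above. The paper itself used the Foolbox implementations of these attacks; the function names, the plain cross-entropy loss and the defaults here are our assumptions, and random restarts as well as the evaluation-time averaging over fixation points are omitted.

```python
import torch

def linf_pgd(model, x, y, eps, steps=5, step_size=None, targeted=False):
    """L-infinity PGD matching the update rule above:
    x_{adv,i} = CLIP_{x,eps}(x_{adv,i-1} + step * SIGN(grad_x L)), starting from the
    natural image. step_size defaults to eps/3, as in most of the paper's experiments."""
    step_size = eps / 3 if step_size is None else step_size
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = torch.nn.functional.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend the loss for untargeted attacks, descend it for targeted ones
        # (where y would hold the adversarial target classes).
        direction = -grad.sign() if targeted else grad.sign()
        x_adv = x_adv.detach() + step_size * direction
        # CLIP back to the eps-ball around x and to the valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()


def robust_accuracy(model, x, y, eps, **pgd_kwargs):
    """Accuracy as defined above: an image counts against the model if it is either
    naturally misclassified or successfully attacked."""
    with torch.no_grad():
        naturally_correct = model(x).argmax(dim=1) == y
    # Attacks are only run against images that were not naturally misclassified.
    x_adv = linf_pgd(model, x[naturally_correct], y[naturally_correct], eps, **pgd_kwargs)
    with torch.no_grad():
        still_correct = model(x_adv).argmax(dim=1) == y[naturally_correct]
    return still_correct.sum().item() / len(y)
```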
## 3 Results

We compared the test classification error to study standard performance. We refer to the performance of the models on the unperturbed natural test sets as the standard performance. The coarse fixations model generally outperformed the standard ResNet models, which is in line with previous studies [35]. The coarse fixations model underperformed on CIFAR10, likely due to the small size of the images. The retinal fixations model performed about as well as the best baseline model (worse by only 0.2% on ImageNet100 and better by 1.62% on ImageNet), but the cortical fixations model decreased standard performance (underperforming by 4.96% on ImageNet). The relative performance of the models was fairly consistent across datasets. The performance penalties for the cortical fixations model were likely still less than for adversarial training. For example, L∞ PGD adversarial training on CIFAR10 decreases standard performance by about 7.9% [10].

Table 1: Standard top-1 performance of baselines and proposed models on various datasets.

| Model | CIFAR10 | ImageNet10 | ImageNet100 | ImageNet |
| --- | --- | --- | --- | --- |
| Standard ResNet | 88.13% | 89.4% | 76.80% | 59.46% |
| Coarse fixations | 87.70% | 91.2% | 78.42% | 61.28% |
| Retinal fixations | 88.88% | 90.2% | 78.22% | 62.90% |
| Cortical fixations | 85.16% | 88.6% | 73.62% | 56.32% |

Figure 3: Robustness of the models on various datasets with increasing PGD perturbation budget. Top Row: robustness to 5-step L∞ PGD standard adversarial examples. Bottom Left: robustness to strongly misclassified adversarial examples. Bottom Middle: robustness to 20-step PGD. Bottom Right: robustness to L2 PGD.

### 3.1 Adversarial Robustness

The adversarial robustness of the proposed mechanisms was evaluated by studying the change in accuracy of the models with increasing perturbation budget. The experiments span a range of PGD variants, hyperparameters and adversarial criteria (Figures 3, 4). The biologically inspired mechanisms consistently improved adversarial robustness to small perturbations across almost all experiments on all datasets (see Figure 4). The extent of improvement varied depending on the attack variants, hyperparameters and the dataset used. The retinal fixations and cortical fixations models improved robustness to about the same extent, with the best model varying across experiments. At larger perturbations, the mechanisms did not significantly impact robustness. See Sections 5.9 and 5.10 in the supplementary material for a preliminary inspection of the quality and visibility of the adversarial perturbations.

For small perturbations on ImageNet10, we observed a greater improvement in accuracy for the proposed models over the best baseline when using PGD as opposed to FGSM. This suggests that the mechanisms are effective at improving robustness, with the retinal and cortical fixations models performing proportionately better against stronger attacks that more thoroughly probe the robustness. The improvement in accuracy was relatively unchanged when increasing the number of iterations of PGD from 5 to 20 with a step size λ of 0.1 or 0.025. In general, the proposed models showed greater improvement in accuracy under L∞ PGD versus L2 PGD, with the least improvement under L1 PGD. Further studies could be conducted on variations in improvement with attack variants and hyperparameters to better understand the contribution of the mechanisms to adversarial robustness. See Sections 5.2-5.8 and 5.11-5.14 in the supplementary material for additional details.

Figure 4: Improvement in accuracy of proposed models over the best baseline at small ϵ. Each dot represents a different attack configuration. Top row: ImageNet10. Bottom row: ImageNet100.

Confident Misclassifications. On ImageNet10, the robustness to standard misclassifications (true class not the most likely predicted class) was compared to the robustness to confident misclassifications (true class not in the top 3 for untargeted attacks, and adversarial class predicted with probability >80% for targeted attacks). Generally, the proposed mechanisms were more robust to confident mistakes than to standard mistakes for untargeted attacks. The robustness to confident and standard mistakes for targeted attacks was about the same. In a targeted setting, the loss used for PGD pushes towards a confident misclassification (the objective is to increase the probability of the adversarial class), whereas in an untargeted setting, the loss does not explicitly push towards confident misclassifications (the objective is to minimize the probability of the true class, which can be accomplished by increasing the probability of any one or more of the other classes). This suggests that the proposed models improve robustness to confident misclassifications, as confident mistakes were less likely to occur unless they were explicitly optimized for.
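A small sketch of how the two adversarial criteria above could be checked for a single example; the function name, arguments and use of softmax probabilities are our illustration rather than the paper's exact evaluation code.

```python
import torch

def is_confident_mistake(logits, true_label, target_label=None, top_k=3, prob_threshold=0.8):
    """Sketch of the 'confident misclassification' criteria described above.
    Untargeted: the true class has fallen out of the top-3 predictions.
    Targeted: the adversarial target class is predicted with probability > 80%.
    (Function name, arguments and thresholds-as-arguments are ours.)"""
    probs = torch.softmax(logits, dim=-1)
    if target_label is None:                      # untargeted criterion
        top_classes = probs.topk(top_k).indices
        return true_label not in top_classes
    return probs[target_label].item() > prob_threshold  # targeted criterion
```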
Adversarial Training. On CIFAR10, the robustness of the models was benchmarked against an adversarially trained model available online from a previous study [10]. That model uses a variant of the standard ResNet architecture (w32-10 ResNet). We benchmarked the robustness of the models with L∞ PGD, which is also the attack that was used for adversarial training (Figure 3). The robustness of the retinal fixations and cortical fixations models was consistently better than the baselines but worse than adversarial training. The accuracy of the models that were not adversarially trained drops to approximately 0% for perturbations of size ϵ = 0.05 and higher. The proposed models made far fewer confident misclassifications than the baselines, with the robustness of the proposed models to confident misclassifications at ϵ = 0.005 approaching that of adversarial training.

### 3.2 Ablation Experiments

We performed experiments with a series of ablated models to study the contribution of various aspects of the proposed mechanisms to robustness. The experiments were conducted on ImageNet10 and span a range of PGD variants, hyperparameters and adversarial criteria (Figure 5). All ablated models employed a mechanism centered on fixations. As before, we incorporated information across fixations at evaluation time by averaging the model logits from 5 pre-determined fixation points in the image.

Figure 5: Improvement in accuracy of ablated models over the best baseline model under L∞ PGD with ϵ = 0.01. Each dot represents a different attack configuration/criterion.

Retinal Ablation. Non-uniform retinal sampling effectively sub-samples and up-samples the image. The pixels that get sub-sampled and up-sampled are chosen based on their distance to a point of fixation on the image. There are several factors that could potentially account for the improved robustness of the retinal fixations model, such as the sub-sampling, the sub-sampling combined with up-sampling, or the non-uniformity of the sampling. The retinal fixations model consistently outperformed the coarse fixations model in robustness tests. This indicates that simply sub-sampling, or applying a network to different image regions, does not improve robustness. We studied an ablated model that performs uniform sub-sampling (exactly as the coarse fixations model) followed by up-sampling. This ablated model ("uniform resampling" in Figure 5) consistently underperformed compared to the best baseline model. This indicates that uniformly sub-sampling and up-sampling an image does not improve (and actually worsens) robustness. This leaves the non-uniformity of the sampling as the main contributing factor to the improved robustness. We speculate that the non-uniform sampling conditions the model to learn larger-scale features during training and that these features are more robust to adversarial perturbations. This could be investigated further by studying the extent of robust versus non-robust features learned by the model, as in [21].

Cortical Ablation. The cortical fixations model breaks an image into multiple scale-space fragments centered on a given fixation point. These fragments are constructed with uniform square crops of increasing size followed by a Gaussian downsample of all fragments to the dimensions of the smallest fragment. The fragments are then fed through separate branches of a network, and the linear classifier at the end of the network uses information from all branches.
Potential factors contributing to the improved robustness include Gaussian blurring of sub-samples of the image, Gaussian downsampling of sub-samples of the image, classifying with a fixed computational budget distributed over separate branches of a network that combine their outputs, or a combination of these factors. Classifying with a branched network (the "ensembling" ablated model in Figure 5) worsened or at best did not change the robustness of the models. The "Gaussian blur" ablated model performs the sub-sampling exactly as the coarse fixations model, but applies a Gaussian blur to the images. This ablated model was more robust than the baselines, but only to a small extent. It has been noticed previously that denoising operations (e.g. Gaussian blur) contribute to adversarial robustness [40, 41]. The "Gaussian downsample" ablated model is effectively the branch of the cortical fixations model that corresponds to the largest receptive field size. It performs sub-sampling similar to the coarse fixations model and applies a Gaussian downsample. This ablated model was more robust than the baselines but less robust than the cortical fixations model. This suggests that an increased reliance on large-scale features in an image for classification partially contributed to the improved robustness of the cortical fixations models. This leaves the ensembling across different scale-space fragments as the remaining contributing factor to the robustness of the cortical fixations model. It is possible that the fragments are sufficiently de-correlated to make ensembling effective, unlike in the case of the "ensembling" ablated model. We tried several versions of the cortical fixations model that incorporated information across the different scale-space fragments in various ways (see Section 5.12 in the supplementary material). Some methods (e.g. pooling across branches) outperformed the current cortical fixations model in several experiments. This suggests that issues remain with properly incorporating features across the scale-space fragments in the cortical fixations model.

### 3.3 Gradient Obfuscation and Other Control Experiments

A common pitfall when evaluating adversarial robustness is gradient obfuscation [13]. By masking the gradient, models can appear to be robust under white-box attacks (e.g. PGD), which use the gradient to construct an adversarial example. We verified that the proposed mechanisms did not suffer from gradient obfuscation through previously suggested experiments [14]. Additionally, the measured robustness of models to adversarial attacks can be sensitive to the choice of attack hyperparameters, such as the attack distance metric [17]. We compared the robustness of the models using a broad set of attack hyperparameters and also verified that our attacks had sufficiently converged [14]. See Sections 5.2-5.8, 5.14 and 6 in the supplementary material for additional details.

0% Accuracy on Unbounded Attacks. With an unbounded adversarial attack, the accuracy of all models should always drop to 0%, for the simple reason that an unbounded adversarial perturbation can transform any image into any other image. Therefore, if the gradient information used in PGD is useful in constructing an adversarial example, the accuracy of any model should always be brought to 0%. We observed this repeatedly when studying the accuracy of the models with increasing PGD perturbation budget. For example, the accuracy of all models drops to 0% for ϵ = 0.5 under 5-step L∞ PGD on ImageNet10 (Figure 3).
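The sanity check above can be scripted directly; the sketch below assumes the Foolbox 3.x API (the paper reports using Foolbox, though not necessarily this version or interface) and a PyTorch model that takes inputs in [0, 1].

```python
import foolbox as fb

def check_accuracy_collapses(model, images, labels, epsilons=(0.001, 0.01, 0.1, 0.5)):
    """Sanity check from the paragraph above, sketched against the Foolbox 3.x API:
    with a large enough perturbation budget, PGD should drive accuracy to ~0%."""
    fmodel = fb.PyTorchModel(model.eval(), bounds=(0, 1))
    attack = fb.attacks.LinfPGD(steps=5, rel_stepsize=1 / 3)   # step size eps/3
    _, _, success = attack(fmodel, images, labels, epsilons=list(epsilons))
    for eps, succ in zip(epsilons, success):
        robust_acc = 1.0 - succ.float().mean().item()
        print(f"eps={eps:<6} robust accuracy={robust_acc:.3f}")
    # A robust accuracy that stays well above 0% at eps=0.5 would point to
    # gradient obfuscation rather than genuine robustness.
```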
Iterative Attacks Outperform Single-Step Attacks. For models that do not suffer from obfuscated gradients, taking multiple steps in the direction of gradient ascent should be more useful, or at least as useful, as taking a single step when generating adversarial examples [14]. We compared the robustness of the models against FGSM, which is exactly equivalent to 1-step L∞ PGD, and against 5-step L∞ PGD for a range of ϵ from 0.001 to 0.5. The accuracy of the retinal fixations and cortical fixations models under PGD was always strictly lower than or equal to their accuracy under FGSM. This suggests that the models do not suffer from gradient obfuscation, since PGD was more successful at finding adversarial examples. See Section 6.2 in the supplementary material for additional details.

White-Box Attacks Outperform Black-Box Attacks. The gradient propagated through the models should be more useful for constructing adversarial examples than a crude approximation of the gradient. We compared PGD, which uses the exact gradient to construct adversarial examples (a white-box attack), to three other attack algorithms that instead use a crude approximation of the gradient (black-box attacks). We constructed adversarial examples using a standard ResNet model as a substitute for the retinal fixations model and cortical fixations model (a transfer attack). The examples generated with the transfer attack heavily underperformed compared to PGD. In further experiments, PGD also outperformed the boundary attack [42] and backward pass differentiable approximation [13]. See Section 6.3 in the supplementary material for additional details.

Attack Hyperparameters. The measured robustness of models can be sensitive to the attack configuration used for evaluation [14]. For example, see [17]'s critique of [10] under L0 or L2 PGD. We studied the improvement in robustness with the proposed mechanisms using a broad set of hyperparameters. This set included various attack distance metrics (L1, L2, L∞), starting attacks with a random offset from the natural images, dynamic attack step sizes (ADAM), etc. Models were evaluated against each attack configuration using a range of ϵ from 0.001 to 0.5. Across almost all experiments, the proposed mechanisms were more robust to small perturbations than the best baseline model. See Sections 5.2-5.8 in the supplementary material for additional details.

Attack Convergence. To keep the compute time reasonable, we used attacks with 5 or 20 steps in most of our experiments, with step sizes of ϵ/3 or ϵ/12. To show that the trends in robustness seen in our experiments were representative of the general trend, we ran L∞ PGD for up to 1000 iterations using a step size of ϵ/20 for a range of ϵ from 0.001 to 0.02. Despite greatly increasing the number of steps and decreasing the step size, the drop in accuracy for all models was minimal. This suggests that the attacks in our experiments with a smaller number of iterations and larger step sizes were already sufficiently converged and strongly representative of the general trend. See Section 5.6 in the supplementary material for additional details.

## 4 Conclusion

In this work, we showed that two key features of primate vision, foveation (due to the non-uniform distribution of cones in the retina) and multiscale filtering (due to receptive fields of different sizes in V1 at each eccentricity), consistently improve the robustness of neural networks to small adversarial perturbations.
These mechanisms have negligible computational cost, and one of them (non-uniform sampling) improves robustness at almost no cost in recognition performance. However, these mechanisms do not improve robustness to large perturbations. Preliminary inspection of the adversarial examples leads us to speculate that the point at which human perception begins to notice that adversarial images differ from normal ones lies roughly at or above the transition in our experiments from small to large perturbations. In that case, our results suggest that the two mechanisms we identified may partially explain the robustness of primate vision to small perturbations, while an additional, separate mechanism akin to an anomaly detector may detect the presence of large perturbations. In short, our results support the hypothesis that ANNs are suitable as core models of object recognition in primates. They also imply that implementing more biologically inspired mechanisms is a promising route to increasing adversarial robustness without forgoing standard performance.

## Acknowledgements

We thank Aleksander Madry and Fernanda De La Torre for illuminating discussions.

## Broader Impact

In terms of ethical aspects and future societal consequences, achieving adversarially robust models is of critical importance to the deployment of autonomous technologies we can trust. Among the positive outcomes we could count autonomous vehicles immune to misleading by malicious agents, better trust in medical imaging applications, and any other discipline to which AI can be applied and which requires safety guarantees. This naturally also comes with negative impacts, from reducing jobs available to humans, to potentially making surveillance technologies much more difficult to avoid, allowing authoritarian regimes a much tighter control over their citizens, as well as enabling progress in autonomous weapon systems. Additionally, every defense against adversarial attacks that has been proposed so far has eventually been found to be vulnerable. If such vulnerabilities are found in increasingly more accurate models of primate vision, this could suggest the possibility of dynamically changing adversarial attacks that would fool humans, leading to potentially new camouflage technologies.

## Funding Disclosure

Part of the funding is from the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216, and part from a grant from Lockheed Martin.

## References

[1] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. International Conference on Learning Representations, 2014.

[2] Martin Schrimpf, Jonas Kubilius, Ha Hong, Najib J. Majaj, Rishi Rajalingham, Elias B. Issa, Kohitij Kar, Pouya Bashivan, Jonathan Prescott-Roy, Kailyn Schmidt, Daniel L. K. Yamins, and James J. DiCarlo. Brain-Score: Which artificial neural network for object recognition is most brain-like? bioRxiv preprint, 2018.

[3] Pouya Bashivan, Kohitij Kar, and James J. DiCarlo. Neural population control via deep image synthesis. Science, 364, 2019.

[4] Daniel L. K. Yamins and James J. DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 2016.

[5] Alexander Berardino, Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. Eigen-distortions of hierarchical representations. Advances in Neural Information Processing Systems, 2017.

[6] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok.
Synthesizing robust adversarial examples. International Conference on Machine Learning, 2018.

[7] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. IEEE Symposium on Security and Privacy, 2016.

[8] Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. Network and Distributed Systems Security Symposium, 2018.

[9] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Yuille. Mitigating adversarial effects through randomization. International Conference on Learning Representations, 2018.

[10] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. International Conference on Learning Representations, 2018.

[11] Daniel Zoran, Mike Chrzanowski, Po-Sen Huang, Sven Gowal, Alex Mott, and Pushmeet Kohli. Towards robust image classification using sequential attention models. Conference on Computer Vision and Pattern Recognition, 2020.

[12] Yan Luo, Xavier Boix, Gemma Roig, Tomaso Poggio, and Qi Zhao. Foveation-based mechanisms alleviate adversarial examples. arXiv preprint, 2016.

[13] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. International Conference on Machine Learning, 2018.

[14] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. arXiv preprint, 2019.

[15] Jonathan Uesato, Brendan O'Donoghue, Aaron van den Oord, and Pushmeet Kohli. Adversarial risk and the dangers of evaluating against weak attacks. International Conference on Machine Learning, 2018.

[16] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. arXiv preprint, 2018.

[17] Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. Towards the first adversarially robust neural network model on MNIST. International Conference on Learning Representations, 2019.

[18] Warren He, James Wei, Xinyun Chen, Nicholas Carlini, and Dawn Song. Adversarial example defenses: Ensembles of weak defenses are not strong. arXiv preprint, 2017.

[19] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. International Conference on Learning Representations, 2019.

[20] Huan Zhang, Hongge Chen, Zhao Song, Duane Boning, Inderjit S. Dhillon, and Cho-Jui Hsieh. The limitations of adversarial training and the blind-spot attack. International Conference on Learning Representations, 2019.

[21] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. Advances in Neural Information Processing Systems, 2019.

[22] Shibani Santurkar, Andrew Ilyas, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Image synthesis with a single (robust) classifier. Advances in Neural Information Processing Systems, pages 1260-1271, 2019.

[23] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Brandon Tran, and Aleksander Madry. Adversarial robustness as a prior for learned representations. arXiv preprint, 2019.

[24] Gamaleldin F.
Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alex Kurakin, Ian Goodfellow, and Jascha Sohl-Dickstein. Adversarial examples that fool both computer vision and time-limited humans. Advances in Neural Information Processing Systems, 2018.

[25] J. Freeman and Eero Simoncelli. Metamers of the ventral stream. Nature Neuroscience, 14:1195-1201, 2011.

[26] R. Gattass, C. G. Gross, and J. H. Sandell. Visual topography of V2 in the macaque. The Journal of Comparative Neurology, 1981.

[27] R. Gattass, A. P. Sousa, and C. G. Gross. Visuotopic organization and extent of V3 and V4 of the macaque. Journal of Neuroscience, 1988.

[28] Tomaso Poggio, Jim Mutch, and Leyla Isik. Computational role of eccentricity dependent cortical magnification. CBMM Memo, 2014.

[29] Francis X. Chen, Gemma Roig, Leyla Isik, Xavier Boix, and Tomaso Poggio. Eccentricity dependent deep neural networks: Modeling invariance in human vision. Association for the Advancement of Artificial Intelligence, 2017.

[30] Anna Volokitin, Gemma Roig, and Tomaso Poggio. Do deep neural networks suffer from crowding? Advances in Neural Information Processing Systems, 2017.

[31] Yena Han, Gemma Roig, Gad Geiger, and Tomaso Poggio. Scale and translation-invariance for novel objects in human vision. Scientific Reports, 10, 2020.

[32] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical Report, 2009.

[33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. arXiv:1409.0575, 2014.

[34] D. Marr, T. Poggio, and E. Hildreth. Smallest channel in early human vision. Journal of the Optical Society of America, 1980.

[35] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Conference on Computer Vision and Pattern Recognition, 2016.

[36] Jonas Rauber, Wieland Brendel, and Matthias Bethge. Foolbox: A Python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131, 2017.

[37] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. International Conference on Learning Representations, 2017.

[38] Nicholas Carlini, Guy Katz, Clark Barrett, and David L. Dill. Provably minimally-distorted adversarial examples. arXiv preprint, 2018.

[39] Zhengyu Zhao, Zhuoran Liu, and Martha Larson. Towards large yet imperceptible adversarial image perturbations with perceptual color distance. arXiv preprint, 2019.

[40] Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan Yuille, and Kaiming He. Feature denoising for improving adversarial robustness. Conference on Computer Vision and Pattern Recognition, 2019.

[41] Xin Li and Fuxin Li. Adversarial examples detection in deep networks with convolutional filter statistics. International Conference on Computer Vision, 2017.

[42] Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. International Conference on Learning Representations, 2018.