# Intriguing Properties of Vision Transformers

Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang

Australian National University, Mohamed bin Zayed University of AI, Stony Brook University, Monash University, Linköping University, University of California, Merced, Yonsei University, Google Research

muzammal.naseer@anu.edu.au

Abstract

Vision transformers (ViTs) have demonstrated impressive performance across numerous machine vision tasks. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. An important question is how such flexibility (in attending to image-wide context conditioned on a given patch) can facilitate handling nuisances in natural images, e.g., severe occlusions, domain shifts, spatial permutations, and adversarial and natural perturbations. We systematically study this question via an extensive set of experiments encompassing three ViT families and provide comparisons with a high-performing convolutional neural network (CNN). We show and analyze the following intriguing properties of ViTs: (a) Transformers are highly robust to severe occlusions, perturbations and domain shifts, e.g., they retain as high as 60% top-1 accuracy on ImageNet even after randomly occluding 80% of the image content. (b) The robustness towards occlusions is not due to texture bias; instead, we show that ViTs are significantly less biased towards local textures than CNNs. When properly trained to encode shape-based features, ViTs demonstrate shape recognition capability comparable to that of the human visual system, previously unmatched in the literature. (c) Using ViTs to encode shape representations leads to an interesting consequence of accurate semantic segmentation without pixel-level supervision. (d) Off-the-shelf features from a single ViT model can be combined to create a feature ensemble, leading to high accuracy rates across a range of classification datasets in both traditional and few-shot learning paradigms. We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by self-attention mechanisms. Code: https://git.io/Js15X.

1 Introduction

As vision transformers (ViTs) attract more interest [1], it becomes highly pertinent to study the characteristics of their learned representations. Specifically, from the perspective of safety-critical applications such as autonomous cars, robots and healthcare, the learned representations need to be robust and generalizable. In this paper, we compare the performance of transformers with convolutional neural networks (CNNs) for handling nuisances (e.g., occlusions, distributional shifts, adversarial and natural perturbations) and for generalization across different data distributions. Our in-depth analysis is based on three transformer families, ViT [2], DeiT [3] and T2T [4], across fifteen vision datasets. For brevity, we refer to all the transformer families as ViT, unless otherwise mentioned. We are intrigued by the fundamental differences in the operation of convolution and self-attention, which have not been extensively explored in the context of robustness and generalization.
While convolutions excel at learning local interactions between elements in the input domain (e.g., edge and contour information), self-attention has been shown to effectively learn global interactions (e.g., relations between distant object parts) [5, 6]. Given a query embedding, self-attention finds its interactions with the other embeddings in the sequence, thereby conditioning on the local content while modeling global relationships [7]. In contrast, convolutions are content-independent, as the same filter weights are applied to all inputs regardless of their distinct nature. Given these content-dependent, long-range interaction modeling capabilities, our analysis shows that ViTs can flexibly adjust their receptive fields to cope with nuisances in data and enhance the expressivity of their representations.

Figure 1: We show intriguing properties of ViTs, including impressive robustness to (a) severe occlusions, (b) distributional shifts (e.g., stylization to remove texture cues), (c) adversarial perturbations, and (d) patch permutations. Furthermore, our ViT models trained to focus on shape cues can segment foregrounds without any pixel-level supervision (e). Finally, off-the-shelf features from ViT models generalize better than CNNs (f).

Our systematic experiments and novel design choices lead to the following interesting findings:

- ViTs demonstrate strong robustness against severe occlusions of foreground objects, non-salient background regions and random patch locations, when compared with state-of-the-art CNNs. For instance, under a significant random occlusion of up to 80%, DeiT [3] maintains top-1 accuracy of up to 60% on the ImageNet [8] validation set, where the CNN has zero accuracy.
- When presented with the texture and shape of the same object, CNN models often make decisions based on texture [9]. In contrast, ViTs perform better than CNNs and comparably to humans on shape recognition. This highlights the robustness of ViTs under significant distribution shifts, e.g., recognizing object shapes in less textured data such as paintings.
- Compared to CNNs, ViTs show better robustness against other nuisance factors such as spatial patch-level permutations, adversarial perturbations and common natural corruptions (e.g., noise, blur, contrast and pixelation artefacts). However, similar to CNNs [10], a shape-focused training process renders them vulnerable to adversarial attacks and common corruptions.
- Apart from their promising robustness properties, off-the-shelf ViT features from ImageNet pre-trained models generalize exceptionally well to new domains, e.g., few-shot learning, fine-grained recognition, scene categorization and long-tail classification settings.

In addition to our extensive experimental analysis and new findings, we introduce several novel design choices to highlight the strong potential of ViTs. To this end, we propose an architectural modification to DeiT to encode shape information via a dedicated token, which demonstrates how seemingly contradictory cues can be modeled with distinct tokens within the same architecture, leading to favorable implications such as automated segmentation without pixel-level supervision. Moreover, our off-the-shelf feature transfer approach utilizes an ensemble of representations derived from a single architecture to obtain state-of-the-art generalization with a pre-trained ViT (Fig. 1).
2 Related Work

CNNs have shown state-of-the-art performance in independent and identically distributed (i.i.d.) settings but remain highly sensitive to distributional shifts: adversarial noise [11, 12], common image corruptions [13], and domain shifts (e.g., RGB to sketches) [14]. It is natural to ask whether ViTs, which process inputs based on self-attention, offer any advantages over CNNs. Shao et al. [15] analyze ViTs against adversarial noise and show that ViTs are more robust to high-frequency changes. Similarly, Bhojanapalli et al. [16] study ViTs against spatial perturbations [15] and their robustness to the removal of any single layer. Since ViTs process image patches, we focus on their robustness against patch masking, localized adversarial patches [17] and common natural corruptions. A concurrent work from Paul and Chen [18] also develops similar insights on the robustness of ViTs, but with a somewhat different set of experiments.

Geirhos et al. [9] provide evidence that CNNs mainly exploit texture to make decisions and give less importance to global shape. This is further confirmed by the ability of CNNs to perform well using only local features [19]. Recently, [20] quantifies the mutual information [21] between shape and texture features. Our analysis indicates that large ViT models have less texture bias and give relatively higher emphasis to shape information. ViTs' shape-bias approaches human-level performance when they are directly trained on stylized ImageNet [9]. Our findings are consistent with a concurrent recent work that demonstrates the importance of this trend for understanding human behaviour and bridging the gap between human and machine vision [22]. A recent work [23] shows that a self-supervised ViT can automatically segment foreground objects. In comparison, we show how shape-focused learning can impart a similar capability in image-level supervised ViT models, without any pixel-level supervision. Zeiler et al. [24] introduce a method to visualize CNN features at different layers and study the performance of off-the-shelf features. In a similar spirit, we study the generalization of off-the-shelf ViT features in comparison to CNNs.

The receptive field is an indication of a network's ability to model long-range dependencies. The receptive field of transformer-based models covers the entire input space, a property that resembles hand-crafted features [25], but ViTs have higher representative capacity. This allows ViTs to model global context and preserve structural information better than CNNs [26]. This work is an effort to demonstrate the effectiveness of the flexible receptive field and content-based context modeling in ViTs towards robustness and generalization of the learned features.

3 Intriguing Properties of Vision Transformers

3.1 Are Vision Transformers Robust to Occlusions?

The receptive field of a ViT spans the entire image, and the model captures interactions between the sequence of image patches using self-attention [26, 27]. We study whether ViTs perform robustly in occluded scenarios, where some or most of the image content is missing.

Occlusion Modeling: Consider a network f that processes an input image x to predict a label y, where x is represented as a patch sequence with N elements, i.e., $x = \{x_i\}_{i=1}^{N}$ [2]. While there can be multiple ways to define occlusion, we adopt a simple masking strategy, where we select a subset of the total image patches, M < N, and set the pixel values of these patches to zero to create an occluded image, x'. We refer to this approach as PatchDrop.
The objective is then to observe whether the model remains robust, i.e., $\operatorname{argmax} f(x') = y$. We experiment with three variants of our occlusion approach: (a) Random PatchDrop, (b) Salient (foreground) PatchDrop, and (c) Non-salient (background) PatchDrop.

Random PatchDrop: A subset of M patches is randomly selected and dropped (Fig. 2). Several recent vision transformers [2, 3, 4] divide an image into 196 patches on a 14×14 spatial grid; i.e., an image of size 224×224×3 is split into 196 patches, each of size 16×16×3. As an example, dropping 100 such patches from the input is equivalent to losing 51% of the image content.

Salient (foreground) PatchDrop: Not all pixels have the same importance for vision tasks. Thus, it is important to study the robustness of ViTs against occlusion of highly salient regions. We leverage a self-supervised ViT model, DINO [23], which has been shown to effectively segment salient objects. In particular, the spatial positions of information flowing into the final feature vector (class token) within the last attention block are exploited to locate the salient pixels. This allows us to control the amount of salient information captured within the selected pixels by thresholding the quantity of attention flow. We select the subset of patches containing the top Q% of foreground information (deterministic for fixed Q) and drop them. Note that this Q% does not always correspond to the pixel percentage; e.g., 50% of the foreground information of an image may be contained within only 10% of its pixels.

Non-salient (background) PatchDrop: The least salient regions of the image are selected following the same approach as above, using [23]. The patches containing the lowest Q% of foreground information are selected and dropped here. Again, this does not always correspond to the pixel percentage; e.g., 80% of the pixels may contain only 20% of the non-salient information of an image.

Figure 2: An example image with its occluded versions (Random, Salient and Non-Salient PatchDrop). The occluded images are correctly classified by DeiT-S [3] but misclassified by ResNet50 [28]. Pixel values in occluded (black) regions are set to zero.

Figure 3: Robustness against object occlusion in images is studied under three PatchDrop settings (see Sec. 3.1). (left) We study the robustness of CNN models to occlusions and identify ResNet50 as a strong baseline. (mid-left) We compare the DeiT model family against ResNet50, exhibiting their superior robustness to object occlusion. (mid-right) Comparison against the ViT model family. (right) Comparison against the T2T model family.

Robust Performance of Transformers Against Occlusions: We consider the visual recognition task with models pretrained on ImageNet [2]. The effect of occlusion is studied on the validation set (50k images). We define information loss (IL) as the ratio of dropped to total patches (M/N). IL is varied to obtain a range of occlusion levels for each PatchDrop methodology. The results (top-1 accuracy, %) reported in Fig. 3 show significantly more robust performance of ViT models compared to CNNs. In the case of Random PatchDrop, we report the mean accuracy across 5 runs; for Salient and Non-Salient PatchDrop, we report accuracy over a single run, since the occlusion mask is deterministic.
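For concreteness, the following is a minimal PyTorch sketch of the Random PatchDrop operation under the 14×14 grid of 16×16 patches described above. It is an illustrative re-implementation rather than the released code; the commented usage lines assume the timm library and its deit_small_patch16_224 checkpoint, but any ImageNet classifier accepting 224×224 inputs works.

```python
import torch

def random_patch_drop(images: torch.Tensor, drop_ratio: float, patch: int = 16) -> torch.Tensor:
    """Zero out a random subset of non-overlapping patches (Random PatchDrop).

    images: (B, C, H, W) tensor with H and W divisible by `patch`.
    drop_ratio: fraction of the N = (H // patch) * (W // patch) patches to occlude.
    """
    b, _, h, w = images.shape
    gh, gw = h // patch, w // patch
    n = gh * gw                                  # e.g. 14 * 14 = 196 patches for 224x224 inputs
    m = int(drop_ratio * n)                      # number of patches M to drop
    # Choose M distinct patch indices per image and build a (B, N) keep-mask.
    idx = torch.rand(b, n, device=images.device).argsort(dim=1)[:, :m]
    mask = torch.ones(b, n, dtype=images.dtype, device=images.device)
    mask.scatter_(1, idx, 0.0)
    # Expand the patch-level mask to pixel resolution and apply it.
    mask = mask.view(b, 1, gh, gw)
    mask = mask.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return images * mask

# Usage sketch (model loading assumes the `timm` library; any ImageNet classifier works):
#   model = timm.create_model("deit_small_patch16_224", pretrained=True).eval()
#   preds = model(random_patch_drop(batch, drop_ratio=0.5)).argmax(dim=-1)
```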
CNNs perform poorly when 50% of the image content is randomly dropped: for example, ResNet50 (23 million parameters) achieves 0.1% accuracy, whereas DeiT-S (22 million parameters) obtains 70%. An extreme example can be observed when 90% of the image information is randomly masked, yet DeiT-B still exhibits 37% accuracy. This finding is consistent across different ViT architectures [2, 3, 4]. Similarly, ViTs show significant robustness to the removal of foreground (salient) and background (non-salient) content. See Appendices A, B, C, D and E for further results on robustness analysis.

ViT Representations are Robust against Information Loss: In order to better understand model behavior under such occlusions, we visualize the attention (Fig. 4) from each head of different layers. While initial layers attend to all areas, deeper layers tend to focus more on the leftover information in the non-occluded regions of an image. We then study whether such changes from initial to deeper layers lead to token invariance against occlusion, which is important for classification. We measure the correlation coefficient between the features/tokens of original and occluded images as

$$\text{corr}(u, v) = \frac{1}{n}\sum_i \hat{u}_i \hat{v}_i, \quad \text{where } \hat{u}_i = \frac{u_i - E[u_i]}{\sigma(u_i)},$$

and $E[\cdot]$ and $\sigma(\cdot)$ denote the mean and standard deviation operators [29]. In our case, the random variables u and v refer to the feature maps of an original and an occluded image, defined over the entire ImageNet validation set. For ResNet50, we consider the features before the logit layer, and for ViT models, class tokens are extracted from the last transformer block. Class tokens from transformers are significantly more robust and do not suffer much information loss compared to ResNet50 features (Table 1). Furthermore, we visualize the correlation coefficient across 12 selected superclasses within the ImageNet hierarchy and note that the trend holds across different class types, even for relatively small object types such as insects, food items and birds (Fig. 5). See Appendix F for attention visualizations and Appendix G for qualitative results.

Given the intriguing robustness of transformer models due to dynamic receptive fields and the discriminability-preserving behaviour of the learned tokens, an ensuing question is whether the learned representations in ViTs are biased towards texture. One can expect a biased model focusing only on texture to still perform well when the spatial structure of an object is partially lost.

Figure 4: Attention maps (averaged over the entire ImageNet val. set) for each head in multiple layers of an ImageNet pre-trained DeiT-B model. All images are occluded (Random PatchDrop) with the same mask (bottom right). Observe how later layers clearly attend to the non-occluded regions of images to make a decision, evidence of the model's highly dynamic receptive field.

Table 1: Correlation coefficient between features/final class tokens of original and occluded images for Random PatchDrop (mean ± std.), averaged across the ImageNet val. set.

| Model | 25% Dropped | 50% Dropped | 75% Dropped |
| --- | --- | --- | --- |
| ResNet50 | 0.32 ± 0.16 | 0.13 ± 0.11 | 0.07 ± 0.09 |
| TnT-S | 0.83 ± 0.08 | 0.67 ± 0.12 | 0.46 ± 0.17 |
| ViT-L | 0.92 ± 0.06 | 0.81 ± 0.13 | 0.50 ± 0.21 |
| DeiT-B | 0.90 ± 0.06 | 0.77 ± 0.10 | 0.56 ± 0.15 |
| T2T-24 | 0.80 ± 0.10 | 0.60 ± 0.15 | 0.31 ± 0.17 |

Figure 5: Correlation between features/final tokens of original and occluded images for 50% Random PatchDrop. Results are averaged across classes for each superclass.
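The correlation measure above reduces to a per-sample normalised dot product between feature vectors. A minimal sketch (an illustration, not the authors' evaluation code; it assumes features are already extracted into (B, D) tensors):

```python
import torch

def token_correlation(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Per-sample correlation between features of original (u) and occluded (v) images.

    u, v: (B, D) class tokens (ViT) or pre-logit features (ResNet50).
    Implements corr(u, v) = (1 / n) * sum_i u_hat_i * v_hat_i with z-scored features.
    """
    u_hat = (u - u.mean(dim=1, keepdim=True)) / u.std(dim=1, keepdim=True)
    v_hat = (v - v.mean(dim=1, keepdim=True)) / v.std(dim=1, keepdim=True)
    # Averaging over the D feature dimensions supplies the 1/n factor.
    return (u_hat * v_hat).mean(dim=1)
```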
3.2 Shape vs. Texture: Can Transformers Model Both Characteristics?

Geirhos et al. [9] study the shape vs. texture hypothesis and propose a training framework to enhance shape-bias in CNNs. We first carry out a similar analysis and show that ViT models exhibit a shape-bias much stronger than that of a CNN, comparable to the ability of the human visual system in recognizing shapes. However, this approach results in a significant drop in accuracy on natural images. To address this issue, we introduce a shape token into the transformer architecture that learns to focus on shape, thereby modeling both shape- and texture-related features within the same architecture using distinct sets of tokens. To this end, we distill shape information from a pretrained CNN model with high shape-bias [9]. Our distillation approach strikes a balanced trade-off between high classification accuracy and strong shape-bias compared to the original ViT model. We outline both approaches below. Note that the measure introduced in [9] is used to quantify shape-bias within ViT models and compare against their CNN counterparts.

Training without Local Texture: In this approach, we first remove local texture cues from the training data by creating a stylized version of ImageNet [9], named SIN. We then train tiny and small DeiT models [3] on this dataset. Typically, ViTs use heavy data augmentation during training [3]. However, learning with SIN is a difficult task due to the reduced texture detail, and applying further augmentation to stylized samples distorts shape information and makes training unstable. Thus, we train models on SIN without applying any augmentation, label smoothing or mixup. We note that ViTs trained on ImageNet exhibit higher shape-bias than similar-capacity CNN models; e.g., DeiT-S (22 million parameters) performs better than ResNet50 (23 million parameters) (Fig. 6, right plot). In comparison, the SIN-trained ViTs consistently outperform CNNs. Interestingly, DeiT-S [3] reaches human-level performance when trained on SIN (Fig. 6, left plot).

Figure 6: Shape-bias analysis. Shape-bias is defined as the fraction of correct decisions based on object shape. (Left) The plot shows the shape-texture trade-off for CNNs, ViTs and humans across different object classes. (Right) Class-mean shape-bias comparison. Overall, ViTs perform better than CNNs, and the shape-bias increases significantly when trained on stylized ImageNet (SIN).

Table 2: Performance comparison of models trained on SIN. ViTs produce dynamic features that can be controlled by auxiliary tokens ("cls" denotes the class token). During distillation, the cls and shape tokens converge to vastly different solutions using the same features, in contrast to [3].

| Model | Distilled | Token Type | ImageNet top-1 (%) | Shape Bias |
| --- | --- | --- | --- | --- |
| DeiT-T-SIN | ✗ | cls | 40.5 | 0.87 |
| DeiT-T-SIN | ✓ | cls | 71.8 | 0.35 |
| DeiT-T-SIN | ✓ | shape | 63.4 | 0.44 |
| DeiT-S-SIN | ✗ | cls | 52.5 | 0.93 |
| DeiT-S-SIN | ✓ | cls | 75.3 | 0.39 |
| DeiT-S-SIN | ✓ | shape | 67.7 | 0.47 |

Shape Distillation: Knowledge distillation compresses large teacher models into smaller student models [30], with the teacher guiding the student through soft labels. We introduce a new shape token and adapt the attentive distillation of [3] to distill shape knowledge from a CNN trained on the SIN dataset (ResNet50-SIN [9]). We observe that ViT features are dynamic in nature and can be controlled by auxiliary tokens to focus on the desired characteristics. This means that a single ViT model can exhibit both high shape and texture bias at the same time via separate tokens (Table 2).
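As a schematic illustration only: the shape token is an extra learnable token appended to the patch sequence and read out by its own head; its output can be supervised by the shape-biased teacher while the class token follows the ImageNet labels. The sketch below assumes hard-label distillation and an equal 0.5/0.5 weighting (both assumptions on our part, in the spirit of DeiT's distillation), and the function name is ours.

```python
import torch.nn.functional as F

def shape_distillation_loss(cls_logits, shape_logits, teacher_logits, labels):
    """Schematic two-token objective: the class token follows ImageNet labels,
    the added shape token follows a shape-biased teacher (e.g. ResNet50-SIN)."""
    ce = F.cross_entropy(cls_logits, labels)
    # Hard-label distillation: the shape token matches the frozen teacher's argmax
    # decision (soft KL distillation would be an equally valid choice here).
    distill = F.cross_entropy(shape_logits, teacher_logits.argmax(dim=-1))
    return 0.5 * ce + 0.5 * distill
```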
We achieve a more balanced trade-off between classification performance and the shape-bias measure when the shape token is introduced (Fig. 7). To demonstrate that these distinct tokens (for classification and shape) indeed model unique features, we compute the cosine similarity (averaged over the ImageNet val. set) between the class and shape tokens of our distilled models, DeiT-T-SIN and DeiT-S-SIN, which turns out to be 0.35 and 0.68, respectively. This is significantly lower than the similarity between the class and distillation tokens of [3]: 0.96 and 0.94 for DeiT-T and DeiT-S, respectively. This confirms our hypothesis on modeling distinct features with separate tokens within ViTs, a unique capability that cannot be straightforwardly achieved with CNNs. Further, it offers other benefits, as we explain next.

Figure 7: Shape distillation.

Shape-biased ViT Offers Automated Object Segmentation: Interestingly, training without local texture or with shape distillation allows a ViT to concentrate on foreground objects in the scene and ignore the background (Table 3, Fig. 8). This offers automated semantic segmentation of an image even though the model is never shown pixel-wise object labels. That is, shape-bias can serve as a self-supervision signal for the ViT model to learn distinct shape-related features that help localize the correct foreground object. We note that a ViT trained without an emphasis on shape does not perform well (Table 3).

Table 3: Jaccard similarity between ground-truth masks and masks generated from the attention maps of ViT models (similar to [23], with threshold 0.9) over the PASCAL VOC 2012 validation set. Only class-level ImageNet labels are used for training these models. Our results indicate that supervised ViTs can be used for automated segmentation and perform close to the self-supervised method DINO [23].

| Model | Distilled | Token Type | Jaccard Index |
| --- | --- | --- | --- |
| DeiT-T-Random | ✗ | cls | 19.6 |
| DeiT-T | ✗ | cls | 32.2 |
| DeiT-T-SIN | ✗ | cls | 29.4 |
| DeiT-T-SIN | ✓ | cls | 40.0 |
| DeiT-T-SIN | ✓ | shape | 42.2 |
| DeiT-S-Random | ✗ | cls | 22.0 |
| DeiT-S | ✗ | cls | 29.2 |
| DeiT-S-SIN | ✗ | cls | 37.5 |
| DeiT-S-SIN | ✓ | cls | 42.0 |
| DeiT-S-SIN | ✓ | shape | 42.4 |

Figure 8: Segmentation maps from ViTs (DeiT-S-SIN and DeiT-S-SIN distilled). Shape distillation performs better than standard supervised models.
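For reference, the evaluation above turns class-token attention into a binary mask and scores it against the ground truth with the Jaccard index. The sketch below is one simple head-averaged variant under the stated 0.9 attention-mass threshold; it is illustrative (function names are ours, and the exact per-head thresholding of [23] may differ).

```python
import torch

def attention_to_mask(cls_attn: torch.Tensor, grid: int = 14, keep: float = 0.9) -> torch.Tensor:
    """Binary foreground mask from class-token attention over the N = grid*grid patches.

    cls_attn: (num_heads, N) attention from the class token to the patch tokens of the
    last block. The smallest set of patches covering `keep` of the head-averaged
    attention mass is marked as foreground.
    """
    attn = cls_attn.mean(dim=0)                   # average over heads -> (N,)
    vals, idx = attn.sort(descending=True)
    cum = vals.cumsum(dim=0) / vals.sum()
    k = int((cum < keep).sum().item()) + 1        # patches needed to reach the threshold
    mask = torch.zeros_like(attn)
    mask[idx[:k]] = 1.0
    return mask.view(grid, grid)                  # upsample to image resolution before scoring

def jaccard(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """Intersection-over-union between two binary masks of the same shape."""
    inter = torch.logical_and(pred.bool(), gt.bool()).sum().item()
    union = torch.logical_or(pred.bool(), gt.bool()).sum().item()
    return inter / max(union, 1)
```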
The above results show that properly trained ViT models offer a shape-bias nearly as high as the human ability to recognize shapes. This leads us to ask whether positional encoding is the key that helps ViTs achieve high performance under severe occlusions (as it could potentially allow later layers to recover the missing information from just a few image patches, given their spatial ordering). We examine this possibility next.

3.3 Does Positional Encoding Preserve the Global Image Context?

The transformer's ability to process long-range sequences in parallel using self-attention [27] (instead of the sequential design of RNNs [31]) is invariant to sequence ordering. For images, the order of patches represents the overall image structure and global composition. Since ViTs operate on a sequence of image patches, changing the order of the sequence, e.g., by shuffling the patches, can destroy the image structure. Current ViTs [2, 3, 4, 26] use positional encoding to preserve this context. Here, we analyze whether the sequence order modeled by the positional encoding is what allows ViTs to excel under occlusion.

Our analysis suggests that transformers exhibit high permutation invariance to patch positions, and that the effect of positional encoding in injecting structural image information into ViT models is limited (Fig. 10). This observation is consistent with findings in the language domain [32], as described below.

Figure 9: An illustration of the shuffle operation applied to images to eliminate their structural information (best viewed zoomed in).

Sensitivity to Spatial Structure: We remove the structural information within images (spatial relationships), as illustrated in Fig. 9, by defining a shuffling operation on the input image patches. Fig. 10 shows that the DeiT models [3] retain accuracy better than their CNN counterparts when the spatial structure of input images is disturbed. This also indicates that positional encoding is not absolutely crucial for correct classification decisions, and that the model does not recover the global image context using the patch sequence information preserved in the positional encodings. Without positional encoding, the ViT performs reasonably well and achieves better permutation invariance than a ViT using positional encoding (Fig. 10). Finally, when the patch size is varied during ViT training, the permutation invariance property degrades along with the accuracy on unshuffled natural images (Fig. 11).

Figure 10: Models trained on 196 image patches: top-1 (%) accuracy on the ImageNet val. set when patches are shuffled. Note that performance peaks when the shuffle grid size equals the original number of patches used during training, since this only changes the positions of input patches (without disturbing the patch content).

Figure 11: DeiT-T [3] trained on different numbers of image patches. Reducing the patch size decreases the overall performance but also increases sensitivity to the shuffle grid size.

Overall, we attribute the permutation invariance of ViTs to their dynamic receptive field, which depends on the input patch and can adjust attention over the other sequence elements, such that moderately shuffling the elements does not degrade performance significantly.
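A minimal sketch of the patch shuffling operation used in this analysis (illustrative; the function name is ours, and the same random permutation is applied to every image in the batch for simplicity):

```python
import torch

def shuffle_patches(images: torch.Tensor, grid: int) -> torch.Tensor:
    """Randomly permute the cells of a `grid` x `grid` partition of each image.

    images: (B, C, H, W) tensor with H and W divisible by `grid`.
    """
    b, c, h, w = images.shape
    ph, pw = h // grid, w // grid
    # (B, C, grid, ph, grid, pw) -> (B, grid*grid, C, ph, pw)
    cells = images.view(b, c, grid, ph, grid, pw).permute(0, 2, 4, 1, 3, 5)
    cells = cells.reshape(b, grid * grid, c, ph, pw)
    perm = torch.randperm(grid * grid, device=images.device)
    cells = cells[:, perm]                        # shuffle the grid cells
    # Invert the reshaping to recover a (B, C, H, W) image.
    cells = cells.view(b, grid, grid, c, ph, pw).permute(0, 3, 1, 4, 2, 5)
    return cells.reshape(b, c, h, w)
```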
The above analysis shows that, just as the texture-bias hypothesis does not apply to ViTs, the hypothesis that they depend on positional encodings to perform well under occlusions is also incorrect. This leads us to conclude that the robustness of ViTs stems from their flexible and dynamic receptive field (see Fig. 4), which depends on the content of the input image. We now delve deeper into the robustness of ViTs and study their performance under adversarial perturbations and common corruptions.

3.4 Robustness of Vision Transformers to Adversarial and Natural Perturbations

After analyzing the ability of ViTs to encode shape information (Sec. 3.2), one ensuing question is: does a higher shape-bias help achieve better robustness? In Table 4, we investigate this by calculating the mean corruption error (mCE) [13] on a variety of synthetic common corruptions (e.g., rain, fog, snow and noise). A ViT with a similar parameter count to a CNN (e.g., DeiT-S) is more robust to image corruptions than ResNet50 trained with augmentations (AugMix [33]). Interestingly, CNNs and ViTs trained without augmentations on ImageNet or SIN are more vulnerable to corruptions. These findings are consistent with [10] and suggest that augmentations improve robustness against common corruptions.

Table 4: Mean corruption error (mCE) across common corruptions [13] (lower is better). While ViTs are more robust than CNNs, training for a higher shape-bias makes both CNNs and ViTs more vulnerable to natural distribution shifts. All models trained with augmentations (ViT or CNN) have lower mCE than models trained without augmentations on ImageNet or SIN.

| Training | Model | mCE |
| --- | --- | --- |
| With augmentations | DeiT-B | 48.5 |
| With augmentations | DeiT-S | 54.6 |
| With augmentations | DeiT-T | 71.1 |
| With augmentations | T2T-24 | 49.1 |
| With augmentations | TnT-S | 53.1 |
| With augmentations | AugMix ResNet50 | 65.3 |
| Without augmentation | ResNet50 | 76.7 |
| Without augmentation | ResNet50-SIN | 77.3 |
| Without augmentation | DeiT-T-SIN | 94.4 |
| Without augmentation | DeiT-S-SIN | 84.0 |

We observe similar behavior against an untargeted, universal adversarial patch attack [17] and sample-specific attacks, including the single-step fast gradient sign method (FGSM) [34] and the multi-step projected gradient descent attack (PGD) [35]. The adversarial patch attack [17] is unbounded, i.e., it can change pixel values at certain locations in the input image by any amount, while the sample-specific attacks [34, 35] are bounded in the $l_\infty$ norm with a perturbation budget $\epsilon$, which specifies the amount by which each pixel in the image may be changed. ViTs and CNNs trained on SIN are significantly more vulnerable to adversarial attacks than models trained on ImageNet (Figs. 12 and 13), due to the shape-bias vs. robustness trade-off [10].

Figure 12: Robustness against the adversarial patch attack. ViTs, even with fewer parameters, exhibit higher robustness than CNNs. Models trained on ImageNet are more robust than those trained on SIN. Results are averaged across five runs of the patch attack over the ImageNet val. set.

Figure 13: Robustness against sample-specific attacks, including single-step FGSM [34] and multi-step PGD [35] (run for 5 iterations only). ViTs, even with fewer parameters, exhibit higher robustness than CNNs. Attacks are evaluated under the $l_\infty$ norm, where $\epsilon$ is the perturbation budget by which each pixel may be changed in the input image. Results are reported over the ImageNet val. set.
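A minimal sketch of the single-step FGSM attack evaluated above (illustrative; it assumes inputs in [0, 1] with any normalisation folded into the model, and the function name is ours):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, eps):
    """Single-step FGSM [34] under an l_inf budget `eps` (inputs assumed in [0, 1])."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    grad = torch.autograd.grad(loss, images)[0]
    adv = images + eps * grad.sign()              # one step along the gradient sign direction
    return adv.clamp(0.0, 1.0).detach()
```

PGD repeats this step several times (5 iterations in Fig. 13), projecting back into the epsilon-ball after each update.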
Given the strong robustness properties of ViTs, as well as their representation capability in terms of shape-bias, automated segmentation and flexible receptive fields, we analyze their utility as off-the-shelf feature extractors to replace CNNs as the default feature extraction mechanism [36].

3.5 Effective Off-the-shelf Tokens for Vision Transformers

A unique characteristic of ViT models is that each block within the model generates a class token which can be processed by the classification head separately (Fig. 14). This allows us to measure the discriminative ability of each individual block of an ImageNet pre-trained ViT, as shown in Fig. 15. Class tokens generated by the deeper blocks are more discriminative, and we use this insight to identify an effective ensemble of blocks whose tokens have the best downstream transferability.

Figure 14: A single ViT model can provide a feature ensemble, since the class token from each block can be processed by the classifier independently. This allows us to identify the most discriminative tokens useful for transfer learning.

Figure 15: Top-1 (%) accuracy on the ImageNet val. set for class tokens produced by each ViT block. Class tokens from the last few layers exhibit the highest performance, indicating the most discriminative tokens.

Transfer Methodology: As illustrated in Fig. 15, we analyze the block-wise classification accuracy of DeiT models and determine that the discriminative information is captured within the class tokens of the last few blocks. Accordingly, we conduct an ablation study for off-the-shelf transfer learning on the fine-grained classification datasets CUB [37] and Flowers [38] and the large-scale iNaturalist [39] using DeiT-S [3], as reported in Table 5. Here, we concatenate the class tokens (optionally combined with averaged patch tokens) from different blocks and train a linear classifier to transfer the features to downstream tasks. Note that a patch token is generated by averaging along the patch dimension. The scheme that concatenates class tokens from the last four blocks shows the best transfer learning performance; we refer to this transfer methodology as DeiT-S (ensemble). Concatenating both class and averaged patch tokens from all blocks achieves similar performance to the class tokens from the last four blocks but requires significantly more parameters to train. We find one exception on the Flowers dataset [38], where using class tokens from all blocks yields a relatively small improvement (only 1.2%) over the class tokens from the last four blocks (Table 5). However, concatenating tokens from all blocks also increases the number of parameters; e.g., transfer to Flowers with all tokens requires 3 times more learnable parameters than using only the last four tokens.

Table 5: Ablation study for off-the-shelf feature transfer on three datasets using an ImageNet pretrained DeiT-S [3]. A linear classifier is learned on either a concatenation of class tokens alone or a combination of class and averaged patch tokens at various blocks. Class tokens from blocks 9-12 are the most discriminative (Fig. 15) and have the highest transferability in terms of top-1 (%) accuracy.

| Blocks | Class Tokens | Patch Tokens | CUB [37] | Flowers [38] | iNaturalist [39] |
| --- | --- | --- | --- | --- | --- |
| Only 12th (last block) | ✓ | ✗ | 68.16 | 82.58 | 38.28 |
| Only 12th (last block) | ✓ | ✓ | 70.66 | 86.58 | 41.22 |
| From 1st to 12th | ✓ | ✗ | 72.90 | 91.38 | 44.03 |
| From 1st to 12th | ✓ | ✓ | 73.16 | 91.27 | 43.33 |
| From 9th to 12th | ✓ | ✗ | 73.58 | 90.00 | 45.15 |
| From 9th to 12th | ✓ | ✓ | 73.37 | 90.33 | 45.12 |

We conduct further experiments with DeiT-S (ensemble) across a broader range of tasks to validate our hypothesis, and compare against a pre-trained ResNet50 baseline using its features before the logit layer.

Visual Classification: We analyze the transferability of off-the-shelf features across several datasets, including Aircraft [40], CUB [37], DTD [41], GTSRB [42], Fungi [43], Places365 [44] and iNaturalist [39]. These datasets cover fine-grained recognition, texture classification, traffic sign recognition, species classification and scene recognition, with 100, 200, 47, 43, 1394, 365 and 1010 classes, respectively. We train a linear classifier on top of the extracted features on the train split of each dataset and evaluate performance on the respective test splits. The ViT features show clear improvements over the CNN baseline (Fig. 16). We note that DeiT-T, which requires about 5 times fewer parameters than ResNet50, performs better across all datasets, and the model with the proposed ensemble strategy achieves the best results across all datasets.

Figure 16: Off-the-shelf ViT features transfer better than CNN features. We explore the transferability of learned representations using generic classification as well as few-shot classification for out-of-domain tasks. For classification (left), the ImageNet pre-trained ViTs transfer better than their CNN counterparts across tasks. For few-shot learning (right), ImageNet pre-trained ViTs perform better on average.
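For reference, the class-token ensemble described in the transfer methodology can be sketched as below. This is an illustrative extraction routine, not the released code; it assumes a timm-style DeiT/ViT where `model.blocks[i]` outputs a (B, 1 + N, D) token sequence with the class token at index 0.

```python
import torch

@torch.no_grad()
def blockwise_class_tokens(model, images, block_ids=(8, 9, 10, 11)):
    """Concatenate class tokens from selected transformer blocks via forward hooks.

    Block indices are 0-based, so (8, 9, 10, 11) corresponds to blocks 9-12
    in the paper's numbering (the last four blocks of DeiT-S).
    """
    feats = {}
    handles = [
        model.blocks[i].register_forward_hook(
            lambda _module, _inputs, output, i=i: feats.__setitem__(i, output[:, 0].detach())
        )
        for i in block_ids
    ]
    model(images)                                 # forward pass only to trigger the hooks
    for h in handles:
        h.remove()
    return torch.cat([feats[i] for i in block_ids], dim=-1)   # (B, 4 * D) for DeiT-S

# A linear classifier is then trained on these concatenated features
# for each downstream dataset, as described above.
```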
Few-Shot Learning: We consider Meta-Dataset [45], a large-scale few-shot learning (FSL) benchmark containing a diverse set of datasets from multiple domains. These include letters of alphabets, hand-drawn sketches, images of textures, and fine-grained classes, making it a challenging benchmark that also involves a domain adaptation requirement. We follow the standard setting of training on ImageNet and testing on all other datasets, which are treated as downstream tasks. In our experiments, we use a network pre-trained for classification on ImageNet to extract features. For each downstream dataset, under the FSL setting, a support set of labelled images is available for every test query. We use the extracted features to learn a linear classifier over the support set for each query (similar to [46]) and evaluate using the standard FSL protocol defined in [45]. This evaluation involves a varying number of shots specific to each downstream dataset. On average, the ViT features transfer better across these diverse domains than the CNN baseline (Fig. 16). Furthermore, we note that the transfer performance of ViTs is further boosted by the proposed ensemble strategy. We also highlight the improvement on QuickDraw, a dataset of hand-drawn sketches, which aligns with our findings on the improved shape-bias of ViT models compared to CNN models (see Sec. 3.2 for an elaborate discussion).

4 Discussion and Conclusions

In this paper, we analyze intriguing properties of ViTs in terms of robustness and generalizability. We test a variety of ViT models on fifteen vision datasets; all models are trained on 4 V100 GPUs. We demonstrate favorable merits of ViTs over CNNs for occlusion handling, robustness to distributional shifts and patch permutations, automatic segmentation without pixel supervision, and robustness against adversarial patches, sample-specific adversarial attacks and common corruptions. Moreover, we demonstrate strong transferability of off-the-shelf ViT features to a number of downstream tasks using the proposed feature ensemble from a single ViT model. An interesting future research direction is to explore how the diverse range of cues modeled within a single ViT using separate tokens can be effectively combined to complement each other. Similarly, we found that the auto-segmentation property of ViTs stems from their ability to encode shape information. We believe that integrating our approach with DINO [23] is worth exploring in the future. To highlight a few open research questions: (a) can self-supervision on stylized ImageNet (SIN) improve the segmentation ability of DINO?, and (b) can a modified DINO training scheme with texture-based (IN) local views and shape-based (SIN) global views enhance (and generalize) its auto-segmentation capability?

Our current set of experiments is based on ImageNet (ILSVRC'12) pre-trained ViTs, which poses the risk of reflecting potential biases in the learned representations. The data is mostly Western and encodes several gender/ethnicity stereotypes, with under-representation of certain groups [47]. This version of ImageNet also poses privacy risks due to unblurred human faces. In the future, we will use a recent ImageNet version that addresses these issues [48].

Acknowledgments

This work is supported in part by NSF CAREER grant 1149783 and VR starting grant (2016-05543). M. Hayat is supported by Australian Research Council DECRA fellowship DE200101100.
References

[1] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. arXiv preprint arXiv:2101.01169, 2021.
[2] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[3] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.
[4] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. arXiv preprint arXiv:2101.11986, 2021.
[5] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909, 2019.
[6] Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3464-3473, 2019.
[7] Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, and Jonathon Shlens. Scaling local self-attention for parameter efficient visual backbones. arXiv preprint arXiv:2103.12731, 2021.
[8] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
[9] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
[10] Chaithanya Kumar Mummadi, Ranjitha Subramaniam, Robin Hutmacher, Julien Vitay, Volker Fischer, and Jan Hendrik Metzen. Does enhanced shape bias improve neural network robustness to common corruptions? In International Conference on Learning Representations, 2021.
[11] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[12] Muzammal Naseer, Salman H Khan, Harris Khan, Fahad Shahbaz Khan, and Fatih Porikli. Cross-domain transferability of adversarial perturbations. Advances in Neural Information Processing Systems, 2019.
[13] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
[14] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5542-5550, 2017.
[15] Rulin Shao, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, and Cho-Jui Hsieh. On the adversarial robustness of visual transformers. arXiv preprint arXiv:2103.15670, 2021.
[16] Srinadh Bhojanapalli, Ayan Chakrabarti, Daniel Glasner, Daliang Li, Thomas Unterthiner, and Andreas Veit. Understanding robustness of transformers for image classification. arXiv preprint arXiv:2103.14586, 2021.
[17] Tom B Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch. arXiv preprint arXiv:1712.09665, 2017.
[18] Sayak Paul and Pin-Yu Chen. Vision transformers are robust learners. arXiv preprint arXiv:2105.07581, 2021.
[19] Wieland Brendel and Matthias Bethge. Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. arXiv preprint arXiv:1904.00760, 2019.
[20] Md Amirul Islam, Matthew Kowal, Patrick Esser, Sen Jia, Bjorn Ommer, Konstantinos G Derpanis, and Neil Bruce. Shape or texture: Understanding discriminative features in CNNs. arXiv preprint arXiv:2101.11604, 2021.
[21] David V Foster and Peter Grassberger. Lower bounds on mutual information. Physical Review E, 83(1):010101, 2011.
[22] Shikhar Tuli, Ishita Dasgupta, Erin Grant, and Thomas L. Griffiths. Are convolutional neural networks or transformers more like human vision? arXiv preprint arXiv:2105.07197, 2021.
[23] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021.
[24] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818-833. Springer, 2014.
[25] Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. Saliency detection via graph-based manifold ranking. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3166-3173, 2013.
[26] Yuxin Mao, Jing Zhang, Zhexiong Wan, Yuchao Dai, Aixuan Li, Yunqiu Lv, Xinyu Tian, Deng-Ping Fan, and Nick Barnes. Transformer transforms salient object detection and camouflaged object detection. arXiv preprint arXiv:2104.10127, 2021.
[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
[28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[29] David Forsyth. Probability and Statistics for Computer Science. Springer, 2018.
[30] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[31] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[32] Kazuki Irie, Albert Zeyer, Ralf Schlüter, and Hermann Ney. Language modeling with deep transformers. arXiv preprint arXiv:1905.04226, 2019.
[33] Dan Hendrycks, Norman Mu, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. AugMix: A simple data processing method to improve robustness and uncertainty. In International Conference on Learning Representations, 2020.
[34] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2014.
[35] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
[36] A. Razavian, Hossein Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 512-519, 2014.
[37] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[38] M-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing, 2008.
[39] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8769-8778, 2018.
[40] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013.
[41] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3606-3613, 2014.
[42] Sebastian Houben, Johannes Stallkamp, Jan Salmen, Marc Schlipsing, and Christian Igel. Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In International Joint Conference on Neural Networks, 2013.
[43] Brigit Schroeder and Yin Cui. FGVCx fungi classification challenge 2018. github.com/visipedia/fgvcx_fungi_comp, 2018.
[44] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452-1464, 2017.
[45] Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, and Hugo Larochelle. Meta-Dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096, 2019.
[46] Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: A good embedding is all you need? arXiv preprint arXiv:2003.11539, 2020.
[47] Kaiyu Yang, Klint Qinami, Li Fei-Fei, Jia Deng, and Olga Russakovsky. Towards fairer datasets: Filtering and balancing the distribution of the people subtree in the ImageNet hierarchy. In ACM Conference on Fairness, Accountability, and Transparency, pages 547-558, 2020.
[48] Kaiyu Yang, Jacqueline Yau, Li Fei-Fei, Jia Deng, and Olga Russakovsky. A study of face obfuscation in ImageNet. arXiv preprint arXiv:2103.06191, 2021.
[49] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In IEEE International Conference on Computer Vision, 2021.
[50] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.