
Abstract
ViT’s multi-head self-attention flexibly attends over sequences of image patches. An important question is whether this flexibility helps the model cope with nuisances in natural images, such as occlusion, perturbation, and domain shift. We conducted a number of experiments comparing the properties of ViTs and CNNs.
(a) The transformer is robust to severe occlusion, perturbation, and domain shift. For example, it achieved 60% top-1 accuracy even with 80% of the image removed by occlusion.

(b) The robustness in (a) is not due to a texture bias; rather, ViT is less biased toward local texture than CNNs. When trained to encode shape-based features, ViT can recognize shapes to a degree comparable to human ability, which had not been shown in previous studies. (c) Using ViT to encode shape representations, we achieved accurate semantic segmentation without pixel-level supervision. (d) Off-the-shelf features from a single ViT model can be combined into feature ensembles, achieving higher accuracy. Overall, we find that ViT’s flexible and dynamic receptive field is a key reason for these properties.
Intriguing Properties of Vision Transformer
Are Vision Transformers Robust to Occlusions?
Occlusion Modeling:
Given an image x with label y, the image is represented as a sequence of N patches. We use a method (called PatchDrop in the paper) that selects M of these N patches and replaces them with 0 to create an occluded image x’. We applied PatchDrop in the following three ways:
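The random variant of PatchDrop can be sketched as follows. This is a minimal illustration, not the paper’s code; the patch size and drop ratio are assumed parameters.

```python
import torch

def patch_drop(x, patch_size=16, drop_ratio=0.5, generator=None):
    """Random PatchDrop sketch: zero out a random fraction of image patches.

    x: (B, C, H, W) image batch; H and W must be divisible by patch_size.
    drop_ratio: M/N, the fraction of patches replaced with 0.
    """
    B, C, H, W = x.shape
    gh, gw = H // patch_size, W // patch_size
    n = gh * gw                      # N patches per image
    m = int(n * drop_ratio)          # M patches to drop
    # View the image as a (gh, gw) grid of patch_size x patch_size patches.
    patches = x.view(B, C, gh, patch_size, gw, patch_size)
    out = patches.clone()
    for b in range(B):
        idx = torch.randperm(n, generator=generator)[:m]
        rows, cols = idx // gw, idx % gw
        # Zero the selected patches of image b.
        out[b, :, rows, :, cols, :] = 0.0
    return out.view(B, C, H, W)
```

The salient and non-salient variants differ only in how the M patch indices are chosen (foreground vs. background), so the same skeleton applies.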

Robust Performance of Transformer Against Occlusions
- Models were trained on ImageNet for classification and evaluated by top-1 accuracy on the validation set.
- Information Loss (IL): defined as the fraction of dropped patches out of all patches (IL = M/N).
- The graph below shows that ViT is far more robust to occlusion than CNNs.

ViT Representations are Robust against Information Loss
To better understand the model’s response to occlusion, we visualized the attention of each head in different layers. In the early layers, the heads attend to all regions, but as we go deeper, they focus on the unoccluded regions of the image.
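Per-head attention maps of this kind can be extracted directly from a self-attention layer. The sketch below uses a toy `nn.MultiheadAttention` layer with illustrative sizes (64-dim embeddings, 4 heads, a 14×14 patch grid); in practice the weights would come from a trained ViT block.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not the paper's configuration.
embed_dim, num_heads, num_patches = 64, 4, 196  # 196 = 14 x 14 patch grid

attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
tokens = torch.randn(1, num_patches, embed_dim)  # patch tokens of one image

# average_attn_weights=False keeps a separate attention map per head.
_, weights = attn(tokens, tokens, tokens,
                  need_weights=True, average_attn_weights=False)
# weights: (batch, heads, query, key) = (1, 4, 196, 196)

# Each head's attention from a chosen query token can be reshaped to the
# 14x14 patch grid and overlaid on the image for visualization.
head_maps = weights[0, :, 0].reshape(num_heads, 14, 14)
```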

We next checked whether the features become invariant to these occlusions as depth increases.
We computed the correlation coefficient between the features (or tokens) of the original and occluded images. For ResNet50 we used the features before the logit layer, and for ViT we took the class token of the last transformer block. Compared to ResNet, ViT’s class tokens were more robust (i.e., higher correlation). The same behavior held for other datasets with relatively small objects.
