Di Ming

ViT-P: Rethinking Data-efficient Vision Transformers from Locality

Mar 04, 2022

Bin Chen, Ran Wang, Di Ming, Xin Feng

Abstract: Recent advances in Transformers have brought new trust to computer vision tasks. However, on small datasets, Transformers are hard to train and perform worse than convolutional neural networks. We make vision transformers as data-efficient as convolutional neural networks by introducing a multi-focal attention bias. Inspired by the attention distance in a well-trained ViT, we constrain the self-attention of ViT to have a multi-scale localized receptive field. The size of the receptive field is adaptable during training, so that the optimal configuration can be learned. We provide empirical evidence that properly constraining the receptive field can reduce the amount of training data needed for vision transformers. On CIFAR-100, our ViT-P Base model achieves state-of-the-art accuracy (83.16%) when trained from scratch. We also perform an analysis on ImageNet to show that our method does not lose accuracy on large datasets.
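
To make the idea of a localized, multi-scale receptive field concrete, here is a minimal sketch of a distance-based attention bias on a patch grid. It is an illustration under stated assumptions, not the paper's formulation: the function names, the hard 0/-inf window, the Euclidean patch distance, and the per-head radii in `HEAD_RADII` are all hypothetical, and the radii are fixed here, whereas the abstract describes receptive-field sizes that adapt during training.

```python
import math

def patch_distances(grid):
    """Euclidean distance between every pair of patches on a grid x grid layout."""
    coords = [(r, c) for r in range(grid) for c in range(grid)]
    return [[math.hypot(r1 - r2, c1 - c2) for (r2, c2) in coords]
            for (r1, c1) in coords]

def local_bias(grid, radius):
    """Additive bias: 0 for patches within `radius`, -inf for patches outside it."""
    return [[0.0 if d <= radius else -math.inf for d in row]
            for row in patch_distances(grid)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention(scores, bias):
    """Row-wise softmax(scores + bias); masked patches receive zero weight."""
    return [softmax([s + b for s, b in zip(srow, brow)])
            for srow, brow in zip(scores, bias)]

# Hypothetical radii: one receptive-field size per attention head (assumption).
HEAD_RADII = [1.0, 2.0, 4.0]

grid = 4                                # 4 x 4 = 16 patches
n = grid * grid
scores = [[0.0] * n for _ in range(n)]  # uniform raw scores, for illustration only
weights_per_head = [biased_attention(scores, local_bias(grid, r))
                    for r in HEAD_RADII]
```

With uniform scores, the first head spreads the attention of the corner patch over only its immediate neighbours and itself, while heads with larger radii cover progressively more of the grid. A trainable variant would need a differentiable stand-in for the hard window; that choice is left open here.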