Although vision Transformers have achieved excellent performance as backbone models in many vision tasks, most of them aim to capture global relations among all tokens in an image or a window, which disrupts the inherent spatial and local correlations between patches in the 2D structure of an image. In this paper, we introduce a simple vision Transformer, named SimViT, to incorporate spatial structure and local information into vision Transformers. Specifically, we introduce Multi-head Central Self-Attention (MCSA) in place of conventional Multi-head Self-Attention to capture highly local relations, and the use of sliding windows facilitates the capture of spatial structure. Meanwhile, SimViT extracts multi-scale hierarchical features from different layers for dense prediction tasks. Extensive experiments show that SimViT is effective and efficient as a general-purpose backbone model for various image processing tasks. In particular, our SimViT-Micro needs only 3.3M parameters to achieve 71.1% top-1 accuracy on the ImageNet-1k dataset, making it the smallest vision Transformer model to date. Our code will be available at https://github.com/ucasligang/SimViT.
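To make the central self-attention idea concrete, here is a minimal sketch in which each query token attends only to a k × k sliding window centred on its own position rather than to all tokens. This is an illustrative PyTorch implementation under that assumption; the class name, default hyperparameters, and the zero-padding at image borders are our own choices, not the official SimViT code.

```python
# Illustrative sketch of "central" self-attention: every patch token
# (the window centre) attends only to its k x k spatial neighbourhood.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralSelfAttention(nn.Module):
    def __init__(self, dim, num_heads=4, window_size=3):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.window_size = window_size
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) patch tokens, with N = H * W laid out row by row.
        B, N, C = x.shape
        k = self.window_size

        # Queries: one per token, split into heads -> (B, heads, N, head_dim).
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).permute(0, 2, 1, 3)

        # Keys/values: gather the k x k neighbourhood of every position with unfold
        # (zero-padding at the borders, a simplification for this sketch).
        kv = self.kv(x).reshape(B, H, W, 2 * C).permute(0, 3, 1, 2)   # (B, 2C, H, W)
        kv = F.unfold(kv, kernel_size=k, padding=k // 2)              # (B, 2C*k*k, N)
        kv = kv.reshape(B, 2, self.num_heads, self.head_dim, k * k, N)
        key, value = kv[:, 0], kv[:, 1]                               # (B, heads, hd, k*k, N)

        # Each central query scores only its own local window (k*k entries).
        attn = torch.einsum('bhnd,bhdkn->bhnk', q, key) * self.scale  # (B, heads, N, k*k)
        attn = attn.softmax(dim=-1)
        out = torch.einsum('bhnk,bhdkn->bhnd', attn, value)           # (B, heads, N, hd)
        out = out.permute(0, 2, 1, 3).reshape(B, N, C)
        return self.proj(out)
```

For example, a 224 × 224 input split into 4 × 4 patches yields N = 56 × 56 = 3136 tokens; with window_size = 3, each token's attention map covers only 9 neighbours instead of all 3136 tokens, which is what keeps the attention highly local and cheap.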