Title: BOAT: Bilateral Local Attention Vision Transformer

Jan 31, 2022

Tan Yu, Gangming Zhao, Ping Li, Yizhou Yu

Abstract: Vision Transformers have achieved outstanding performance in many computer vision tasks. Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large. To improve efficiency, recent Vision Transformers adopt local self-attention mechanisms, where self-attention is computed within local windows. Although window-based local self-attention significantly boosts efficiency, it fails to capture the relationships between distant but similar patches in the image plane. To overcome this limitation of image-space local attention, in this paper we further exploit the locality of patches in the feature space. We group the patches into multiple clusters using their features, and self-attention is computed within every cluster. Such feature-space local attention effectively captures the connections between patches that lie in different local windows but are still relevant to each other. We propose a Bilateral lOcal Attention vision Transformer (BOAT), which integrates feature-space local attention with image-space local attention. We further integrate BOAT with both Swin and CSWin models, and extensive experiments on several benchmark datasets demonstrate that our BOAT-CSWin model clearly and consistently outperforms existing state-of-the-art CNN models and vision Transformers.
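
The two kinds of locality described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and several parts are assumptions made only for illustration: the abstract does not name a clustering algorithm, so plain k-means is swapped in; attention is single-head with identity query/key/value projections; and the two branches are combined by simple addition, which the abstract does not specify. All function names (`self_attention`, `kmeans_labels`, `image_space_local_attention`, `feature_space_local_attention`) are hypothetical.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(tokens):
    """Single-head self-attention over tokens of shape (n, d).

    Simplification (assumption): Q, K and V are the tokens themselves,
    with no learned projections.
    """
    d = tokens.shape[-1]
    weights = softmax(tokens @ tokens.T / np.sqrt(d))
    return weights @ tokens


def kmeans_labels(features, num_clusters, num_iters=10, seed=0):
    """Plain k-means, used here as a stand-in clustering step (assumption)."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), num_clusters, replace=False)]
    for _ in range(num_iters):
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for k in range(num_clusters):
            members = features[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return labels


def image_space_local_attention(grid, window):
    """Attention inside non-overlapping window x window blocks of an (H, W, d) grid."""
    H, W, d = grid.shape
    out = np.empty_like(grid)
    for i in range(0, H, window):
        for j in range(0, W, window):
            block = grid[i:i + window, j:j + window].reshape(-1, d)
            out[i:i + window, j:j + window] = self_attention(block).reshape(window, window, d)
    return out


def feature_space_local_attention(tokens, num_clusters):
    """Attention inside clusters of patches grouped by feature similarity."""
    labels = kmeans_labels(tokens, num_clusters)
    out = np.empty_like(tokens)
    for k in range(num_clusters):
        idx = np.where(labels == k)[0]
        if idx.size > 0:
            out[idx] = self_attention(tokens[idx])
    return out, labels


# Toy input: an 8 x 8 grid of patches with 16-dimensional features.
rng = np.random.default_rng(0)
grid = rng.normal(size=(8, 8, 16))
tokens = grid.reshape(-1, 16)

local = image_space_local_attention(grid, window=4).reshape(-1, 16)
clustered, labels = feature_space_local_attention(tokens, num_clusters=4)

# Assumption: the two branches are merged by addition.
bilateral = local + clustered
```

In this sketch, patches in the same cluster attend to one another regardless of where they sit on the grid, which is the property the abstract attributes to feature-space local attention. Each branch only forms attention matrices over a window or a cluster rather than over all patches, which is the source of the efficiency gain over global self-attention.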
