Recently, several Vision Transformer (ViT) based methods have been proposed for Fine-Grained Visual Classification (FGVC). These methods significantly surpass existing CNN-based ones, demonstrating the effectiveness of ViT in FGVC tasks. However, there are some limitations when applying ViT directly to FGVC. First, ViT needs to split images into patches and compute the attention between every pair of patches, which may result in heavy redundant computation and unsatisfactory performance when handling fine-grained images with complex backgrounds and small objects. Second, a standard ViT only utilizes the class token in the final layer for classification, which is insufficient for extracting comprehensive fine-grained information. To address these issues, we propose a novel ViT-based fine-grained object discriminator for FGVC tasks, ViT-FOD for short. Specifically, besides a ViT backbone, it further introduces three novel components, i.e., Attention Patch Combination (APC), Critical Regions Filter (CRF), and Complementary Tokens Integration (CTI). Among them, APC pieces together informative patches from two images to generate a new image, so that redundant computation can be reduced. CRF emphasizes tokens corresponding to discriminative regions to generate a new class token for subtle feature learning. To extract comprehensive information, CTI integrates complementary information captured by class tokens in different ViT layers. We conduct comprehensive experiments on widely used datasets, and the results demonstrate that ViT-FOD achieves state-of-the-art performance.
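
As a concrete illustration of the token-fusion idea behind CTI, the minimal PyTorch sketch below fuses class tokens collected from several ViT layers into a single prediction. The abstract does not specify how the tokens are combined, so the module name, the per-layer linear heads, and the averaging fusion are all illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class ComplementaryTokensIntegration(nn.Module):
    """Hypothetical sketch of a CTI-style module.

    The abstract only states that CTI "integrates complementary
    information captured by class tokens in different ViT layers";
    the per-layer classifier heads and averaged fusion used here
    are assumptions made for illustration.
    """

    def __init__(self, embed_dim: int, num_layers: int, num_classes: int):
        super().__init__()
        # One linear classification head per selected layer's class token
        # (an assumption; other fusion schemes are equally plausible).
        self.heads = nn.ModuleList(
            nn.Linear(embed_dim, num_classes) for _ in range(num_layers)
        )

    def forward(self, class_tokens: list[torch.Tensor]) -> torch.Tensor:
        # class_tokens: list of (batch, embed_dim) tensors, one per layer.
        logits = [head(tok) for head, tok in zip(self.heads, class_tokens)]
        # Average the per-layer predictions into a single set of logits.
        return torch.stack(logits, dim=0).mean(dim=0)

# Usage: fuse class tokens taken from, e.g., the last three ViT blocks.
cti = ComplementaryTokensIntegration(embed_dim=768, num_layers=3, num_classes=200)
tokens = [torch.randn(4, 768) for _ in range(3)]  # dummy class tokens
print(cti(tokens).shape)  # torch.Size([4, 200])
```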