Title: Global-Local Similarity for Efficient Fine-Grained Image Recognition with Vision Transformers

Jul 17, 2024

Edwin Arkel Rios, Min-Chun Hu, Bo-Cheng Lai

Abstract: Fine-grained recognition involves the classification of images from subordinate macro-categories, and it is challenging due to small inter-class differences. To overcome this, most methods perform discriminative feature selection enabled by a feature extraction backbone followed by a high-level feature refinement step. Recently, many studies have shown the potential of vision transformers as a backbone for fine-grained recognition, but using the transformer's attention mechanism to select discriminative tokens can be computationally expensive. In this work, we propose a novel and computationally inexpensive metric to identify discriminative regions in an image. We compare the similarity between the global representation of an image, given by the CLS token (a learnable token used by transformers for classification), and the local representations of individual patches. We select the regions with the highest similarity to obtain crops, which are forwarded through the same transformer encoder. Finally, high-level features of the original and cropped representations are refined together in order to make more robust predictions. Through extensive experimental evaluation we demonstrate the effectiveness of our proposed method, obtaining favorable results in terms of accuracy across a variety of datasets. Furthermore, our method achieves these results at a much lower computational cost than the alternatives. Code and checkpoints are available at: https://github.com/arkel23/GLSim
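
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that). The abstract does not name the similarity measure or the cropping rule, so the use of cosine similarity, the `top_k` value, the bounding-box crop, and the function name `select_crop` are all assumptions made here for illustration.

```python
import numpy as np


def select_crop(tokens, grid_size, top_k=4):
    """Pick a crop region from the patches most similar to the CLS token.

    tokens: array of shape (1 + N, D); row 0 is the CLS token and the
        remaining N = grid_size * grid_size rows are patch tokens in
        row-major grid order (an assumed layout).
    Returns the per-patch similarities and a bounding box
    (row_min, col_min, row_max, col_max) in patch-grid coordinates.
    """
    cls_token, patches = tokens[0], tokens[1:]

    # Assumption: cosine similarity between global (CLS) and local (patch) tokens.
    cls_unit = cls_token / (np.linalg.norm(cls_token) + 1e-8)
    patch_units = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + 1e-8)
    similarity = patch_units @ cls_unit  # shape (N,)

    # Indices of the top_k most similar patches.
    top = np.argsort(similarity)[-top_k:]

    # Assumption: the crop is the bounding box enclosing the selected patches.
    rows, cols = np.divmod(top, grid_size)
    box = (int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max()))
    return similarity, box
```

Under these assumptions, multiplying the box coordinates by the patch size in pixels would give the image region to crop, which could then be resized and passed through the same encoder. The cost is a single matrix-vector product over the patch tokens, in line with the abstract's description of the metric as computationally inexpensive.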

* Main: 12 pages, 5 figures, 5 tables. Appendix: 9 pages, 9 figures, 10 tables. Total: 21 pages, 14 figures, 15 tables
