Transformers, a groundbreaking architecture originally proposed for Natural Language Processing (NLP), have also achieved remarkable success in Computer Vision. A cornerstone of this success is the attention mechanism, which models relationships among tokens. While tokenization in NLP inherently ensures that a single token does not span multiple semantic concepts, the Vision Transformer (ViT) builds its tokens from uniformly partitioned square image patches, which may arbitrarily mix visual concepts within a single token. In this work, we propose to replace the grid-based tokenization of ViT with superpixel tokenization, which uses superpixels to generate tokens that each encapsulate a single visual concept. However, the diverse shapes, sizes, and locations of superpixels make integrating them into ViT tokenization challenging. Our tokenization pipeline, comprising pre-aggregate extraction and superpixel-aware aggregation, overcomes these challenges. Extensive experiments demonstrate that our approach, which is highly compatible with existing frameworks, improves the accuracy and robustness of ViT on various downstream tasks.
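
To make the core idea concrete, the following is a minimal sketch of superpixel-based tokenization. It assumes SLIC superpixels and simple per-superpixel mean pooling as the aggregation step; the function name `superpixel_tokenize`, the RGB-plus-coordinate pixel features, and the linear projection are hypothetical stand-ins and do not reflect the paper's actual pre-aggregate extraction or superpixel-aware aggregation modules.

```python
# A minimal, illustrative sketch of superpixel tokenization (not the paper's method):
# partition the image into superpixels, compute per-pixel features, and mean-pool
# the features inside each superpixel to obtain one token per visual region.
import numpy as np
import torch
from skimage.segmentation import slic


def superpixel_tokenize(image: np.ndarray, num_superpixels: int = 196, dim: int = 768):
    """Turn an (H, W, 3) float image in [0, 1] into superpixel tokens.

    Each token is the mean of per-pixel features over one superpixel, so it
    covers a single, irregularly shaped region instead of a square patch.
    """
    # 1) Partition the image into superpixels (labels 0 .. K-1).
    labels = slic(image, n_segments=num_superpixels, start_label=0)
    labels_t = torch.from_numpy(labels).long().view(-1)            # (H*W,)
    num_tokens = int(labels_t.max().item()) + 1

    # 2) Per-pixel features: RGB plus normalized (x, y) coordinates, projected
    #    to the embedding dimension (a hypothetical pre-aggregate step).
    h, w, _ = image.shape
    ys, xs = torch.meshgrid(torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij")
    pix = torch.cat([torch.from_numpy(image).float().view(-1, 3),
                     xs.reshape(-1, 1), ys.reshape(-1, 1)], dim=1)  # (H*W, 5)
    feats = torch.nn.Linear(5, dim)(pix)                            # (H*W, dim)

    # 3) Aggregation: average the features of all pixels belonging to the same
    #    superpixel (scatter-mean), yielding one token per superpixel.
    tokens = torch.zeros(num_tokens, dim).index_add_(0, labels_t, feats)
    counts = torch.zeros(num_tokens).index_add_(0, labels_t,
                                                torch.ones_like(labels_t, dtype=torch.float))
    tokens = tokens / counts.clamp(min=1).unsqueeze(1)               # (num_tokens, dim)
    return tokens, labels


# Usage: the resulting tokens could replace the patch embeddings fed to a
# standard ViT encoder (token count varies with the superpixel segmentation).
# img = np.random.rand(224, 224, 3).astype(np.float32)
# tokens, seg = superpixel_tokenize(img)   # tokens: (~196, 768)
```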