Title: Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation

Jan 31, 2024

Rozhan Ahmadi, Shohreh Kasaei

Abstract: In recent years, weakly supervised semantic segmentation using image-level labels as supervision has received significant attention in the field of computer vision. Because these labels carry no spatial information, most existing methods generate pseudo-labels from class activation maps (CAMs) and use them to train a segmentation model in a supervised manner. Due to the localized pattern detection of Convolutional Neural Networks (CNNs), CAMs often emphasize only the most discriminative parts of an object, making it challenging to accurately distinguish foreground objects from each other and from the background. Recent studies have shown that Vision Transformer (ViT) features, due to their global view, are more effective than CNN features in capturing the scene layout. However, the use of hierarchical ViTs has not been extensively explored in this field. This work explores the use of the Swin Transformer by proposing "SWTformer", which enhances the accuracy of the initial seed CAMs by bringing local and global views together. SWTformer-V1 generates class probabilities and CAMs using only the patch tokens as features. SWTformer-V2 incorporates a multi-scale feature fusion mechanism to extract additional information and utilizes a background-aware mechanism to generate more accurate localization maps with improved cross-object discrimination. In experiments on the PASCAL VOC 2012 dataset, SWTformer-V1 achieves 0.98% mAP higher localization accuracy than state-of-the-art models. Depending only on the classification network, it also generates initial localization maps that are comparable to those of other methods, on average 0.82% mIoU higher. SWTformer-V2 further improves the accuracy of the generated seed CAMs by 5.32% mIoU, further demonstrating the effectiveness of the local-to-global view provided by the Swin Transformer.
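To make the idea of generating class probabilities and CAMs from patch tokens more concrete, here is a minimal NumPy sketch. It is not the SWTformer implementation: the abstract does not specify the classification head, so the per-class projection `weights`, the function names, and the fixed `bg_threshold` of 0.4 are all assumptions made for illustration. In particular, the fixed background threshold is a common baseline for turning CAMs into seed labels, not the background-aware mechanism of SWTformer-V2.

```python
import numpy as np


def cams_from_patch_tokens(tokens, weights, grid_hw):
    """Turn patch tokens into class logits and normalized CAMs.

    tokens:  (N, D) patch tokens from the last backbone stage.
    weights: (D, C) hypothetical per-class projection (assumed, not from the paper).
    grid_hw: (H, W) token grid with H * W == N.
    """
    h, w = grid_hw
    n, _ = tokens.shape
    assert n == h * w, "token count must match the spatial grid"

    # Project every token to per-class scores and restore the spatial layout.
    class_maps = (tokens @ weights).reshape(h, w, -1)  # (H, W, C)

    # Global average pooling over the grid gives image-level class logits.
    logits = class_maps.mean(axis=(0, 1))  # (C,)

    # CAMs: keep positive evidence only, then scale each class map to [0, 1].
    cams = np.maximum(class_maps, 0.0)
    peak = cams.max(axis=(0, 1), keepdims=True)
    cams = cams / np.maximum(peak, 1e-8)
    return logits, cams


def seed_labels(cams, present, bg_threshold=0.4):
    """Build a seed label map: 0 is background, c + 1 is foreground class c.

    present:      (C,) boolean image-level labels.
    bg_threshold: assumed fixed cut-off below which a location is background.
    """
    masked = cams * present[None, None, :]  # drop classes absent from the image
    best = masked.argmax(axis=2)
    score = masked.max(axis=2)
    return np.where(score >= bg_threshold, best + 1, 0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(49, 16))   # 7 x 7 grid of 16-dim tokens
    weights = rng.normal(size=(16, 3))   # 3 hypothetical classes
    logits, cams = cams_from_patch_tokens(tokens, weights, (7, 7))
    labels = seed_labels(cams, np.array([True, False, True]))
    print(logits.shape, cams.shape, labels.shape)
```

Because the logits are the spatial average of the same per-class maps that form the CAMs, training the classifier with image-level labels alone also shapes the localization maps, which is why the quality of the backbone features matters so much for the seeds.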

* 7 pages, 4 figures, 3 tables
