Title: Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models

Oct 27, 2022

Chaofan Ma, Yuhuan Yang, Yanfeng Wang, Ya Zhang, Weidi Xie


Abstract: When trained at a sufficient scale, self-supervised learning has exhibited a notable ability to solve a wide range of visual or language understanding tasks. In this paper, we investigate simple yet effective approaches for adapting pre-trained foundation models to the downstream task of interest, namely open-vocabulary semantic segmentation. To this end, we make the following contributions: (i) we introduce Fusioner, with a lightweight, transformer-based fusion module that pairs frozen visual representations with language concepts using a handful of image segmentation data. As a consequence, the model gains the capability of zero-shot transfer to segment novel categories; (ii) without loss of generality, we experiment on a broad range of self-supervised models that have been pre-trained with different schemes, e.g. visual-only models (MoCo v3, DINO), language-only models (BERT), and a visual-language model (CLIP), and show that the proposed fusion approach is effective for any pair of visual and language models, even those pre-trained on a corpus of uni-modal data; (iii) we conduct thorough ablation studies to analyze the critical components of the proposed Fusioner, and when evaluated on standard benchmarks, e.g. PASCAL-5^i and COCO-20^i, it surpasses existing state-of-the-art models by a large margin, despite only being trained on frozen visual and language features; (iv) to measure the model's robustness in learning visual-language correspondence, we further evaluate on a synthetic dataset, named Mosaic-4, where images are constructed by mosaicking samples from FSS-1000. Fusioner demonstrates superior performance over previous models.
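
To illustrate what "open-vocabulary" means in practice, here is a minimal sketch in plain Python. It is not the Fusioner module described in the abstract: it assumes per-pixel visual features and class text embeddings already live in a shared space, and labels each pixel by cosine similarity. The function names (`cosine`, `segment`) and the toy vectors are hypothetical, chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def segment(pixel_feats, class_embeds):
    """Label every pixel with the class whose embedding is most similar.

    pixel_feats: H x W grid (list of rows) of feature vectors.
    class_embeds: dict mapping a class name to its embedding vector.
    Assumption: both kinds of vectors are already in one shared space.
    """
    names = list(class_embeds)
    return [
        [max(names, key=lambda n: cosine(f, class_embeds[n])) for f in row]
        for row in pixel_feats
    ]


# Toy 2x2 "image" with 2-dimensional features (made-up numbers).
feats = [
    [[1.0, 0.1], [0.9, 0.2]],
    [[0.1, 1.0], [0.2, 0.8]],
]
# The vocabulary is supplied at inference time, so classes can be swapped freely.
classes = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}
mask = segment(feats, classes)
```

Because the class set is just a dictionary passed at inference time, a category unseen during training can be added by supplying its embedding. How well that works depends on how the visual and language features were aligned, which is the part the paper's fusion module addresses.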

* BMVC 2022 Oral
