Referring Image Segmentation (RIS) aims to segment the object described by a natural-language expression from an image, the main challenge being to establish text-to-pixel correlation. Previous methods typically rely on features from a single modality, either vision or language, to guide the multi-modal fusion process. However, this limits the interaction between vision and language, so the decoding process lacks fine-grained correlation between the language description and pixel-level details. In this paper, we introduce FCNet, a framework that employs a bi-directional guided fusion approach in which both vision and language play guiding roles. Specifically, we first perform vision-guided multi-modal fusion to obtain multi-modal features that focus on key visual information. We then propose a language-guided calibration module that further calibrates these multi-modal features so that they capture the context of the input sentence. This bi-directional vision-language guidance yields higher-quality multi-modal features for the decoder, facilitating adaptive propagation of fine-grained semantic information from textual to visual features. Experiments on the RefCOCO, RefCOCO+, and G-Ref datasets with various backbones consistently show that our approach outperforms state-of-the-art methods.
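To make the bi-directional guidance concrete, below is a minimal PyTorch sketch of the two stages described above: vision-guided fusion realized as standard cross-attention with visual queries, and language-guided calibration realized as a sentence-conditioned channel gate. All module names, tensor shapes, and design details here are illustrative assumptions for exposition, not FCNet's actual implementation.

```python
# A minimal sketch of bi-directional guided fusion, assuming standard
# multi-head cross-attention. Names and shapes are illustrative, not
# FCNet's exact architecture.
import torch
import torch.nn as nn


class VisionGuidedFusion(nn.Module):
    """Initial fusion: visual features act as queries over the word
    features, producing multi-modal features focused on key visual
    information."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # vis: (B, HW, C) flattened visual features; lang: (B, T, C) word features
        fused, _ = self.attn(query=vis, key=lang, value=lang)
        return self.norm(vis + fused)  # residual keeps the visual signal


class LanguageGuidedCalibration(nn.Module):
    """Calibration: a gate derived from the pooled sentence embedding
    re-weights the fused features channel-wise, injecting sentence-level
    context before decoding (one plausible form of 'calibration')."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, fused: torch.Tensor, sent: torch.Tensor) -> torch.Tensor:
        # fused: (B, HW, C) multi-modal features; sent: (B, C) sentence embedding
        g = self.gate(sent).unsqueeze(1)       # (B, 1, C), broadcast over positions
        return self.norm(fused + fused * g)    # residual, sentence-gated correction


# Usage with dummy tensors (shapes are assumptions):
B, HW, T, C = 2, 196, 12, 256
vis, lang = torch.randn(B, HW, C), torch.randn(B, T, C)
fused = VisionGuidedFusion(C)(vis, lang)
calibrated = LanguageGuidedCalibration(C)(fused, lang.mean(dim=1))
```

The ordering mirrors the abstract: cross-attention first grounds language in the visual features, and the sentence-level gate then re-weights the result so the features passed to the decoder remain consistent with the full expression.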