Yajie Liu

Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation

Apr 13, 2025

Yongchao Feng, Yajie Liu, Shuai Yang, Wenrui Cai, Jinqing Zhang, Qiqi Zhan, Ziyue Huang, Hongxi Yan, Qiao Wan, Chenguang Liu(+6 more)

Abstract: Vision-Language Models (VLMs) have gained widespread adoption in Open-Vocabulary (OV) object detection and segmentation tasks. Although they have shown promise on OV-related tasks, their effectiveness in conventional vision tasks has thus far been unevaluated. In this work, we present a systematic review of VLM-based detection and segmentation, view VLMs as foundational models, and conduct comprehensive evaluations across multiple downstream tasks for the first time: 1) The evaluation spans eight detection scenarios (closed-set detection, domain adaptation, crowded objects, etc.) and eight segmentation scenarios (few-shot, open-world, small object, etc.), revealing distinct performance advantages and limitations of various VLM architectures across tasks. 2) For detection tasks, we evaluate VLMs under three finetuning granularities: zero prediction, visual fine-tuning, and text prompt, and further analyze how different finetuning strategies impact performance across tasks. 3) Based on empirical findings, we provide in-depth analysis of the correlations between task characteristics, model architectures, and training methodologies, offering insights for future VLM design. 4) We believe that this work will be valuable to pattern recognition experts working in the fields of computer vision, multimodal learning, and vision foundation models by introducing them to the problem and familiarizing them with the current state of progress, while providing promising directions for future research. A project associated with this review and evaluation has been created at https://github.com/better-chao/perceptual_abilities_evaluation.

* A Review and Evaluation about Vision-Language Model for Object Detection and Segmentation

Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision

Mar 06, 2024

Yajie Liu, Pu Ge, Qingjie Liu, Di Huang

Abstract: Recently, learning open-vocabulary semantic segmentation from text supervision has achieved promising downstream performance. Nevertheless, current approaches encounter an alignment granularity gap owing to the absence of dense annotations, wherein they learn coarse image/region-text alignment during training yet perform group/pixel-level predictions at inference. Such discrepancy leads to suboptimal learning efficiency and inferior zero-shot segmentation results. In this paper, we introduce a Multi-Grained Cross-modal Alignment (MGCA) framework, which explicitly learns pixel-level alignment along with object- and region-level alignment to bridge the granularity gap without any dense annotations. Specifically, MGCA ingeniously constructs pseudo multi-granular semantic correspondences upon image-text pairs and collaborates with hard sampling strategies to facilitate fine-grained cross-modal contrastive learning. Further, we point out the defects of existing group and pixel prediction units in downstream segmentation and develop an adaptive semantic unit which effectively mitigates their dilemmas including under- and over-segmentation. Training solely on CC3M, our method achieves significant advancements over state-of-the-art methods, demonstrating its effectiveness and efficiency.

* 17 pages, 8 figures

Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence

Dec 01, 2023

Yajie Liu, Pu Ge, Haoxiang Ma, Shichao Fan, Qingjie Liu, Di Huang, Yunhong Wang

Abstract: Referring image segmentation (RIS) aims to segment objects in an image conditioned on free-form text descriptions. Despite substantial progress, it remains challenging for current approaches to perform well on cases with varied text expressions or unseen visual entities, which limits further application. In this paper, we present a novel RIS approach that substantially improves generalization ability by addressing the two dilemmas mentioned above. Specifically, to deal with unconstrained texts, we propose to boost a given expression with an explicit and crucial prompt, which complements the expression in a unified context, facilitating target capturing in the presence of linguistic style changes. Furthermore, we introduce a multi-modal fusion aggregation module with visual guidance from a powerful pretrained model to leverage spatial relations and pixel coherences to handle the incomplete target masks and false positive irregular clumps that often appear on unseen visual entities. Extensive experiments are conducted in zero-shot cross-dataset settings, and the proposed approach achieves consistent gains compared to the state-of-the-art, e.g., 4.15%, 5.45%, and 4.64% mIoU increase on RefCOCO, RefCOCO+ and ReferIt respectively, demonstrating its effectiveness. Additionally, the results on GraspNet-RIS show that our approach also generalizes well to new scenarios with large domain shifts.

* 7 pages, 4 figures

Exploring the Interactions between Target Positive and Negative Information for Acoustic Echo Cancellation

Jul 26, 2023

Chang Han, Xinmeng Xu, Weiping Tu, Yuhong Yang, Yajie Liu

Abstract: Acoustic echo cancellation (AEC) aims to remove interference signals while leaving near-end speech minimally distorted. Because the patterns of near-end speech and interference signals are hard to distinguish, near-end speech cannot be separated completely, causing speech distortion and residual interference. We observe that besides target positive information, e.g., ground-truth speech and features, target negative information, such as interference signals and features, helps make the patterns of target speech and interference signals more discriminative. Therefore, we present a novel AEC model with an encoder-decoder architecture guided by negative information, termed CMNet. A collaboration module (CM) is designed to establish the correlation between the target positive and negative information in a learnable manner via three blocks: a target positive block, a target negative block, and an interactive block. Experimental results demonstrate that CMNet achieves superior performance to recent methods.

* Accepted at INTERSPEECH 2023

An Empirical Study on Multi-Domain Robust Semantic Segmentation

Dec 08, 2022

Yajie Liu, Pu Ge, Qingjie Liu, Shichao Fan, Yunhong Wang

Abstract: How to effectively leverage the plentiful existing datasets to train a robust and high-performance model is of great significance for many practical applications. However, a model trained on a naive merge of different datasets tends to perform poorly due to annotation conflicts and domain divergence. In this paper, we attempt to train a unified model that is expected to perform well across domains on several popular segmentation datasets. We conduct a detailed analysis of the impact on model generalization from three aspects: data augmentation, training strategies, and model capacity. Based on the analysis, we propose a robust solution that improves model generalization across domains. Our solution ranks 2nd on the RVC 2022 semantic segmentation task, with a dataset only 1/3 the size of that used by the 1st-place model.

Visual Boundary Knowledge Translation for Foreground Segmentation

Aug 01, 2021

Zunlei Feng, Lechao Cheng, Xinchao Wang, Xiang Wang, Yajie Liu, Xiangtong Du, Mingli Song

Abstract: When confronted with objects of unknown types in an image, humans can effortlessly and precisely tell their visual boundaries. This recognition mechanism and its underlying generalization capability stand in contrast to state-of-the-art image segmentation networks, which rely on large-scale category-aware annotated training samples. In this paper, we make an attempt towards building models that explicitly account for visual boundary knowledge, in the hope of reducing the training effort for segmenting unseen categories. Specifically, we investigate a new task termed Boundary Knowledge Translation (BKT). Given a set of fully labeled categories, BKT aims to translate the visual boundary knowledge learned from the labeled categories to a set of novel categories, each of which is provided with only a few labeled samples. To this end, we propose a Translation Segmentation Network (Trans-Net), which comprises a segmentation network and two boundary discriminators. The segmentation network, combined with a boundary-aware self-supervised mechanism, is devised to conduct foreground segmentation, while the two discriminators work together in an adversarial manner to ensure an accurate segmentation of the novel categories under light supervision. Exhaustive experiments demonstrate that, with only tens of labeled samples as guidance, Trans-Net achieves results on par with fully supervised methods.

* Accepted by AAAI 2021
