Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xuanyi Liu

Unlocking the Power of SAM 2 for Few-Shot Segmentation

May 21, 2025

Qianxiong Xu, Lanyun Zhu, Xuanyi Liu, Guosheng Lin, Cheng Long, Ziyue Li, Rui Zhao

Abstract:Few-Shot Segmentation (FSS) aims to learn class-agnostic segmentation on few classes to segment arbitrary classes, but at the risk of overfitting. To address this, some methods use the well-learned knowledge of foundation models (e.g., SAM) to simplify the learning process. Recently, SAM 2 has extended SAM by supporting video segmentation, whose class-agnostic matching ability is useful to FSS. A simple idea is to encode support foreground (FG) features as memory, with which query FG features are matched and fused. Unfortunately, the FG objects in different frames of SAM 2's video data are always the same identity, while those in FSS are different identities, i.e., the matching step is incompatible. Therefore, we design Pseudo Prompt Generator to encode pseudo query memory, matching with query features in a compatible way. However, the memories can never be as accurate as the real ones, i.e., they are likely to contain incomplete query FG, and some unexpected query background (BG) features, leading to wrong segmentation. Hence, we further design Iterative Memory Refinement to fuse more query FG features into the memory, and devise a Support-Calibrated Memory Attention to suppress the unexpected query BG features in memory. Extensive experiments have been conducted on PASCAL-5$^i$ and COCO-20$^i$ to validate the effectiveness of our design, e.g., the 1-shot mIoU can be 4.2% better than the best baseline.

* This paper is accepted by ICML'25

Via

Access Paper or Ask Questions

POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation

Apr 01, 2025

Lanyun Zhu, Tianrun Chen, Qianxiong Xu, Xuanyi Liu, Deyi Ji, Haiyang Wu, De Wen Soh, Jun Liu

Abstract:Existing LVLM-based reasoning segmentation methods often suffer from imprecise segmentation results and hallucinations in their text responses. This paper introduces POPEN, a novel framework designed to address these issues and achieve improved results. POPEN includes a preference-based optimization method to finetune the LVLM, aligning it more closely with human preferences and thereby generating better text responses and segmentation results. Additionally, POPEN introduces a preference-based ensemble method for inference, which integrates multiple outputs from the LVLM using a preference-score-based attention mechanism for refinement. To better adapt to the segmentation task, we incorporate several task-specific designs in our POPEN framework, including a new approach for collecting segmentation preference data with a curriculum learning mechanism, and a novel preference optimization loss to refine the segmentation capability of the LVLM. Experiments demonstrate that our method achieves state-of-the-art performance in reasoning segmentation, exhibiting minimal hallucination in text responses and the highest segmentation accuracy compared to previous advanced methods like LISA and PixelLM. Project page is https://lanyunzhu.site/POPEN/

* CVPR2025

Via

Access Paper or Ask Questions

X as Supervision: Contending with Depth Ambiguity in Unsupervised Monocular 3D Pose Estimation

Nov 20, 2024

Yuchen Yang, Xuanyi Liu, Xing Gao, Zhihang Zhong, Xiao Sun

Abstract:Recent unsupervised methods for monocular 3D pose estimation have endeavored to reduce dependence on limited annotated 3D data, but most are solely formulated in 2D space, overlooking the inherent depth ambiguity issue. Due to the information loss in 3D-to-2D projection, multiple potential depths may exist, yet only some of them are plausible in human structure. To tackle depth ambiguity, we propose a novel unsupervised framework featuring a multi-hypothesis detector and multiple tailored pretext tasks. The detector extracts multiple hypotheses from a heatmap within a local window, effectively managing the multi-solution problem. Furthermore, the pretext tasks harness 3D human priors from the SMPL model to regularize the solution space of pose estimation, aligning it with the empirical distribution of 3D human structures. This regularization is partially achieved through a GCN-based discriminator within the discriminative learning, and is further complemented with synthetic images through rendering, ensuring plausible estimations. Consequently, our approach demonstrates state-of-the-art unsupervised 3D pose estimation performance on various human datasets. Further evaluations on data scale-up and one animal dataset highlight its generalization capabilities. Code will be available at https://github.com/Charrrrrlie/X-as-Supervision.

Via

Access Paper or Ask Questions

Hybrid Mamba for Few-Shot Segmentation

Sep 29, 2024

Qianxiong Xu, Xuanyi Liu, Lanyun Zhu, Guosheng Lin, Cheng Long, Ziyue Li, Rui Zhao

Figure 1 for Hybrid Mamba for Few-Shot Segmentation

Figure 2 for Hybrid Mamba for Few-Shot Segmentation

Figure 3 for Hybrid Mamba for Few-Shot Segmentation

Figure 4 for Hybrid Mamba for Few-Shot Segmentation

Abstract:Many few-shot segmentation (FSS) methods use cross attention to fuse support foreground (FG) into query features, regardless of the quadratic complexity. A recent advance Mamba can also well capture intra-sequence dependencies, yet the complexity is only linear. Hence, we aim to devise a cross (attention-like) Mamba to capture inter-sequence dependencies for FSS. A simple idea is to scan on support features to selectively compress them into the hidden state, which is then used as the initial hidden state to sequentially scan query features. Nevertheless, it suffers from (1) support forgetting issue: query features will also gradually be compressed when scanning on them, so the support features in hidden state keep reducing, and many query pixels cannot fuse sufficient support features; (2) intra-class gap issue: query FG is essentially more similar to itself rather than to support FG, i.e., query may prefer not to fuse support features but their own ones from the hidden state, yet the success of FSS relies on the effective use of support information. To tackle them, we design a hybrid Mamba network (HMNet), including (1) a support recapped Mamba to periodically recap the support features when scanning query, so the hidden state can always contain rich support information; (2) a query intercepted Mamba to forbid the mutual interactions among query pixels, and encourage them to fuse more support features from the hidden state. Consequently, the support information is better utilized, leading to better performance. Extensive experiments have been conducted on two public benchmarks, showing the superiority of HMNet. The code is available at https://github.com/Sam1224/HMNet.

* This paper is accepted by NIPS'24

Via

Access Paper or Ask Questions

Proposal-Level Unsupervised Domain Adaptation for Open World Unbiased Detector

Nov 04, 2023

Xuanyi Liu, Zhongqi Yue, Xian-Sheng Hua

Figure 1 for Proposal-Level Unsupervised Domain Adaptation for Open World Unbiased Detector

Figure 2 for Proposal-Level Unsupervised Domain Adaptation for Open World Unbiased Detector

Figure 3 for Proposal-Level Unsupervised Domain Adaptation for Open World Unbiased Detector

Figure 4 for Proposal-Level Unsupervised Domain Adaptation for Open World Unbiased Detector

Abstract:Open World Object Detection (OWOD) combines open-set object detection with incremental learning capabilities to handle the challenge of the open and dynamic visual world. Existing works assume that a foreground predictor trained on the seen categories can be directly transferred to identify the unseen categories' locations by selecting the top-k most confident foreground predictions. However, the assumption is hardly valid in practice. This is because the predictor is inevitably biased to the known categories, and fails under the shift in the appearance of the unseen categories. In this work, we aim to build an unbiased foreground predictor by re-formulating the task under Unsupervised Domain Adaptation, where the current biased predictor helps form the domains: the seen object locations and confident background locations as the source domain, and the rest ambiguous ones as the target domain. Then, we adopt the simple and effective self-training method to learn a predictor based on the domain-invariant foreground features, hence achieving unbiased prediction robust to the shift in appearance between the seen and unseen categories. Our approach's pipeline can adapt to various detection frameworks and UDA methods, empirically validated by OWOD evaluation, where we achieve state-of-the-art performance.

Via

Access Paper or Ask Questions

Ultrafast Parallel LiDAR with Time-encoding and Spectral Scanning: Breaking the Time-of-flight Limit

Mar 09, 2021

Zihan Zang, Zhi Li, Yi Luo, Yanjun Han, Xuanyi Liu, H. Y. Fu

Figure 1 for Ultrafast Parallel LiDAR with Time-encoding and Spectral Scanning: Breaking the Time-of-flight Limit

Figure 2 for Ultrafast Parallel LiDAR with Time-encoding and Spectral Scanning: Breaking the Time-of-flight Limit

Figure 3 for Ultrafast Parallel LiDAR with Time-encoding and Spectral Scanning: Breaking the Time-of-flight Limit

Figure 4 for Ultrafast Parallel LiDAR with Time-encoding and Spectral Scanning: Breaking the Time-of-flight Limit

Abstract:Light detection and ranging (LiDAR) has been widely used in autonomous driving and large-scale manufacturing. Although state-of-the-art scanning LiDAR can perform long-range three-dimensional imaging, the frame rate is limited by both round-trip delay and the beam steering speed, hindering the development of high-speed autonomous vehicles. For hundred-meter level ranging applications, a several-time speedup is highly desirable. Here, we uniquely combine fiber-based encoders with wavelength-division multiplexing devices to implement all-optical time-encoding on the illumination light. Using this method, parallel detection and fast inertia-free spectral scanning can be achieved simultaneously with single-pixel detection. As a result, the frame rate of a scanning LiDAR can be multiplied with scalability. We demonstrate a 4.4-fold speedup for a maximum 75-m detection range, compared with a time-of-flight-limited laser ranging system. This approach has the potential to improve the velocity of LiDAR-based autonomous vehicles to the regime of hundred kilometers per hour and open up a new paradigm for ultrafast-frame-rate LiDAR imaging.

Via

Access Paper or Ask Questions

Distance Guided Channel Weighting for Semantic Segmentation

May 05, 2020

Lanyun Zhu, Shiping Zhu, Xuanyi Liu, Li Luo

Figure 1 for Distance Guided Channel Weighting for Semantic Segmentation

Figure 2 for Distance Guided Channel Weighting for Semantic Segmentation

Figure 3 for Distance Guided Channel Weighting for Semantic Segmentation

Figure 4 for Distance Guided Channel Weighting for Semantic Segmentation

Abstract:Recent works have achieved great success in improving the performance of multiple computer vision tasks by capturing features with a high channel number utilizing deep neural networks. However, many channels of extracted features are not discriminative and contain a lot of redundant information. In this paper, we address above issue by introducing the Distance Guided Channel Weighting (DGCW) Module. The DGCW module is constructed in a pixel-wise context extraction manner, which enhances the discriminativeness of features by weighting different channels of each pixel's feature vector when modeling its relationship with other pixels. It can make full use of the high-discriminative information while ignore the low-discriminative information containing in feature maps, as well as capture the long-range dependencies. Furthermore, by incorporating the DGCW module with a baseline segmentation network, we propose the Distance Guided Channel Weighting Network (DGCWNet). We conduct extensive experiments to demonstrate the effectiveness of DGCWNet. In particular, it achieves 81.6% mIoU on Cityscapes with only fine annotated data for training, and also gains satisfactory performance on another two semantic segmentation datasets, i.e. Pascal Context and ADE20K. Code will be available soon at https://github.com/LanyunZhu/DGCWNet.

Via

Access Paper or Ask Questions