Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Haiyun Guo

UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval

Aug 06, 2025

Hongyu Guo, Kuan Zhu, Xiangzhao Hao, Haiyun Guo, Ming Tang, Jinqiao Wang

Abstract:Few-shot fine-grained visual classification (FGVC) aims to leverage limited data to enable models to discriminate subtly distinct categories. Recent works mostly finetuned the pre-trained visual language models to achieve performance gain, yet suffering from overfitting and weak generalization. To deal with this, we introduce UniFGVC, a universal training-free framework that reformulates few-shot FGVC as multimodal retrieval. First, we propose the Category-Discriminative Visual Captioner (CDV-Captioner) to exploit the open-world knowledge of multimodal large language models (MLLMs) to generate a structured text description that captures the fine-grained attribute features distinguishing closely related classes. CDV-Captioner uses chain-of-thought prompting and visually similar reference images to reduce hallucination and enhance discrimination of generated captions. Using it we can convert each image into an image-description pair, enabling more comprehensive feature representation, and construct the multimodal category templates using few-shot samples for the subsequent retrieval pipeline. Then, off-the-shelf vision and text encoders embed query and template pairs, and FGVC is accomplished by retrieving the nearest template in the joint space. UniFGVC ensures broad compatibility with diverse MLLMs and encoders, offering reliable generalization and adaptability across few-shot FGVC scenarios. Extensive experiments on 12 FGVC benchmarks demonstrate its consistent superiority over prior few-shot CLIP-based methods and even several fully-supervised MLLMs-based approaches.

Via

Access Paper or Ask Questions

Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation

May 22, 2025

Zhenglin Hua, Jinghan He, Zijun Yao, Tianxu Han, Haiyun Guo, Yuheng Jia, Junfeng Fang

Abstract:Large vision-language models (LVLMs) have achieved remarkable performance on multimodal tasks such as visual question answering (VQA) and image captioning. However, they still suffer from hallucinations, generating text inconsistent with visual input, posing significant risks in real-world applications. Existing approaches to address this issue focus on incorporating external knowledge bases, alignment training, or decoding strategies, all of which require substantial computational cost and time. Recent works try to explore more efficient alternatives by adjusting LVLMs' internal representations. Although promising, these methods may cause hallucinations to be insufficiently suppressed or lead to excessive interventions that negatively affect normal semantics. In this work, we leverage sparse autoencoders (SAEs) to identify semantic directions closely associated with either hallucinations or actuality, realizing more precise and direct hallucination-related representations. Our analysis demonstrates that interventions along the faithful direction we identified can mitigate hallucinations, while those along the hallucinatory direction can exacerbate them. Building on these insights, we propose Steering LVLMs via SAE Latent Directions (SSL), a training-free method based on SAE-derived latent directions to mitigate hallucinations in LVLMs. Extensive experiments demonstrate that SSL significantly outperforms existing decoding approaches in mitigating hallucinations, while maintaining transferability across different model architectures with negligible additional time overhead.

Via

Access Paper or Ask Questions

PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability

Mar 13, 2025

Weijie Zhou, Manli Tao, Chaoyang Zhao, Haiyun Guo, Honghui Dong, Ming Tang, Jinqiao Wang

Abstract:Understanding the environment and a robot's physical reachability is crucial for task execution. While state-of-the-art vision-language models (VLMs) excel in environmental perception, they often generate inaccurate or impractical responses in embodied visual reasoning tasks due to a lack of understanding of robotic physical reachability. To address this issue, we propose a unified representation of physical reachability across diverse robots, i.e., Space-Physical Reachability Map (S-P Map), and PhysVLM, a vision-language model that integrates this reachability information into visual reasoning. Specifically, the S-P Map abstracts a robot's physical reachability into a generalized spatial representation, independent of specific robot configurations, allowing the model to focus on reachability features rather than robot-specific parameters. Subsequently, PhysVLM extends traditional VLM architectures by incorporating an additional feature encoder to process the S-P Map, enabling the model to reason about physical reachability without compromising its general vision-language capabilities. To train and evaluate PhysVLM, we constructed a large-scale multi-robot dataset, Phys100K, and a challenging benchmark, EQA-phys, which includes tasks for six different robots in both simulated and real-world environments. Experimental results demonstrate that PhysVLM outperforms existing models, achieving a 14\% improvement over GPT-4o on EQA-phys and surpassing advanced embodied VLMs such as RoboMamba and SpatialVLM on the RoboVQA-val and OpenEQA benchmarks. Additionally, the S-P Map shows strong compatibility with various VLMs, and its integration into GPT-4o-mini yields a 7.1\% performance improvement.

Via

Access Paper or Ask Questions

Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence

Dec 18, 2024

Jinghan He, Kuan Zhu, Haiyun Guo, Junfeng Fang, Zhenglin Hua, Yuheng Jia, Ming Tang, Tat-Seng Chua, Jinqiao Wang

Figure 1 for Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence

Figure 2 for Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence

Figure 3 for Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence

Figure 4 for Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence

Abstract:Large vision-language models (LVLMs) have made substantial progress in integrating large language models (LLMs) with visual inputs, enabling advanced multimodal reasoning. Despite their success, a persistent challenge is hallucination-where generated text fails to accurately reflect visual content-undermining both accuracy and reliability. Existing methods focus on alignment training or decoding refinements but primarily address symptoms at the generation stage without probing the underlying causes. In this work, we investigate the internal mechanisms driving hallucination in LVLMs, with an emphasis on the multi-head attention module. Specifically, we introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context. Based on this, our findings reveal the presence of vision-aware attention heads that are more attuned to visual information; however, the model's overreliance on its prior language patterns is closely related to hallucinations. Building on these insights, we propose Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate hallucination by enhancing the role of vision-aware attention heads. Extensive experiments demonstrate that our method achieves superior performance compared to state-of-the-art approaches in mitigating hallucinations, while maintaining high efficiency with negligible additional time overhead.

Via

Access Paper or Ask Questions

Monocular Lane Detection Based on Deep Learning: A Survey

Nov 26, 2024

Xin He, Haiyun Guo, Kuan Zhu, Bingke Zhu, Xu Zhao, Jianwu Fang, Jinqiao Wang

Figure 1 for Monocular Lane Detection Based on Deep Learning: A Survey

Figure 2 for Monocular Lane Detection Based on Deep Learning: A Survey

Figure 3 for Monocular Lane Detection Based on Deep Learning: A Survey

Figure 4 for Monocular Lane Detection Based on Deep Learning: A Survey

Abstract:Lane detection plays an important role in autonomous driving perception systems. As deep learning algorithms gain popularity, monocular lane detection methods based on deep learning have demonstrated superior performance and emerged as a key research direction in autonomous driving perception. The core design of these algorithmic frameworks can be summarized as follows: (1) Task paradigm, focusing on lane instance-level discrimination; (2) Lane modeling, representing lanes as a set of learnable parameters in the neural network; (3) Global context supplementation, enhancing the detection of obscure lanes; (4) Perspective effect elimination, providing 3D lanes usable for downstream applications. From these perspectives, this paper presents a comprehensive overview of existing methods, encompassing both the increasingly mature 2D lane detection approaches and the developing 3D lane detection works. For a relatively fair comparison, in addition to comparing the performance of mainstream methods on different benchmarks, their inference speed is also investigated under a unified setting. Moreover, we present some extended works on lane detection, including multi-task perception, video lane detection, online high-definition map construction, and lane topology reasoning, to offer readers a comprehensive roadmap for the evolution of lane detection. Finally, we point out some potential future research directions in this field. We exhaustively collect the papers and codes of existing works at https://github.com/Core9724/Awesome-Lane-Detection and will keep tracing the research.

Via

Access Paper or Ask Questions

SEEKR: Selective Attention-Guided Knowledge Retention for Continual Learning of Large Language Models

Nov 09, 2024

Jinghan He, Haiyun Guo, Kuan Zhu, Zihan Zhao, Ming Tang, Jinqiao Wang

Abstract:Continual learning (CL) is crucial for language models to dynamically adapt to the evolving real-world demands. To mitigate the catastrophic forgetting problem in CL, data replay has been proven a simple and effective strategy, and the subsequent data-replay-based distillation can further enhance the performance. However, existing methods fail to fully exploit the knowledge embedded in models from previous tasks, resulting in the need for a relatively large number of replay samples to achieve good results. In this work, we first explore and emphasize the importance of attention weights in knowledge retention, and then propose a SElective attEntion-guided Knowledge Retention method (SEEKR) for data-efficient replay-based continual learning of large language models (LLMs). Specifically, SEEKR performs attention distillation on the selected attention heads for finer-grained knowledge retention, where the proposed forgettability-based and task-sensitivity-based measures are used to identify the most valuable attention heads. Experimental results on two continual learning benchmarks for LLMs demonstrate the superiority of SEEKR over the existing methods on both performance and efficiency. Explicitly, SEEKR achieves comparable or even better performance with only 1/10 of the replayed data used by other methods, and reduces the proportion of replayed data to 1%.

* EMNLP2024

Via

Access Paper or Ask Questions

WaveMo: Learning Wavefront Modulations to See Through Scattering

Apr 11, 2024

Mingyang Xie, Haiyun Guo, Brandon Y. Feng, Lingbo Jin, Ashok Veeraraghavan, Christopher A. Metzler

Figure 1 for WaveMo: Learning Wavefront Modulations to See Through Scattering

Figure 2 for WaveMo: Learning Wavefront Modulations to See Through Scattering

Figure 3 for WaveMo: Learning Wavefront Modulations to See Through Scattering

Figure 4 for WaveMo: Learning Wavefront Modulations to See Through Scattering

Abstract:Imaging through scattering media is a fundamental and pervasive challenge in fields ranging from medical diagnostics to astronomy. A promising strategy to overcome this challenge is wavefront modulation, which induces measurement diversity during image acquisition. Despite its importance, designing optimal wavefront modulations to image through scattering remains under-explored. This paper introduces a novel learning-based framework to address the gap. Our approach jointly optimizes wavefront modulations and a computationally lightweight feedforward "proxy" reconstruction network. This network is trained to recover scenes obscured by scattering, using measurements that are modified by these modulations. The learned modulations produced by our framework generalize effectively to unseen scattering scenarios and exhibit remarkable versatility. During deployment, the learned modulations can be decoupled from the proxy network to augment other more computationally expensive restoration algorithms. Through extensive experiments, we demonstrate our approach significantly advances the state of the art in imaging through scattering media. Our project webpage is at https://wavemo-2024.github.io/.

Via

Access Paper or Ask Questions

Continual Instruction Tuning for Large Multimodal Models

Nov 27, 2023

Jinghan He, Haiyun Guo, Ming Tang, Jinqiao Wang

Figure 1 for Continual Instruction Tuning for Large Multimodal Models

Figure 2 for Continual Instruction Tuning for Large Multimodal Models

Figure 3 for Continual Instruction Tuning for Large Multimodal Models

Figure 4 for Continual Instruction Tuning for Large Multimodal Models

Abstract:Instruction tuning is now a widely adopted approach to aligning large multimodal models (LMMs) to follow human intent. It unifies the data format of vision-language tasks, enabling multi-task joint training. However, vision-language tasks are constantly being created in practice. Instead of always re-training LMMs when new tasks arrive, continual learning offers flexibility for models to continually and efficiently exploit the evolving data. This work aims to explore the following two questions: 1) Do LMMs still suffer from catastrophic forgetting in continual instruction tuning? 2) Are the existing three classes of continual learning methods still applicable to the continual instruction tuning of LMMs? An extensive study is conducted to address the above questions. First, we establish the first benchmark in this setting and reveal that catastrophic forgetting is still observed when continually instruction-tuning LMMs. However, the multi-task joint instruction tuning can facilitate the model's continual learning ability and mitigate forgetting. Second, we integrate and adapt classic continual learning methods to our context, demonstrating the efficacy of data replay and model expansion strategies across diverse scenarios. In contrast, regularization-based methods only perform well on models that have been jointly instruction-tuned on multiple tasks. Third, we delve into the correlation and forgetting dynamics between vision-language task pairs and propose task-similarity-informed regularization and model expansion methods for continual instruction tuning of LMMs. Experimental results show that our approach consistently boosts the model's performance.

Via

Access Paper or Ask Questions

FPM-INR: Fourier ptychographic microscopy image stack reconstruction using implicit neural representations

Oct 31, 2023

Haowen Zhou, Brandon Y. Feng, Haiyun Guo, Siyu Lin, Mingshu Liang, Christopher A. Metzler, Changhuei Yang

Figure 1 for FPM-INR: Fourier ptychographic microscopy image stack reconstruction using implicit neural representations

Figure 2 for FPM-INR: Fourier ptychographic microscopy image stack reconstruction using implicit neural representations

Figure 3 for FPM-INR: Fourier ptychographic microscopy image stack reconstruction using implicit neural representations

Figure 4 for FPM-INR: Fourier ptychographic microscopy image stack reconstruction using implicit neural representations

Abstract:Image stacks provide invaluable 3D information in various biological and pathological imaging applications. Fourier ptychographic microscopy (FPM) enables reconstructing high-resolution, wide field-of-view image stacks without z-stack scanning, thus significantly accelerating image acquisition. However, existing FPM methods take tens of minutes to reconstruct and gigabytes of memory to store a high-resolution volumetric scene, impeding fast gigapixel-scale remote digital pathology. While deep learning approaches have been explored to address this challenge, existing methods poorly generalize to novel datasets and can produce unreliable hallucinations. This work presents FPM-INR, a compact and efficient framework that integrates physics-based optical models with implicit neural representations (INR) to represent and reconstruct FPM image stacks. FPM-INR is agnostic to system design or sample types and does not require external training data. In our demonstrated experiments, FPM-INR substantially outperforms traditional FPM algorithms with up to a 25-fold increase in speed and an 80-fold reduction in memory usage for continuous image stack representations.

* Project Page: https://hwzhou2020.github.io/FPM-INR-Web/

Via

Access Paper or Ask Questions

ZBS: Zero-shot Background Subtraction via Instance-level Background Modeling and Foreground Selection

Mar 26, 2023

Yongqi An, Xu Zhao, Tao Yu, Haiyun Guo, Chaoyang Zhao, Ming Tang, Jinqiao Wang

Figure 1 for ZBS: Zero-shot Background Subtraction via Instance-level Background Modeling and Foreground Selection

Figure 2 for ZBS: Zero-shot Background Subtraction via Instance-level Background Modeling and Foreground Selection

Figure 3 for ZBS: Zero-shot Background Subtraction via Instance-level Background Modeling and Foreground Selection

Figure 4 for ZBS: Zero-shot Background Subtraction via Instance-level Background Modeling and Foreground Selection

Abstract:Background subtraction (BGS) aims to extract all moving objects in the video frames to obtain binary foreground segmentation masks. Deep learning has been widely used in this field. Compared with supervised-based BGS methods, unsupervised methods have better generalization. However, previous unsupervised deep learning BGS algorithms perform poorly in sophisticated scenarios such as shadows or night lights, and they cannot detect objects outside the pre-defined categories. In this work, we propose an unsupervised BGS algorithm based on zero-shot object detection called Zero-shot Background Subtraction (ZBS). The proposed method fully utilizes the advantages of zero-shot object detection to build the open-vocabulary instance-level background model. Based on it, the foreground can be effectively extracted by comparing the detection results of new frames with the background model. ZBS performs well for sophisticated scenarios, and it has rich and extensible categories. Furthermore, our method can easily generalize to other tasks, such as abandoned object detection in unseen environments. We experimentally show that ZBS surpasses state-of-the-art unsupervised BGS methods by 4.70% F-Measure on the CDnet 2014 dataset. The code is released at https://github.com/CASIA-IVA-Lab/ZBS.

* Accepted by CVPR 2023. Code is available at https://github.com/CASIA-IVA-Lab/ZBS

Via

Access Paper or Ask Questions