Abstract: The integration of vision-language modalities has long been a central focus of multimodal learning, traditionally relying on Vision-Language Pretrained Models. With the advent of Large Language Models (LLMs), however, there has been a notable shift towards integrating vision modalities directly with LLMs, and the training paradigms for doing so have evolved accordingly. Initially, models integrated the modalities by pretraining a modality integrator, an approach termed Single-stage Tuning. This has since branched into methods focusing on performance enhancement, denoted Two-stage Tuning, and methods prioritizing parameter efficiency, referred to as Direct Adaptation. However, existing surveys primarily address the latest Vision Large Language Models (VLLMs) trained with Two-stage Tuning, leaving a gap in understanding the evolution of training paradigms and their distinct parameter-efficiency considerations. This paper categorizes and reviews 34 VLLMs from top conferences, journals, and highly cited arXiv papers, focusing on parameter efficiency during adaptation from the training-paradigm perspective. We first introduce the architecture of LLMs and parameter-efficient learning methods, followed by a discussion of vision encoders and a comprehensive taxonomy of modality integrators. We then review the three training paradigms and their efficiency considerations, and summarize benchmarks in the VLLM field. To gain deeper insight into their parameter efficiency, we compare and discuss the experimental results of representative models, and we replicate the experiments of the Direct Adaptation paradigm. By providing insights into recent developments and practical applications, this survey serves as a guide for researchers and practitioners navigating the efficient integration of vision modalities into LLMs.
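To make the parameter-efficiency distinction concrete, below is a minimal sketch of the Direct Adaptation idea: the vision encoder and the LLM are both frozen, and only a lightweight modality integrator (here a simple linear projector, one of several integrator designs covered by such taxonomies) receives gradient updates. The module names, placeholders, and dimensions are illustrative assumptions, not taken from any specific surveyed model.

```python
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """Minimal modality integrator: maps vision features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        return self.proj(vision_feats)

def freeze(module: nn.Module) -> None:
    for p in module.parameters():
        p.requires_grad = False

# Placeholders for frozen backbones; in practice these would be, e.g.,
# a pretrained CLIP-style vision encoder and a pretrained LLM.
vision_encoder = nn.Identity()
llm = nn.Identity()
freeze(vision_encoder)
freeze(llm)

integrator = LinearProjector(vision_dim=1024, llm_dim=4096)
# Only the integrator's parameters are optimized -- the essence of Direct Adaptation.
optimizer = torch.optim.AdamW(integrator.parameters(), lr=1e-4)

patches = vision_encoder(torch.randn(2, 196, 1024))  # dummy image features
tokens = integrator(patches)                         # (2, 196, 4096), ready for the LLM
```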
Abstract: Hyperspectral target detection (HTD) identifies objects of interest from complex backgrounds at the pixel level, playing a vital role in Earth observation. However, HTD faces challenges due to limited prior knowledge and spectral variation, leading to underfit models and unreliable performance. To address these challenges, this paper proposes an efficient self-supervised HTD method with a pyramid state space model (SSM), named HTD-Mamba, which employs spectrally contrastive learning to distinguish targets from background based on a similarity measurement of intrinsic features. Specifically, to obtain sufficient training samples and leverage spatial contextual information, we propose a spatial-encoded spectral augmentation technique that encodes all surrounding pixels within a patch into a transformed view of the central pixel. Additionally, to exploit global band correlations, we divide each pixel's spectrum into continuous group-wise spectral embeddings and, for the first time, introduce Mamba to HTD to model long-range dependencies of the spectral sequence with linear complexity. Furthermore, to alleviate spectral variation and enhance robust representation, we propose a pyramid SSM backbone that captures and fuses multiresolution spectral intrinsic features. Extensive experiments conducted on four public datasets demonstrate that the proposed method outperforms state-of-the-art methods in both quantitative and qualitative evaluations. Code is available at \url{https://github.com/shendb2022/HTD-Mamba}.
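As an illustration of the group-wise spectral embedding step, the sketch below splits a pixel's spectrum into contiguous band groups and embeds them as a short token sequence for a sequence backbone. Since the pyramid SSM is a custom Mamba-based design, a GRU stands in for the spectral-sequence model here; all names and dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class GroupSpectralEmbedding(nn.Module):
    """Splits each pixel's spectrum into contiguous band groups and embeds each group,
    yielding a short sequence that a state space model can scan with linear complexity."""
    def __init__(self, num_bands: int, group_size: int, embed_dim: int):
        super().__init__()
        assert num_bands % group_size == 0, "pad the spectrum so bands divide evenly"
        self.group_size = group_size
        self.embed = nn.Linear(group_size, embed_dim)

    def forward(self, spectra: torch.Tensor) -> torch.Tensor:
        # spectra: (batch, num_bands) -> (batch, num_groups, group_size)
        b, nb = spectra.shape
        groups = spectra.view(b, nb // self.group_size, self.group_size)
        return self.embed(groups)  # (batch, num_groups, embed_dim)

# Stand-in for the pyramid SSM backbone: any sequence model over band groups.
# The actual method would use Mamba blocks at multiple spectral resolutions.
backbone = nn.GRU(input_size=64, hidden_size=64, batch_first=True)

spectra = torch.randn(8, 200)                          # 8 pixels, 200 bands
tokens = GroupSpectralEmbedding(200, 10, 64)(spectra)  # (8, 20, 64)
feats, _ = backbone(tokens)
pixel_repr = feats[:, -1]                              # one intrinsic feature per pixel
```

In the actual method, such per-pixel features extracted from the original spectrum and its spatially encoded augmented view would form the positive pairs for the spectrally contrastive loss.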
Abstract: Hyperspectral target detection excels at finding dim and small objects based on their spectral characteristics. However, existing representation-based methods are hindered by an unknown background dictionary and insufficient use of spatial information. To address these issues, this paper proposes an efficient optimization approach based on low-rank representation (LRR) and graph Laplacian regularization (GLR). First, to obtain a complete and pure background dictionary, we propose an LRR-based background subspace learning method that jointly mines the low-dimensional structure of all pixels. Second, to fully exploit local spatial relationships and capture the underlying geometric structure, a local region-based GLR is employed to estimate the representation coefficients. Finally, the detection map is generated by computing the ratio of representation errors from binary hypothesis testing. Experiments conducted on two benchmark datasets validate the effectiveness and superiority of the proposed approach. For reproduction, the accompanying code is available at https://github.com/shendb2022/LRBSL-GLR.
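The final detection step, the ratio of representation errors under the two hypotheses, can be sketched in a few lines. This toy version solves for coefficients with plain least squares; the actual method additionally learns the background dictionary via LRR and regularizes the coefficients with a local graph Laplacian, both omitted here. All variable names are illustrative.

```python
import numpy as np

def detection_map(X: np.ndarray, B: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Ratio of representation errors from binary hypothesis testing.
    X: (bands, pixels) test pixels; B: (bands, k) background dictionary;
    t: (bands,) target spectrum. Larger scores indicate likely targets."""
    def resid(D: np.ndarray) -> np.ndarray:
        # Least-squares representation error ||x - D a||_2 for each pixel
        A, *_ = np.linalg.lstsq(D, X, rcond=None)
        return np.linalg.norm(X - D @ A, axis=0)
    e_bg = resid(B)                        # H0: background dictionary only
    e_bt = resid(np.column_stack([B, t]))  # H1: background plus target atom
    return e_bg / (e_bt + 1e-12)           # ratio grows where t explains the pixel

# Toy usage with random data; in practice B comes from the LRR-based
# background subspace learning and X from the hyperspectral image.
rng = np.random.default_rng(0)
B = rng.random((50, 8))
t = rng.random(50)
X = rng.random((50, 100))
scores = detection_map(X, B, t)  # (100,) detection map, one score per pixel
```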