Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xudong Li

Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs

May 28, 2025

Xudong Li, Mengdan Zhang, Peixian Chen, Xiawu Zheng, Yan Zhang, Jingyuan Zheng, Yunhang Shen, Ke Li, Chaoyou Fu, Xing Sun(+1 more)

Abstract:Multi-modal Large Language Models (MLLMs) excel at single-image tasks but struggle with multi-image understanding due to cross-modal misalignment, leading to hallucinations (context omission, conflation, and misinterpretation). Existing methods using Direct Preference Optimization (DPO) constrain optimization to a solitary image reference within the input sequence, neglecting holistic context modeling. We propose Context-to-Cue Direct Preference Optimization (CcDPO), a multi-level preference optimization framework that enhances per-image perception in multi-image settings by zooming into visual clues -- from sequential context to local details. It features: (i) Context-Level Optimization : Re-evaluates cognitive biases underlying MLLMs' multi-image context comprehension and integrates a spectrum of low-cost global sequence preferences for bias mitigation. (ii) Needle-Level Optimization : Directs attention to fine-grained visual details through region-targeted visual prompts and multimodal preference supervision. To support scalable optimization, we also construct MultiScope-42k, an automatically generated dataset with high-quality multi-level preference pairs. Experiments show that CcDPO significantly reduces hallucinations and yields consistent performance gains across general single- and multi-image tasks.

Via

Access Paper or Ask Questions

Channel Fingerprint Construction for Massive MIMO: A Deep Conditional Generative Approach

May 12, 2025

Zhenzhou Jin, Li You, Xudong Li, Zhen Gao, Yuanwei Liu, Xiang-Gen Xia, Xiqi Gao

Abstract:Accurate channel state information (CSI) acquisition for massive multiple-input multiple-output (MIMO) systems is essential for future mobile communication networks. Channel fingerprint (CF), also referred to as channel knowledge map, is a key enabler for intelligent environment-aware communication and can facilitate CSI acquisition. However, due to the cost limitations of practical sensing nodes and test vehicles, the resulting CF is typically coarse-grained, making it insufficient for wireless transceiver design. In this work, we introduce the concept of CF twins and design a conditional generative diffusion model (CGDM) with strong implicit prior learning capabilities as the computational core of the CF twin to establish the connection between coarse- and fine-grained CFs. Specifically, we employ a variational inference technique to derive the evidence lower bound (ELBO) for the log-marginal distribution of the observed fine-grained CF conditioned on the coarse-grained CF, enabling the CGDM to learn the complicated distribution of the target data. During the denoising neural network optimization, the coarse-grained CF is introduced as side information to accurately guide the conditioned generation of the CGDM. To make the proposed CGDM lightweight, we further leverage the additivity of network layers and introduce a one-shot pruning approach along with a multi-objective knowledge distillation technique. Experimental results show that the proposed approach exhibits significant improvement in reconstruction performance compared to the baselines. Additionally, zero-shot testing on reconstruction tasks with different magnification factors further demonstrates the scalability and generalization ability of the proposed approach.

* 15 pages, 7 figures

Via

Access Paper or Ask Questions

RIS-assisted Physical Layer Security

Jan 30, 2025

Xudong Li, Matthias Frey, Ehsan Tohidi, Igor Bjelaković, Sławomir Stańczak

Figure 1 for RIS-assisted Physical Layer Security

Figure 2 for RIS-assisted Physical Layer Security

Figure 3 for RIS-assisted Physical Layer Security

Figure 4 for RIS-assisted Physical Layer Security

Abstract:We propose a reconfigurable intelligent surface (RIS)-assisted wiretap channel, where the RIS is strategically deployed to provide a spatial separation to the transmitter, and orthogonal combiners are employed at the legitimate receiver to extract the data streams from the direct and RIS-assisted links. Then we derive the achievable secrecy rate under semantic security for the RIS-assisted channel and design an algorithm for the secrecy rate optimization problem. The simulation results show the effects of total transmit power, the location and number of eavesdroppers on the security performance.

Via

Access Paper or Ask Questions

Alternating minimization for square root principal component pursuit

Dec 31, 2024

Shengxiang Deng, Xudong Li, Yangjing Zhang

Abstract:Recently, the square root principal component pursuit (SRPCP) model has garnered significant research interest. It is shown in the literature that the SRPCP model guarantees robust matrix recovery with a universal, constant penalty parameter. While its statistical advantages are well-documented, the computational aspects from an optimization perspective remain largely unexplored. In this paper, we focus on developing efficient optimization algorithms for solving the SRPCP problem. Specifically, we propose a tuning-free alternating minimization (AltMin) algorithm, where each iteration involves subproblems enjoying closed-form optimal solutions. Additionally, we introduce techniques based on the variational formulation of the nuclear norm and Burer-Monteiro decomposition to further accelerate the AltMin method. Extensive numerical experiments confirm the efficiency and robustness of our algorithms.

* 32 pages, 2 figures

Via

Access Paper or Ask Questions

Continuous Gaussian Process Pre-Optimization for Asynchronous Event-Inertial Odometry

Dec 12, 2024

Zhixiang Wang, Xudong Li, Yizhai Zhang, Fan Zhang, Panfeng Huang

Abstract:Event cameras, as bio-inspired sensors, are asynchronously triggered with high-temporal resolution compared to intensity cameras. Recent work has focused on fusing the event measurements with inertial measurements to enable ego-motion estimation in high-speed and HDR environments. However, existing methods predominantly rely on IMU preintegration designed mainly for synchronous sensors and discrete-time frameworks. In this paper, we propose a continuous-time preintegration method based on the Temporal Gaussian Process (TGP) called GPO. Concretely, we model the preintegration as a time-indexed motion trajectory and leverage an efficient two-step optimization to initialize the precision preintegration pseudo-measurements. Our method realizes a linear and constant time cost for initialization and query, respectively. To further validate the proposal, we leverage the GPO to design an asynchronous event-inertial odometry and compare with other asynchronous fusion schemes within the same odometry system. Experiments conducted on both public and own-collected datasets demonstrate that the proposed GPO offers significant advantages in terms of precision and efficiency, outperforming existing approaches in handling asynchronous sensor fusion.

* 8pages

Via

Access Paper or Ask Questions

Asynchronous Event-Inertial Odometry using a Unified Gaussian Process Regression Framework

Dec 04, 2024

Xudong Li, Zhixiang Wang, Zihao Liu, Yizhai Zhang, Fan Zhang, Xiuming Yao, Panfeng Huang

Figure 1 for Asynchronous Event-Inertial Odometry using a Unified Gaussian Process Regression Framework

Figure 2 for Asynchronous Event-Inertial Odometry using a Unified Gaussian Process Regression Framework

Figure 3 for Asynchronous Event-Inertial Odometry using a Unified Gaussian Process Regression Framework

Figure 4 for Asynchronous Event-Inertial Odometry using a Unified Gaussian Process Regression Framework

Abstract:Recent works have combined monocular event camera and inertial measurement unit to estimate the $SE(3)$ trajectory. However, the asynchronicity of event cameras brings a great challenge to conventional fusion algorithms. In this paper, we present an asynchronous event-inertial odometry under a unified Gaussian Process (GP) regression framework to naturally fuse asynchronous data associations and inertial measurements. A GP latent variable model is leveraged to build data-driven motion prior and acquire the analytical integration capacity. Then, asynchronous event-based feature associations and integral pseudo measurements are tightly coupled using the same GP framework. Subsequently, this fusion estimation problem is solved by underlying factor graph in a sliding-window manner. With consideration of sparsity, those historical states are marginalized orderly. A twin system is also designed for comparison, where the traditional inertial preintegration scheme is embedded in the GP-based framework to replace the GP latent variable model. Evaluations on public event-inertial datasets demonstrate the validity of both systems. Comparison experiments show competitive precision compared to the state-of-the-art synchronous scheme.

* Accepted at IEEE IROS 2024

Via

Access Paper or Ask Questions

AsynEIO: Asynchronous Monocular Event-Inertial Odometry Using Gaussian Process Regression

Nov 19, 2024

Zhixiang Wang, Xudong Li, Yizhai Zhang, Fan Zhang, Panfeng

Abstract:Event cameras, when combined with inertial sensors, show significant potential for motion estimation in challenging scenarios, such as high-speed maneuvers and low-light environments. There are many methods for producing such estimations, but most boil down to a synchronous discrete-time fusion problem. However, the asynchronous nature of event cameras and their unique fusion mechanism with inertial sensors remain underexplored. In this paper, we introduce a monocular event-inertial odometry method called AsynEIO, designed to fuse asynchronous event and inertial data within a unified Gaussian Process (GP) regression framework. Our approach incorporates an event-driven frontend that tracks feature trajectories directly from raw event streams at a high temporal resolution. These tracked feature trajectories, along with various inertial factors, are integrated into the same GP regression framework to enable asynchronous fusion. With deriving analytical residual Jacobians and noise models, our method constructs a factor graph that is iteratively optimized and pruned using a sliding-window optimizer. Comparative assessments highlight the performance of different inertial fusion strategies, suggesting optimal choices for varying conditions. Experimental results on both public datasets and our own event-inertial sequences indicate that AsynEIO outperforms existing methods, especially in high-speed and low-illumination scenarios.

* Submitted to IEEE (2024-11-4)

Via

Access Paper or Ask Questions

Scale Contrastive Learning with Selective Attentions for Blind Image Quality Assessment

Nov 13, 2024

Zihao Huang, Xudong Li, Bohan Fu, Xiaohui Chu, Ke Li, Yunhang Shen, Yan Zhang

Figure 1 for Scale Contrastive Learning with Selective Attentions for Blind Image Quality Assessment

Figure 2 for Scale Contrastive Learning with Selective Attentions for Blind Image Quality Assessment

Figure 3 for Scale Contrastive Learning with Selective Attentions for Blind Image Quality Assessment

Figure 4 for Scale Contrastive Learning with Selective Attentions for Blind Image Quality Assessment

Abstract:Blind image quality assessment (BIQA) serves as a fundamental task in computer vision, yet it often fails to consistently align with human subjective perception. Recent advances show that multi-scale evaluation strategies are promising due to their ability to replicate the hierarchical structure of human vision. However, the effectiveness of these strategies is limited by a lack of understanding of how different image scales influence perceived quality. This paper addresses two primary challenges: the significant redundancy of information across different scales, and the confusion caused by combining features from these scales, which may vary widely in quality. To this end, a new multi-scale BIQA framework is proposed, namely Contrast-Constrained Scale-Focused IQA Framework (CSFIQA). CSFIQA features a selective focus attention mechanism to minimize information redundancy and highlight critical quality-related information. Additionally, CSFIQA includes a scale-level contrastive learning module equipped with a noise sample matching mechanism to identify quality discrepancies across the same image content at different scales. By exploring the intrinsic relationship between image scales and the perceived quality, the proposed CSFIQA achieves leading performance on eight benchmark datasets, e.g., achieving SRCC values of 0.967 (versus 0.947 in CSIQ) and 0.905 (versus 0.876 in LIVEC).

Via

Access Paper or Ask Questions

Boosting CLIP Adaptation for Image Quality Assessment via Meta-Prompt Learning and Gradient Regularization

Sep 09, 2024

Xudong Li, Zihao Huang, Runze Hu, Yan Zhang, Liujuan Cao, Rongrong Ji

Figure 1 for Boosting CLIP Adaptation for Image Quality Assessment via Meta-Prompt Learning and Gradient Regularization

Figure 2 for Boosting CLIP Adaptation for Image Quality Assessment via Meta-Prompt Learning and Gradient Regularization

Figure 3 for Boosting CLIP Adaptation for Image Quality Assessment via Meta-Prompt Learning and Gradient Regularization

Figure 4 for Boosting CLIP Adaptation for Image Quality Assessment via Meta-Prompt Learning and Gradient Regularization

Abstract:Image Quality Assessment (IQA) remains an unresolved challenge in the field of computer vision, due to complex distortion conditions, diverse image content, and limited data availability. The existing Blind IQA (BIQA) methods heavily rely on extensive human annotations to train models, which is both labor-intensive and costly due to the demanding nature of creating IQA datasets. To mitigate the dependence on labeled samples, this paper introduces a novel Gradient-Regulated Meta-Prompt IQA Framework (GRMP-IQA). This framework aims to fast adapt the powerful visual-language pre-trained model, CLIP, to downstream IQA tasks, significantly improving accuracy in scenarios with limited data. Specifically, the GRMP-IQA comprises two key modules: Meta-Prompt Pre-training Module and Quality-Aware Gradient Regularization. The Meta Prompt Pre-training Module leverages a meta-learning paradigm to pre-train soft prompts with shared meta-knowledge across different distortions, enabling rapid adaptation to various IQA tasks. On the other hand, the Quality-Aware Gradient Regularization is designed to adjust the update gradients during fine-tuning, focusing the model's attention on quality-relevant features and preventing overfitting to semantic information. Extensive experiments on five standard BIQA datasets demonstrate the superior performance to the state-of-the-art BIQA methods under limited data setting, i.e., achieving SRCC values of 0.836 (vs. 0.760 on LIVEC) and 0.853 (vs. 0.812 on KonIQ). Notably, utilizing just 20\% of the training data, our GRMP-IQA outperforms most existing fully supervised BIQA methods.

Via

Access Paper or Ask Questions

Decoupling Feature Representations of Ego and Other Modalities for Incomplete Multi-modal Brain Tumor Segmentation

Aug 16, 2024

Kaixiang Yang, Wenqi Shan, Xudong Li, Xuan Wang, Xikai Yang, Xi Wang, Pheng-Ann Heng, Qiang Li, Zhiwei Wang

Abstract:Multi-modal brain tumor segmentation typically involves four magnetic resonance imaging (MRI) modalities, while incomplete modalities significantly degrade performance. Existing solutions employ explicit or implicit modality adaptation, aligning features across modalities or learning a fused feature robust to modality incompleteness. They share a common goal of encouraging each modality to express both itself and the others. However, the two expression abilities are entangled as a whole in a seamless feature space, resulting in prohibitive learning burdens. In this paper, we propose DeMoSeg to enhance the modality adaptation by Decoupling the task of representing the ego and other Modalities for robust incomplete multi-modal Segmentation. The decoupling is super lightweight by simply using two convolutions to map each modality onto four feature sub-spaces. The first sub-space expresses itself (Self-feature), while the remaining sub-spaces substitute for other modalities (Mutual-features). The Self- and Mutual-features interactively guide each other through a carefully-designed Channel-wised Sparse Self-Attention (CSSA). After that, a Radiologist-mimic Cross-modality expression Relationships (RCR) is introduced to have available modalities provide Self-feature and also `lend' their Mutual-features to compensate for the absent ones by exploiting the clinical prior knowledge. The benchmark results on BraTS2020, BraTS2018 and BraTS2015 verify the DeMoSeg's superiority thanks to the alleviated modality adaptation difficulty. Concretely, for BraTS2020, DeMoSeg increases Dice by at least 0.92%, 2.95% and 4.95% on whole tumor, tumor core and enhanced tumor regions, respectively, compared to other state-of-the-arts. Codes are at https://github.com/kk42yy/DeMoSeg

* 8 pages, 4 figures

Via

Access Paper or Ask Questions