Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xiaoning Song

OCCO: LVM-guided Infrared and Visible Image Fusion Framework based on Object-aware and Contextual COntrastive Learning

Mar 24, 2025

Hui Li, Congcong Bian, Zeyang Zhang, Xiaoning Song, Xi Li, Xiao-Jun Wu

Abstract:Image fusion is a crucial technique in the field of computer vision, and its goal is to generate high-quality fused images and improve the performance of downstream tasks. However, existing fusion methods struggle to balance these two factors. Achieving high quality in fused images may result in lower performance in downstream visual tasks, and vice versa. To address this drawback, a novel LVM (large vision model)-guided fusion framework with Object-aware and Contextual COntrastive learning is proposed, termed as OCCO. The pre-trained LVM is utilized to provide semantic guidance, allowing the network to focus solely on fusion tasks while emphasizing learning salient semantic features in form of contrastive learning. Additionally, a novel feature interaction fusion network is also designed to resolve information conflicts in fusion images caused by modality differences. By learning the distinction between positive samples and negative samples in the latent feature space (contextual space), the integrity of target information in fused image is improved, thereby benefiting downstream performance. Finally, compared with eight state-of-the-art methods on four datasets, the effectiveness of the proposed method is validated, and exceptional performance is also demonstrated on downstream visual task.

Via

Access Paper or Ask Questions

Learning a Unified Degradation-aware Representation Model for Multi-modal Image Fusion

Mar 10, 2025

Haolong Ma, Hui Li, Chunyang Cheng, Zeyang Zhang, Xiaoning Song, Xiao-Jun Wu

Figure 1 for Learning a Unified Degradation-aware Representation Model for Multi-modal Image Fusion

Figure 2 for Learning a Unified Degradation-aware Representation Model for Multi-modal Image Fusion

Figure 3 for Learning a Unified Degradation-aware Representation Model for Multi-modal Image Fusion

Figure 4 for Learning a Unified Degradation-aware Representation Model for Multi-modal Image Fusion

Abstract:All-in-One Degradation-Aware Fusion Models (ADFMs), a class of multi-modal image fusion models, address complex scenes by mitigating degradations from source images and generating high-quality fused images. Mainstream ADFMs often rely on highly synthetic multi-modal multi-quality images for supervision, limiting their effectiveness in cross-modal and rare degradation scenarios. The inherent relationship among these multi-modal, multi-quality images of the same scene provides explicit supervision for training, but also raises above problems. To address these limitations, we present LURE, a Learning-driven Unified Representation model for infrared and visible Image Fusion, which is degradation-aware. LURE decouples multi-modal multi-quality data at the data level and recouples this relationship in a unified latent feature space (ULFS) by proposing a novel unified loss. This decoupling circumvents data-level limitations of prior models and allows leveraging real-world restoration datasets for training high-quality degradation-aware models, sidestepping above issues. To enhance text-image interaction, we refine image-text interaction and residual structures via Text-Guided Attention (TGA) and an inner residual structure. These enhances text's spatial perception of images and preserve more visual details. Experiments show our method outperforms state-of-the-art (SOTA) methods across general fusion, degradation-aware fusion, and downstream tasks. The code will be publicly available.

Via

Access Paper or Ask Questions

Learning Structure-Supporting Dependencies via Keypoint Interactive Transformer for General Mammal Pose Estimation

Feb 25, 2025

Tianyang Xu, Jiyong Rao, Xiaoning Song, Zhenhua Feng, Xiao-Jun Wu

Abstract:General mammal pose estimation is an important and challenging task in computer vision, which is essential for understanding mammal behaviour in real-world applications. However, existing studies are at their preliminary research stage, which focus on addressing the problem for only a few specific mammal species. In principle, from specific to general mammal pose estimation, the biggest issue is how to address the huge appearance and pose variances for different species. We argue that given appearance context, instance-level prior and the structural relation among keypoints can serve as complementary evidence. To this end, we propose a Keypoint Interactive Transformer (KIT) to learn instance-level structure-supporting dependencies for general mammal pose estimation. Specifically, our KITPose consists of two coupled components. The first component is to extract keypoint features and generate body part prompts. The features are supervised by a dedicated generalised heatmap regression loss (GHRL). Instead of introducing external visual/text prompts, we devise keypoints clustering to generate body part biases, aligning them with image context to generate corresponding instance-level prompts. Second, we propose a novel interactive transformer that takes feature slices as input tokens without performing spatial splitting. In addition, to enhance the capability of the KIT model, we design an adaptive weight strategy to address the imbalance issue among different keypoints.

* accepted by IJCV 2025

Via

Access Paper or Ask Questions

CDeFuse: Continuous Decomposition for Infrared and Visible Image Fusion

Jun 07, 2024

Haolong Ma, Hui Li, Chunyang Cheng, Xiaoning Song, Zhongwei Shen

Abstract:As a common image processing technique, image decomposition is often used to extract complementary information between modalities. In current decomposition-based image fusion methods, typically, source images are decomposed into three parts at single scale (i.e., visible-exclusive part, infrared-exclusive part, and common part) and lacking interaction between modalities during the decomposition process. These results in the inability of fusion images to effectively focus on finer complementary information between modalities at various scales. To address the above issue, a novel decomposition mechanism, Continuous Decomposition Fusion (CDeFuse), is proposed. Firstly, CDeFuse extends the original three-part decomposition to a more general K-part decomposition at each scale through similarity constraints to fuse multi-scale information and achieve a finer representation of decomposition features. Secondly, a Continuous Decomposition Module (CDM) is introduced to assist K-part decomposition. Its core component, State Transformer (ST), efficiently captures complementary information between modalities by utilizing multi-head self-attention mechanism. Finally, a novel decomposition loss function and the corresponding computational optimization strategy are utilized to ensure the smooth progress of the decomposition process while maintaining linear growth in time complexity with the number of decomposition results K. Extensive experiments demonstrate that our CDeFuse achieves comparable performance compared to previous methods. The code will be publicly available.

Via

Access Paper or Ask Questions

S4Fusion: Saliency-aware Selective State Space Model for Infrared Visible Image Fusion

Jun 03, 2024

Haolong Ma, Hui Li, Chunyang Cheng, Gaoang Wang, Xiaoning Song, Xiaojun Wu

Abstract:As one of the tasks in Image Fusion, Infrared and Visible Image Fusion aims to integrate complementary information captured by sensors of different modalities into a single image. The Selective State Space Model (SSSM), known for its ability to capture long-range dependencies, has demonstrated its potential in the field of computer vision. However, in image fusion, current methods underestimate the potential of SSSM in capturing the global spatial information of both modalities. This limitation prevents the simultaneous consideration of the global spatial information from both modalities during interaction, leading to a lack of comprehensive perception of salient targets. Consequently, the fusion results tend to bias towards one modality instead of adaptively preserving salient targets. To address this issue, we propose the Saliency-aware Selective State Space Fusion Model (S4Fusion). In our S4Fusion, the designed Cross-Modal Spatial Awareness Module (CMSA) can simultaneously focus on global spatial information from both modalities while facilitating their interaction, thereby comprehensively capturing complementary information. Additionally, S4Fusion leverages a pre-trained network to perceive uncertainty in the fused images. By minimizing this uncertainty, S4Fusion adaptively highlights salient targets from both images. Extensive experiments demonstrate that our approach produces high-quality images and enhances performance in downstream tasks.

Via

Access Paper or Ask Questions

CoMoFusion: Fast and High-quality Fusion of Infrared and Visible Image with Consistency Model

May 31, 2024

Zhiming Meng, Hui Li, Zeyang Zhang, Zhongwei Shen, Yunlong Yu, Xiaoning Song, Xiaojun Wu

Abstract:Generative models are widely utilized to model the distribution of fused images in the field of infrared and visible image fusion. However, current generative models based fusion methods often suffer from unstable training and slow inference speed. To tackle this problem, a novel fusion method based on consistency model is proposed, termed as CoMoFusion, which can generate the high-quality images and achieve fast image inference speed. In specific, the consistency model is used to construct multi-modal joint features in the latent space with the forward and reverse process. Then, the infrared and visible features extracted by the trained consistency model are fed into fusion module to generate the final fused image. In order to enhance the texture and salient information of fused images, a novel loss based on pixel value selection is also designed. Extensive experiments on public datasets illustrate that our method obtains the SOTA fusion performance compared with the existing fusion methods.

Via

Access Paper or Ask Questions

LE2Fusion: A novel local edge enhancement module for infrared and visible image fusion

May 27, 2023

Yongbiao Xiao, Hui Li, Chunyang Cheng, Xiaoning Song

Figure 1 for LE2Fusion: A novel local edge enhancement module for infrared and visible image fusion

Figure 2 for LE2Fusion: A novel local edge enhancement module for infrared and visible image fusion

Figure 3 for LE2Fusion: A novel local edge enhancement module for infrared and visible image fusion

Figure 4 for LE2Fusion: A novel local edge enhancement module for infrared and visible image fusion

Abstract:Infrared and visible image fusion task aims to generate a fused image which contains salient features and rich texture details from multi-source images. However, under complex illumination conditions, few algorithms pay attention to the edge information of local regions which is crucial for downstream tasks. To this end, we propose a fusion network based on the local edge enhancement, named LE2Fusion. Specifically, a local edge enhancement (LE2) module is proposed to improve the edge information under complex illumination conditions and preserve the essential features of image. For feature extraction, a multi-scale residual attention (MRA) module is applied to extract rich features. Then, with LE2, a set of enhancement weights are generated which are utilized in feature fusion strategy and used to guide the image reconstruction. To better preserve the local detail information and structure information, the pixel intensity loss function based on the local region is also presented. The experiments demonstrate that the proposed method exhibits better fusion performance than the state-of-the-art fusion methods on public datasets.

Via

Access Paper or Ask Questions

WavePF: A Novel Fusion Approach based on Wavelet-guided Pooling for Infrared and Visible Images

May 27, 2023

Hui Li, Yongbiao Xiao, Chunyang Cheng, Zhongwei Shen, Xiaoning Song

Figure 1 for WavePF: A Novel Fusion Approach based on Wavelet-guided Pooling for Infrared and Visible Images

Figure 2 for WavePF: A Novel Fusion Approach based on Wavelet-guided Pooling for Infrared and Visible Images

Figure 3 for WavePF: A Novel Fusion Approach based on Wavelet-guided Pooling for Infrared and Visible Images

Figure 4 for WavePF: A Novel Fusion Approach based on Wavelet-guided Pooling for Infrared and Visible Images

Abstract:Infrared and visible image fusion aims to generate synthetic images simultaneously containing salient features and rich texture details, which can be used to boost downstream tasks. However, existing fusion methods are suffering from the issues of texture loss and edge information deficiency, which result in suboptimal fusion results. Meanwhile, the straight-forward up-sampling operator can not well preserve the source information from multi-scale features. To address these issues, a novel fusion network based on the wavelet-guided pooling (wave-pooling) manner is proposed, termed as WavePF. Specifically, a wave-pooling based encoder is designed to extract multi-scale image and detail features of source images at the same time. In addition, the spatial attention model is used to aggregate these salient features. After that, the fused features will be reconstructed by the decoder, in which the up-sampling operator is replaced by the wave-pooling reversed operation. Different from the common max-pooling technique, image features after the wave-pooling layer can retain abundant details information, which can benefit the fusion process. In this case, rich texture details and multi-scale information can be maintained during the reconstruction phase. The experimental results demonstrate that our method exhibits superior fusion performance over the state-of-the-arts on multiple image fusion benchmarks

Via

Access Paper or Ask Questions

LabelPrompt: Effective Prompt-based Learning for Relation Classification

Feb 16, 2023

Wenjie Zhang, Xiaoning Song, Zhenhua Feng, Tianyang Xu, Xiaojun Wu

Abstract:Recently, prompt-based learning has become a very popular solution in many Natural Language Processing (NLP) tasks by inserting a template into model input, which converts the task into a cloze-style one to smoothing out differences between the Pre-trained Language Model (PLM) and the current task. But in the case of relation classification, it is difficult to map the masked output to the relation labels because of its abundant semantic information, e.g. org:founded_by''. Therefore, a pre-trained model still needs enough labelled data to fit the relations. To mitigate this challenge, in this paper, we present a novel prompt-based learning method, namely LabelPrompt, for the relation classification task. It is an extraordinary intuitive approach by a motivation: ``GIVE MODEL CHOICES!''. First, we define some additional tokens to represent the relation labels, which regards these tokens as the verbalizer with semantic initialisation and constructs them with a prompt template method. Then we revisit the inconsistency of the predicted relation and the given entities, an entity-aware module with the thought of contrastive learning is designed to mitigate the problem. At last, we apply an attention query strategy to self-attention layers to resolve two types of tokens, prompt tokens and sequence tokens. The proposed strategy effectively improves the adaptation capability of prompt-based learning in the relation classification task when only a small labelled data is available. Extensive experimental results obtained on several bench-marking datasets demonstrate the superiority of the proposed LabelPrompt method, particularly in the few-shot scenario.

Via

Access Paper or Ask Questions

NFLAT: Non-Flat-Lattice Transformer for Chinese Named Entity Recognition

May 19, 2022

Shuang Wu, Xiaoning Song, Zhenhua Feng, Xiao-Jun Wu

Figure 1 for NFLAT: Non-Flat-Lattice Transformer for Chinese Named Entity Recognition

Figure 2 for NFLAT: Non-Flat-Lattice Transformer for Chinese Named Entity Recognition

Figure 3 for NFLAT: Non-Flat-Lattice Transformer for Chinese Named Entity Recognition

Figure 4 for NFLAT: Non-Flat-Lattice Transformer for Chinese Named Entity Recognition

Abstract:Recently, Flat-LAttice Transformer (FLAT) has achieved great success in Chinese Named Entity Recognition (NER). FLAT performs lexical enhancement by constructing flat lattices, which mitigates the difficulties posed by blurred word boundaries and the lack of word semantics. In FLAT, the positions of starting and ending characters are used to connect a matching word. However, this method is likely to match more words when dealing with long texts, resulting in long input sequences. Therefore, it significantly increases the memory and computational costs of the self-attention module. To deal with this issue, we advocate a novel lexical enhancement method, InterFormer, that effectively reduces the amount of computational and memory costs by constructing non-flat lattices. Furthermore, with InterFormer as the backbone, we implement NFLAT for Chinese NER. NFLAT decouples lexicon fusion and context feature encoding. Compared with FLAT, it reduces unnecessary attention calculations in "word-character" and "word-word". This reduces the memory usage by about 50% and can use more extensive lexicons or higher batches for network training. The experimental results obtained on several well-known benchmarks demonstrate the superiority of the proposed method over the state-of-the-art hybrid (character-word) models.

* 18 pages, 6 figures, 8 tables, Under Review for NeurIPS 2022

Via

Access Paper or Ask Questions