Abstract:CLIP-based domain generalization aims to improve model generalization to unseen domains by leveraging the powerful zero-shot classification capabilities of CLIP together with multiple source datasets. Existing methods typically train a single model across multiple source domains to capture domain-shared information. However, this paradigm inherently suffers from two types of conflicts: 1) sample conflicts, arising from noisy samples and extreme domain shifts among sources; and 2) optimization conflicts, stemming from competition and trade-offs during multi-source training. Both hinder generalization and lead to suboptimal solutions. Recent studies have shown that model merging can effectively mitigate the competition inherent in multi-objective optimization and improve generalization performance. Inspired by these findings, we propose Harmonizing and Merging (HAM), a novel source-model merging framework for CLIP-based domain generalization. During source-model training, HAM enriches the source samples while excluding conflicting ones and harmonizes the update directions of all models. A redundancy-aware historical model merging method is then introduced to effectively integrate knowledge across all source models. HAM comprehensively consolidates source-domain information while enabling mutual enhancement among source models, ultimately yielding a final model with strong generalization capability. Extensive experiments on five widely used benchmark datasets demonstrate the effectiveness of our approach, which achieves state-of-the-art performance.
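To make the merging idea concrete, below is a minimal sketch of combining several fine-tuned source models into one set of weights by averaging their parameter deltas relative to a shared pretrained initialization. The magnitude-based pruning of deltas is an illustrative stand-in for a redundancy-aware rule, not the paper's exact merging procedure; the function and argument names are hypothetical.

```python
# Sketch: merge source models by averaging pruned parameter deltas (assumption,
# not HAM's exact redundancy-aware merging rule).
import copy
import torch


def merge_source_models(pretrained, finetuned_models, keep_ratio=0.7):
    """Return a merged state_dict: pretrained weights + mean of pruned deltas."""
    merged = copy.deepcopy(pretrained.state_dict())
    base = pretrained.state_dict()
    for name, base_param in base.items():
        if not torch.is_floating_point(base_param):
            continue  # skip non-float buffers such as integer counters
        deltas = torch.stack([m.state_dict()[name] - base_param
                              for m in finetuned_models])
        # Keep only the largest-magnitude entries of each delta (assumed proxy
        # for "non-redundant" updates), zeroing the rest.
        k = max(1, int(keep_ratio * deltas[0].numel()))
        flat = deltas.flatten(1)
        thresh = flat.abs().kthvalue(flat.size(1) - k + 1, dim=1).values
        mask = (flat.abs() >= thresh.unsqueeze(1)).float()
        pruned = (flat * mask).view_as(deltas)
        merged[name] = base_param + pruned.mean(dim=0)
    return merged
```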
Abstract:Multi-spectral object re-identification (ReID) brings a new perception perspective to smart city and intelligent transportation applications, effectively addressing challenges from complex illumination and adverse weather. However, the complex modal differences between heterogeneous spectra make it difficult to efficiently exploit the complementary and discrepant information they carry. Most existing methods fuse spectral data through intricate modal interaction modules and lack fine-grained semantic understanding of spectral information (\textit{e.g.}, text descriptions, part masks, and object keypoints). To address this challenge, we propose a novel Identity-Conditional text Prompt Learning framework (ICPL), which exploits the powerful cross-modal alignment capability of CLIP to unify different spectral visual features through text semantics. Specifically, we first propose online prompt learning, which uses a learnable text prompt as an identity-level semantic center to bridge the identity semantics of different spectra in an online manner. Then, in the absence of concrete text descriptions, we propose a multi-spectral identity-condition module that uses an identity prototype as the spectral identity condition to constrain prompt learning. Meanwhile, we construct an alignment loop that mutually optimizes the learnable text prompt and the spectral visual encoder, preventing online prompt learning from disrupting the pre-trained text-image alignment distribution. In addition, to adapt to small-scale multi-spectral data and mitigate style differences between spectra, we propose a multi-spectral adapter that employs a low-rank adaptation method to learn spectra-specific features. Comprehensive experiments on five benchmarks, including RGBNT201, Market-MM, MSVR310, RGBN300, and RGBNT100, demonstrate that the proposed method outperforms state-of-the-art methods.
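As a concrete illustration of the low-rank adaptation idea mentioned above, here is a minimal LoRA-style adapter with one small bottleneck per spectrum on top of a frozen feature projection. The module name, spectrum keys, rank, and scale are illustrative assumptions, not ICPL's actual multi-spectral adapter.

```python
# Sketch: per-spectrum low-rank adapter (assumed design, not ICPL's exact module).
import torch
import torch.nn as nn


class SpectrumLoRAAdapter(nn.Module):
    def __init__(self, dim, rank=4, spectra=("rgb", "nir", "tir"), scale=1.0):
        super().__init__()
        self.scale = scale
        # One low-rank pair (down, up) per spectrum to capture spectrum-specific style.
        self.down = nn.ModuleDict({s: nn.Linear(dim, rank, bias=False) for s in spectra})
        self.up = nn.ModuleDict({s: nn.Linear(rank, dim, bias=False) for s in spectra})
        for s in spectra:
            nn.init.zeros_(self.up[s].weight)  # start as an identity mapping

    def forward(self, x, spectrum):
        # x: (batch, tokens, dim) features from the frozen visual encoder.
        return x + self.scale * self.up[spectrum](self.down[spectrum](x))
```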
Abstract:Existing tracking algorithms typically rely on low-frame-rate RGB cameras coupled with computationally intensive deep neural network architectures to achieve effective tracking. However, such frame-based methods inherently struggle to achieve low-latency performance and often fail in resource-constrained environments. Visual object tracking using bio-inspired event cameras has emerged as a promising research direction in recent years, offering distinct advantages for low-latency applications. In this paper, we propose a novel Slow-Fast Tracking paradigm, termed SFTrack, that flexibly adapts to different operational requirements. The proposed framework supports two complementary modes, i.e., a high-precision slow tracker for scenarios with sufficient computational resources, and an efficient fast tracker tailored for latency-aware, resource-constrained environments. Specifically, our framework first performs graph-based representation learning from high-temporal-resolution event streams, and then integrates the learned graph-structured information into two FlashAttention-based vision backbones, yielding the slow and fast trackers, respectively. The fast tracker achieves low latency through a lightweight network design and by producing multiple bounding box outputs in a single forward pass. Finally, we seamlessly combine both trackers via supervised fine-tuning and further enhance the fast tracker's performance through a knowledge distillation strategy. Extensive experiments on public benchmarks, including FE240, COESOT, and EventVOT, demonstrate the effectiveness and efficiency of our proposed method across different real-world scenarios. The source code has been released at https://github.com/Event-AHU/SlowFast_Event_Track.
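The distillation step can be pictured with a minimal loss that transfers knowledge from the slow (teacher) tracker to the fast (student) tracker, assuming both expose an intermediate feature map and per-frame box predictions. The loss terms, weights, and tensor layouts are illustrative, not SFTrack's exact objective.

```python
# Sketch: slow-to-fast tracker distillation loss (assumed form).
import torch
import torch.nn.functional as F


def distillation_loss(student_feat, teacher_feat, student_boxes, teacher_boxes,
                      feat_weight=1.0, box_weight=1.0):
    # Match intermediate representations (teacher detached, no gradient flows back).
    feat_loss = F.mse_loss(student_feat, teacher_feat.detach())
    # Encourage the fast tracker's boxes (cx, cy, w, h) to follow the slow tracker's.
    box_loss = F.smooth_l1_loss(student_boxes, teacher_boxes.detach())
    return feat_weight * feat_loss + box_weight * box_loss
```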
Abstract:Accurate detection of changes in roads and bridges, such as construction, renovation, and demolition, is essential for urban planning and traffic management. However, existing methods often struggle to extract fine-grained semantic change information due to the lack of high-quality annotated datasets in traffic scenarios. To address this, we introduce the Road and Bridge Semantic Change Detection (RB-SCD) dataset, a comprehensive benchmark comprising 260 pairs of high-resolution remote sensing images from diverse cities and countries. RB-SCD captures 11 types of semantic changes across varied road and bridge structures, enabling detailed structural and functional analysis. Building on this dataset, we propose a novel framework, Multimodal Frequency-Driven Change Detector (MFDCD), which integrates multimodal features in the frequency domain. MFDCD includes a Dynamic Frequency Coupler (DFC) that fuses hierarchical visual features with wavelet-based frequency components, and a Textual Frequency Filter (TFF) that transforms CLIP-derived textual features into the frequency domain and applies graph-based filtering. Experimental results on RB-SCD and three public benchmarks demonstrate the effectiveness of our approach.
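To illustrate the frequency-domain coupling described above, the sketch below fuses a visual feature map with its one-level Haar wavelet sub-bands through a residual connection. The Haar split and the 1x1 fusion convolution are assumed simplifications, not MFDCD's actual Dynamic Frequency Coupler.

```python
# Sketch: coupling spatial features with Haar wavelet sub-bands (assumed design).
import torch
import torch.nn as nn


def haar_decompose(x):
    # x: (B, C, H, W) with even H, W. Returns the approximation and 3 detail bands.
    a, b = x[..., ::2, ::2], x[..., ::2, 1::2]
    c, d = x[..., 1::2, ::2], x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 4   # low-frequency approximation
    lh = (a - b + c - d) / 4   # detail band
    hl = (a + b - c - d) / 4   # detail band
    hh = (a - b - c + d) / 4   # diagonal detail band
    return ll, lh, hl, hh


class FrequencyCoupler(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(channels * 4, channels, kernel_size=1)

    def forward(self, feat):
        bands = torch.cat(haar_decompose(feat), dim=1)  # (B, 4C, H/2, W/2)
        freq = nn.functional.interpolate(self.fuse(bands), size=feat.shape[-2:],
                                         mode="bilinear", align_corners=False)
        return feat + freq  # residual fusion of spatial and frequency cues
```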
Abstract:Event-based Vision Sensors (EVS) have demonstrated significant advantages over traditional RGB frame-based cameras in low-light conditions, high-speed motion capture, and low latency. Consequently, object detection based on EVS has attracted increasing attention from researchers. Current event stream object detection algorithms are typically built upon Convolutional Neural Networks (CNNs) or Transformers, which either capture limited local features using convolutional filters or incur high computational costs due to self-attention. Recently proposed vision heat conduction backbone networks have shown a good balance between efficiency and accuracy; however, these models are not specifically designed for event stream data. They exhibit weak capability in modeling object contour information and fail to exploit the benefits of multi-scale features. To address these issues, this paper proposes a novel dynamic graph induced contour-aware heat conduction network for event-stream-based object detection, termed CvHeat-DET. The proposed model effectively leverages the clear contour information inherent in event streams to predict the thermal diffusivity coefficients within the heat conduction model, and integrates hierarchical structural graph features to enhance feature learning across multiple scales. Extensive experiments on three benchmark datasets for event-stream-based object detection fully validate the effectiveness of the proposed model. The source code will be released at https://github.com/Event-AHU/OpenEvDET.
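The heat-conduction idea can be sketched as one explicit diffusion step on a feature map, where a small convolutional head predicts a per-pixel diffusivity (echoing the abstract's contour-aware coefficients). The discrete Laplacian update and the sigmoid-bounded coefficient are assumed simplifications, not the paper's actual heat conduction operator.

```python
# Sketch: one explicit heat-conduction step with learned diffusivity (assumed form).
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeatConductionStep(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.diffusivity = nn.Sequential(nn.Conv2d(channels, 1, 3, padding=1),
                                         nn.Sigmoid())  # kappa in (0, 1)
        lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        self.register_buffer("laplacian", lap.view(1, 1, 3, 3))

    def forward(self, feat, dt=0.2):
        b, c, h, w = feat.shape
        kappa = self.diffusivity(feat)                                   # (B, 1, H, W)
        lap = F.conv2d(feat.reshape(b * c, 1, h, w), self.laplacian, padding=1)
        lap = lap.reshape(b, c, h, w)
        # u_{t+1} = u_t + dt * kappa * Laplacian(u_t): smooths flat regions while
        # low predicted kappa preserves contour regions.
        return feat + dt * kappa * lap
```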
Abstract:Event cameras have attracted increasing attention in recent years due to their advantages in high dynamic range, high temporal resolution, low power consumption, and low latency. Some researchers have begun exploring pre-training directly on event data. Nevertheless, these efforts often fail to establish strong connections with RGB frames, limiting their applicability in multi-modal fusion scenarios. To address these issues, we propose CM3AE, a novel pre-training framework for RGB-Event perception. The framework accepts multiple modalities/views of data as input, including RGB images, event images, and event voxels, providing robust support for both event-based and RGB-Event fusion-based downstream tasks. Specifically, we design a multi-modal fusion reconstruction module that reconstructs the original image from fused multi-modal features, explicitly enhancing the model's ability to aggregate cross-modal complementary information. Additionally, we employ a multi-modal contrastive learning strategy to align cross-modal feature representations in a shared latent space, which effectively enhances the model's capability for multi-modal understanding and for capturing global dependencies. We construct a large-scale dataset containing 2,535,759 RGB-Event data pairs for pre-training. Extensive experiments on five downstream tasks fully demonstrate the effectiveness of CM3AE. The source code and pre-trained models will be released at https://github.com/Event-AHU/CM3AE.
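The contrastive alignment component can be pictured with a symmetric InfoNCE-style loss between RGB and event embeddings in a shared space. The temperature value and the assumption of one positive pair per batch index are illustrative choices, not CM3AE's exact objective.

```python
# Sketch: symmetric RGB-Event contrastive loss (assumed form of the objective).
import torch
import torch.nn.functional as F


def rgb_event_contrastive_loss(rgb_emb, event_emb, temperature=0.07):
    # rgb_emb, event_emb: (batch, dim) global embeddings from the two encoders.
    rgb = F.normalize(rgb_emb, dim=-1)
    evt = F.normalize(event_emb, dim=-1)
    logits = rgb @ evt.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(rgb.size(0), device=rgb.device)
    # Paired samples on the diagonal are positives; both directions are matched.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```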
Abstract:Remote sensing object detection (RSOD) faces formidable challenges in complex visual environments. Aerial and satellite images inherently suffer from limitations such as low spatial resolution, sensor noise, blurred objects, low-light degradation, and partial occlusions. These degradation factors compromise the feature discriminability of detection models, resulting in three key issues: (1) reduced contrast that hampers foreground-background separation, (2) structural discontinuities in edge representations, and (3) ambiguous feature responses caused by variations in illumination. Together, these issues weaken model robustness and deployment feasibility. To address these challenges, we propose LEGNet, a lightweight network that incorporates a novel edge-Gaussian aggregation (EGA) module specifically designed for low-quality remote sensing images. Our key innovation lies in the synergistic integration of Scharr-operator-based edge priors with uncertainty-aware Gaussian modeling: (a) orientation-aware Scharr filters preserve high-frequency edge details with rotational invariance; (b) uncertainty-aware Gaussian layers probabilistically refine low-confidence features through variance estimation. This design improves precision while maintaining architectural simplicity. Comprehensive evaluations across four RSOD benchmarks (DOTA-v1.0, DOTA-v1.5, DIOR-R, FAIR1M-v1.0) and a UAV-view dataset (VisDrone2019) demonstrate that LEGNet achieves state-of-the-art performance while remaining computationally efficient, making it well-suited for deployment on resource-constrained edge devices in real-world remote sensing applications. The code is available at https://github.com/lwCVer/LEGNet.
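A minimal sketch of the two ingredients named above is given below: depthwise Scharr-filter edge responses as a prior, and a Gaussian (mean/variance) branch whose exp(-variance) gate down-weights low-confidence features. The module name, the gating form, and the residual combination are assumptions, not LEGNet's exact EGA module.

```python
# Sketch: Scharr edge prior + uncertainty-gated refinement (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class EdgeGaussianSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        gx = torch.tensor([[3., 0., -3.], [10., 0., -10.], [3., 0., -3.]])
        gy = gx.t().contiguous()
        self.register_buffer("scharr", torch.stack([gx, gy]).unsqueeze(1))  # (2,1,3,3)
        self.mean = nn.Conv2d(channels, channels, 3, padding=1)
        self.log_var = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feat):
        b, c, h, w = feat.shape
        # Depthwise Scharr responses -> per-pixel edge magnitude prior.
        g = F.conv2d(feat.reshape(b * c, 1, h, w), self.scharr, padding=1)
        edge = g.pow(2).sum(dim=1, keepdim=True).sqrt().reshape(b, c, h, w)
        mu, log_var = self.mean(feat), self.log_var(feat)
        confidence = torch.exp(-log_var)             # high variance -> low weight
        return feat + confidence * mu + torch.sigmoid(edge) * feat
```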
Abstract:Existing multimodal tracking studies focus on bi-modal scenarios such as RGB-Thermal, RGB-Event, and RGB-Language. Although promising tracking performance is achieved by leveraging complementary cues from different sources, tracking in complex scenes remains challenging due to the limitations of bi-modal settings. In this work, we introduce a general multimodal visual tracking task that fully exploits the advantages of four modalities, including RGB, thermal infrared, event, and language, for robust tracking under challenging conditions. To provide a comprehensive evaluation platform for general multimodal visual tracking, we construct QuadTrack600, a large-scale, high-quality benchmark comprising 600 video sequences (totaling 384.7K high-resolution (640x480) frame groups). In each frame group, all four modalities are spatially aligned and meticulously annotated with bounding boxes, while 21 sequence-level challenge attributes are provided for detailed performance analysis. Although quad-modal data provide richer information, the differences in information quantity among modalities and the computational burden of processing four modalities are two key challenges in fusing them. To handle these issues, we propose a novel approach called QuadFusion, which incorporates an efficient Multiscale Fusion Mamba with four different scanning scales to achieve sufficient interaction among the four modalities while avoiding an exponential computational burden. Extensive experiments on the QuadTrack600 dataset and three bi-modal tracking datasets, including LasHeR, VisEvent, and TNL2K, validate the effectiveness of our QuadFusion.
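As a rough illustration of multi-scale fusion over four token streams, the sketch below concatenates the modality tokens and mixes them at several scanning scales. A bidirectional GRU stands in for the Mamba state-space block purely to keep the example self-contained; the pooling scales, projection sizes, and even feature dimension (assumed even) are illustrative, not QuadFusion's actual configuration.

```python
# Sketch: multi-scale fusion of four modality token streams (GRU stands in for Mamba).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleQuadFusion(nn.Module):
    def __init__(self, dim, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # dim is assumed even so the bidirectional output matches dim.
        self.mixer = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(dim * len(scales), dim)

    def forward(self, rgb, tir, event, lang):
        # Each input: (batch, tokens, dim). Concatenate modalities into one sequence.
        tokens = torch.cat([rgb, tir, event, lang], dim=1)
        fused_per_scale = []
        for s in self.scales:
            # Downsample the token sequence to a coarser scanning scale.
            coarse = F.avg_pool1d(tokens.transpose(1, 2), kernel_size=s).transpose(1, 2)
            mixed, _ = self.mixer(coarse)
            # Upsample back to the original token length for aggregation.
            up = F.interpolate(mixed.transpose(1, 2), size=tokens.size(1),
                               mode="linear", align_corners=False).transpose(1, 2)
            fused_per_scale.append(up)
        return self.proj(torch.cat(fused_per_scale, dim=-1))
```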
Abstract:Current RGBT tracking methods often overlook the impact of the fusion location on mitigating the modality gap, which is a key factor in effective tracking. Our analysis reveals that shallower fusion yields a smaller distribution gap. However, the limited discriminative power of shallow networks makes it hard to distinguish task-relevant information from noise, limiting the potential of pixel-level fusion. To break these shallow limits, we propose a novel \textbf{T}ask-driven \textbf{P}ixel-level \textbf{F}usion network, named \textbf{TPF}, which unveils the power of pixel-level fusion in RGBT tracking through a progressive learning framework. In particular, we design a lightweight Pixel-level Fusion Adapter (PFA) that exploits Mamba's linear complexity to ensure real-time, low-latency RGBT tracking. To enhance the fusion capability of the PFA, our task-driven progressive learning framework first applies adaptive multi-expert distillation to inherit fusion knowledge from state-of-the-art image fusion models, establishing a robust initialization, and then employs a decoupled representation learning scheme to achieve task-relevant information fusion. Moreover, to cope with appearance variations between the initial template and search frames, we present a nearest-neighbor dynamic template updating scheme, which selects the most reliable frame closest to the current search frame as the dynamic template. Extensive experiments demonstrate that TPF significantly outperforms most existing advanced trackers on four public RGBT tracking datasets. The code will be released upon acceptance.
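The nearest-neighbor template update can be sketched as follows: among recent frames the tracker deemed reliable, pick the one whose embedding is closest to the current search frame. The reliability threshold, cosine metric, and function name are assumptions for illustration, not TPF's exact scheme.

```python
# Sketch: nearest-neighbor dynamic template selection (assumed form).
import torch
import torch.nn.functional as F


def select_dynamic_template(search_feat, history_feats, history_scores,
                            score_thresh=0.6):
    """search_feat: (dim,); history_feats: (N, dim); history_scores: (N,)."""
    reliable = history_scores >= score_thresh
    if not reliable.any():
        return None  # fall back to the initial template
    candidates = history_feats[reliable]
    sims = F.cosine_similarity(candidates, search_feat.unsqueeze(0), dim=-1)
    best = sims.argmax()
    # Map back to the index in the full history buffer.
    return reliable.nonzero(as_tuple=True)[0][best].item()
```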
Abstract:Existing multimodal UAV object detection methods often overlook the impact of semantic gaps between modalities, which makes it difficult to achieve accurate semantic and spatial alignment and limits detection performance. To address this problem, we propose a Large Language Model (LLM) guided Progressive feature Alignment Network, called LPANet, which leverages semantic features extracted from a large language model to guide progressive semantic and spatial alignment between modalities for multimodal UAV object detection. To exploit the powerful semantic representation of the LLM, we generate fine-grained text descriptions of each object category with ChatGPT and then extract their semantic features using the large language model MPNet. Based on these semantic features, we guide the semantic and spatial alignments in a progressive manner as follows. First, we design the Semantic Alignment Module (SAM) to pull the semantic features and the multimodal visual features of each object closer, alleviating semantic differences of objects between modalities. Second, we design the Explicit Spatial alignment Module (ESM), which integrates semantic relations into the estimation of feature-level offsets, alleviating coarse spatial misalignment between modalities. Finally, we design the Implicit Spatial alignment Module (ISM), which leverages cross-modal correlations to aggregate key features from neighboring regions and achieve implicit spatial alignment. Comprehensive experiments on two public multimodal UAV object detection datasets demonstrate that our approach outperforms state-of-the-art multimodal UAV object detectors.
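The semantic-alignment idea can be illustrated with a small loss that pulls the visual features of each object, from both modalities, toward a precomputed text embedding of its category. The projection head and cosine-based loss are illustrative choices, and class_text_emb is assumed to come from an MPNet-style encoder; this is not LPANet's exact SAM formulation.

```python
# Sketch: pulling per-object visual features toward class text embeddings (assumed form).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticAlignment(nn.Module):
    def __init__(self, visual_dim, text_dim):
        super().__init__()
        self.proj = nn.Linear(visual_dim, text_dim)

    def forward(self, rgb_obj_feats, ir_obj_feats, labels, class_text_emb):
        # rgb/ir_obj_feats: (num_objects, visual_dim); labels: (num_objects,)
        # class_text_emb: (num_classes, text_dim), frozen text features.
        target = F.normalize(class_text_emb[labels], dim=-1)
        loss = 0.0
        for feats in (rgb_obj_feats, ir_obj_feats):
            pred = F.normalize(self.proj(feats), dim=-1)
            # 1 - cosine similarity between each object and its class text feature.
            loss = loss + (1.0 - (pred * target).sum(dim=-1)).mean()
        return loss / 2
```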