Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shun Dai

FreDFT: Frequency Domain Fusion Transformer for Visible-Infrared Object Detection

Nov 14, 2025

Wencong Wu, Xiuwei Zhang, Hanlin Yin, Shun Dai, Hongxi Zhang, Yanning Zhang

Figure 1 for FreDFT: Frequency Domain Fusion Transformer for Visible-Infrared Object Detection

Figure 2 for FreDFT: Frequency Domain Fusion Transformer for Visible-Infrared Object Detection

Figure 3 for FreDFT: Frequency Domain Fusion Transformer for Visible-Infrared Object Detection

Figure 4 for FreDFT: Frequency Domain Fusion Transformer for Visible-Infrared Object Detection

Abstract:Visible-infrared object detection has gained sufficient attention due to its detection performance in low light, fog, and rain conditions. However, visible and infrared modalities captured by different sensors exist the information imbalance problem in complex scenarios, which can cause inadequate cross-modal fusion, resulting in degraded detection performance. \textcolor{red}{Furthermore, most existing methods use transformers in the spatial domain to capture complementary features, ignoring the advantages of developing frequency domain transformers to mine complementary information.} To solve these weaknesses, we propose a frequency domain fusion transformer, called FreDFT, for visible-infrared object detection. The proposed approach employs a novel multimodal frequency domain attention (MFDA) to mine complementary information between modalities and a frequency domain feed-forward layer (FDFFL) via a mixed-scale frequency feature fusion strategy is designed to better enhance multimodal features. To eliminate the imbalance of multimodal information, a cross-modal global modeling module (CGMM) is constructed to perform pixel-wise inter-modal feature interaction in a spatial and channel manner. Moreover, a local feature enhancement module (LFEM) is developed to strengthen multimodal local feature representation and promote multimodal feature fusion by using various convolution layers and applying a channel shuffle. Extensive experimental results have verified that our proposed FreDFT achieves excellent performance on multiple public datasets compared with other state-of-the-art methods. The code of our FreDFT is linked at https://github.com/WenCongWu/FreDFT.

Via

Access Paper or Ask Questions

RGB-T Object Detection via Group Shuffled Multi-receptive Attention and Multi-modal Supervision

May 29, 2024

Jinzhong Wang, Xuetao Tian, Shun Dai, Tao Zhuo, Haorui Zeng, Hongjuan Liu, Jiaqi Liu, Xiuwei Zhang, Yanning Zhang

Figure 1 for RGB-T Object Detection via Group Shuffled Multi-receptive Attention and Multi-modal Supervision

Figure 2 for RGB-T Object Detection via Group Shuffled Multi-receptive Attention and Multi-modal Supervision

Figure 3 for RGB-T Object Detection via Group Shuffled Multi-receptive Attention and Multi-modal Supervision

Figure 4 for RGB-T Object Detection via Group Shuffled Multi-receptive Attention and Multi-modal Supervision

Abstract:Multispectral object detection, utilizing both visible (RGB) and thermal infrared (T) modals, has garnered significant attention for its robust performance across diverse weather and lighting conditions. However, effectively exploiting the complementarity between RGB-T modals while maintaining efficiency remains a critical challenge. In this paper, a very simple Group Shuffled Multi-receptive Attention (GSMA) module is proposed to extract and combine multi-scale RGB and thermal features. Then, the extracted multi-modal features are directly integrated with a multi-level path aggregation neck, which significantly improves the fusion effect and efficiency. Meanwhile, multi-modal object detection often adopts union annotations for both modals. This kind of supervision is not sufficient and unfair, since objects observed in one modal may not be seen in the other modal. To solve this issue, Multi-modal Supervision (MS) is proposed to sufficiently supervise RGB-T object detection. Comprehensive experiments on two challenging benchmarks, KAIST and DroneVehicle, demonstrate the proposed model achieves the state-of-the-art accuracy while maintaining competitive efficiency.

Via

Access Paper or Ask Questions