Aerial object detection has been a hot topic for many years due to its wide range of applications. However, most existing approaches can only handle predefined categories, which limits their applicability to the open scenarios encountered in the real world. In this paper, we extend aerial object detection to open scenarios by exploiting the relationship between images and text, and propose OVA-DETR, a high-efficiency open-vocabulary detector for aerial images. Specifically, based on the idea of image-text alignment, we propose a region-text contrastive loss that replaces the category regression loss of the traditional detection framework, thereby removing the category limitation. We then propose Bidirectional Vision-Language Fusion (Bi-VLF), which consists of a dual-attention fusion encoder and a multi-level text-guided fusion decoder. The dual-attention fusion encoder enhances feature extraction in the encoder, while the multi-level text-guided fusion decoder improves the detection of small objects, which frequently appear in aerial scenes. Experimental results on three widely used benchmark datasets show that the proposed method significantly improves mAP and recall while offering faster inference. For instance, in zero-shot detection experiments on DIOR, the proposed OVA-DETR outperforms DescReg and YOLO-World by 37.4% and 33.1%, respectively, while achieving an inference speed of 87 FPS, which is 7.9x faster than DescReg and 3x faster than YOLO-World. The code is available at https://github.com/GT-Wei/OVA-DETR.
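As a rough illustration of the image-text alignment idea behind a region-text contrastive loss, the following minimal PyTorch sketch contrasts region embeddings against text (category-name) embeddings. The function name, shapes, and temperature value are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def region_text_contrastive_loss(region_feats, text_feats, labels, temperature=0.07):
    """Illustrative region-text contrastive loss (assumed form, not the paper's exact formulation).

    region_feats: (N, D) embeddings of predicted regions (e.g., detector queries).
    text_feats:   (C, D) embeddings of category names/prompts from a text encoder.
    labels:       (N,) index of the matched category text for each region.
    """
    # Normalize so the dot product becomes a cosine similarity.
    region_feats = F.normalize(region_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Region-to-text similarity logits, scaled by a temperature.
    logits = region_feats @ text_feats.t() / temperature  # (N, C)

    # Each region is pulled toward its matched category text and pushed away from the others,
    # so the class set can be changed at inference time without a fixed classification head.
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    regions = torch.randn(8, 256)        # 8 region embeddings
    texts = torch.randn(20, 256)         # 20 open-vocabulary category embeddings
    labels = torch.randint(0, 20, (8,))  # matched category index per region
    print(region_text_contrastive_loss(regions, texts, labels))
```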