Title: OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training

Sep 10, 2022

Tiancheng Zhao, Peng Liu, Xiaopeng Lu, Kyusong Lee

Abstract: Advancing object detection to open-vocabulary and few-shot transfer has long been a challenge for computer vision research. This work explores a continual learning approach that enables a detector to expand its zero/few-shot capabilities via multi-dataset vision-language pre-training. Using natural language as the knowledge representation, we explore methods to accumulate a "visual vocabulary" from different training datasets and unify the task as a language-conditioned detection framework. Specifically, we propose a novel language-aware detector, OmDet, and a novel training mechanism. The proposed multimodal detection network resolves the technical challenges of multi-dataset joint training and generalizes to an arbitrary number of training datasets without requiring manual label taxonomy merging. Experimental results on COCO, Pascal VOC, and Wider Face/Pedestrian confirm its efficacy: joint training achieves scores on par with or higher than training on each dataset separately. Moreover, we pre-train on more than 20 million images with an object vocabulary of 4 million unique entries, and evaluate the resulting model on 35 downstream tasks of ODinW. Results show that OmDet achieves state-of-the-art fine-tuned performance on ODinW. Analysis further shows that scaling up the proposed pre-training method continues to improve OmDet's zero/few-shot tuning performance, suggesting a promising direction for further scaling.
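To make the idea of language-conditioned detection without taxonomy merging more concrete, here is a minimal sketch. It is not OmDet's architecture. It assumes region features and label names live in one shared embedding space, and it replaces a real language encoder with a toy hashed-trigram embedding. All names (`embed_label`, `score_regions`, `predict`, `DATASETS`) are hypothetical and chosen only for illustration.

```python
import hashlib
import math

DIM = 64  # toy embedding size (assumption, not from the paper)


def embed_label(name):
    """Toy stand-in for a text encoder: hashed character trigrams, L2-normalised."""
    vec = [0.0] * DIM
    padded = f"  {name.lower().strip()}  "
    for i in range(len(padded) - 2):
        # md5 is used only as a deterministic hash, not for security.
        digest = hashlib.md5(padded[i:i + 3].encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def score_regions(region_feats, label_names):
    """Dot-product score between every region feature and every label name."""
    label_vecs = [embed_label(n) for n in label_names]
    return [
        [sum(r * l for r, l in zip(region, lv)) for lv in label_vecs]
        for region in region_feats
    ]


def predict(region_feats, label_names):
    """Pick the best-scoring label name for each region."""
    scores = score_regions(region_feats, label_names)
    return [
        label_names[max(range(len(row)), key=row.__getitem__)]
        for row in scores
    ]


# Each dataset keeps its own label names; there is no shared integer id space
# and no hand-built mapping between taxonomies (hypothetical label lists).
DATASETS = {
    "coco_like": ["person", "dog", "bicycle"],
    "voc_like": ["person", "cat"],
    "face_like": ["face"],
}

if __name__ == "__main__":
    # Pretend the visual backbone produced a feature that matches "dog".
    fake_region = [embed_label("dog")]
    for dataset, labels in DATASETS.items():
        print(dataset, predict(fake_region, labels))
```

Because the label set is passed in as text at call time, the same scoring function serves every dataset, and a name that appears in two datasets (here "person") maps to the same embedding without any manual merging step.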
