Hongjie Huang

DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

Nov 21, 2024

Tianhe Ren, Yihao Chen, Qing Jiang, Zhaoyang Zeng, Yuda Xiong, Wenlong Liu, Zhengyu Ma, Junyi Shen, Yuan Gao, Xiaoke Jiang(+10 more)

Abstract: In this paper, we introduce DINO-X, which is a unified object-centric vision model developed by IDEA Research with the best open-world object detection performance to date. DINO-X employs the same Transformer-based encoder-decoder architecture as Grounding DINO 1.5 to pursue an object-level representation for open-world object understanding. To make long-tailed object detection easy, DINO-X extends its input options to support text prompt, visual prompt, and customized prompt. With such flexible prompt options, we develop a universal object prompt to support prompt-free open-world detection, making it possible to detect anything in an image without requiring users to provide any prompt. To enhance the model's core grounding capability, we have constructed a large-scale dataset with over 100 million high-quality grounding samples, referred to as Grounding-100M, for advancing the model's open-vocabulary detection performance. Pre-training on such a large-scale grounding dataset leads to a foundational object-level representation, which enables DINO-X to integrate multiple perception heads to simultaneously support multiple object perception and understanding tasks, including detection, segmentation, pose estimation, object captioning, object-based QA, etc. Experimental results demonstrate the superior performance of DINO-X. Specifically, the DINO-X Pro model achieves 56.0 AP, 59.8 AP, and 52.4 AP on the COCO, LVIS-minival, and LVIS-val zero-shot object detection benchmarks, respectively. Notably, it scores 63.3 AP and 56.5 AP on the rare classes of LVIS-minival and LVIS-val benchmarks, both improving the previous SOTA performance by 5.8 AP. Such a result underscores its significantly improved capacity for recognizing long-tailed objects.
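The rare-class claim above implies a previous best score on each benchmark, which the abstract does not state directly. As a rough sketch, the arithmetic can be made explicit as follows; the function name `implied_previous_sota` is hypothetical, and the derived values are only what the quoted numbers imply, not figures reported by the authors.

```python
# Sketch: back out the previous rare-class SOTA implied by the abstract.
# Inputs are the numbers quoted in the abstract; outputs are derived values,
# not results reported by the paper.

def implied_previous_sota(new_ap: float, gain: float) -> float:
    """Return the earlier AP implied by a new AP and a stated improvement."""
    # Round to one decimal place, matching the precision of the quoted scores.
    return round(new_ap - gain, 1)

# Rare-class AP of DINO-X Pro, as quoted in the abstract.
rare_ap = {"LVIS-minival": 63.3, "LVIS-val": 56.5}
GAIN = 5.8  # stated improvement over the previous SOTA on both benchmarks

previous = {name: implied_previous_sota(ap, GAIN) for name, ap in rare_ap.items()}

for name, ap in previous.items():
    print(f"{name}: implied previous best rare-class AP = {ap}")
```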

* Technical Report

Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection

May 16, 2024

Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen(+6 more)

Abstract: This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models developed by IDEA Research, which aims to advance the "Edge" of open-set object detection. The suite encompasses two models: Grounding DINO 1.5 Pro, a high-performance model designed for stronger generalization capability across a wide range of scenarios, and Grounding DINO 1.5 Edge, an efficient model optimized for faster speed demanded in many applications requiring edge deployment. The Grounding DINO 1.5 Pro model advances its predecessor by scaling up the model architecture, integrating an enhanced vision backbone, and expanding the training dataset to over 20 million images with grounding annotations, thereby achieving a richer semantic understanding. The Grounding DINO 1.5 Edge model, while designed for efficiency with reduced feature scales, maintains robust detection capabilities by being trained on the same comprehensive dataset. Empirical results demonstrate the effectiveness of Grounding DINO 1.5, with the Grounding DINO 1.5 Pro model attaining a 54.3 AP on the COCO detection benchmark and a 55.7 AP on the LVIS-minival zero-shot transfer benchmark, setting new records for open-set object detection. Furthermore, the Grounding DINO 1.5 Edge model, when optimized with TensorRT, achieves a speed of 75.2 FPS while attaining a zero-shot performance of 36.2 AP on the LVIS-minival benchmark, making it more suitable for edge computing scenarios. Model examples and demos with API will be released at https://github.com/IDEA-Research/Grounding-DINO-1.5-API
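For edge deployment, the quoted throughput is often easier to reason about as a per-frame time budget. The sketch below converts the 75.2 FPS figure into milliseconds per frame. It assumes frames are processed one at a time with no batching or pipelining, which the abstract does not specify, and the helper name `frame_latency_ms` is hypothetical.

```python
# Sketch: convert the quoted throughput into an approximate per-frame latency.
# Assumption (not stated in the abstract): frames are processed sequentially,
# so latency is simply the reciprocal of throughput.

def frame_latency_ms(fps: float) -> float:
    """Return the average time per frame in milliseconds for a given FPS."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps

edge_fps = 75.2  # Grounding DINO 1.5 Edge with TensorRT, as quoted above
latency = frame_latency_ms(edge_fps)

print(f"{edge_fps} FPS is roughly {latency:.1f} ms per frame")
```

Under the same assumption, a 30 FPS camera stream leaves about 33.3 ms per frame, so the quoted speed would fit within that budget.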

* Technical report

DetectorNet: Transformer-enhanced Spatial Temporal Graph Neural Network for Traffic Prediction

Oct 19, 2021

He Li, Shiyu Zhang, Xuejiao Li, Liangcai Su, Hongjie Huang, Duo Jin, Linghao Chen, Jianbing Huang, Jaesoo Yoo

Abstract: Detectors with high coverage have direct and far-reaching benefits for road users in route planning and avoiding traffic congestion, but utilizing these data presents unique challenges, including dynamic temporal correlation and the dynamic spatial correlation caused by changes in road conditions. Although existing work recognizes the significance of modeling spatial-temporal correlation, what it learns is still a static road network structure, which cannot reflect the dynamic changes of roads and therefore loses much valuable information. To address these challenges, we propose DetectorNet, enhanced by Transformer. Unlike previous studies, our model contains a Multi-view Temporal Attention module and a Dynamic Attention module, which focus on long-distance and short-distance temporal correlation and on dynamic spatial correlation, respectively, by dynamically updating the learned knowledge, so as to make accurate predictions. In addition, experimental results on two public datasets and the comparison results of four ablation experiments show that DetectorNet performs better than eleven advanced baselines.

* The 29th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL 2021)
