Abstract: Open Set Object Detection has developed rapidly in recent years, but it continues to pose significant challenges. Language-based methods must bridge the substantial gap between the textual and visual modalities, which requires extensive computational resources. Although integrating visual prompts into these frameworks shows promise for enhancing performance, such methods remain constrained by textual semantics. In contrast, visual-only methods suffer from low-quality fusion of multiple visual prompts. In response, we introduce a strong DETR-based model, Visual Intersection Network for Open Set Object Detection (VINO), which constructs a multi-image visual bank to preserve the semantic intersections of each category across all time steps. Our multi-image visual updating mechanism learns to identify the semantic intersections from various visual prompts, enabling the flexible incorporation of new information and continuous optimization of feature representations. Our approach yields a more precise alignment between target category semantics and region semantics, while significantly reducing pre-training time and resource demands compared to language-based methods. Furthermore, the integration of a segmentation head illustrates the broad applicability of visual intersection to various visual tasks. VINO, which requires only 7 RTX 4090 GPU-days to complete one epoch on the Objects365v1 dataset, achieves performance on par with vision-language models on benchmarks such as LVIS and ODinW35.
Abstract: LiDAR sensors are critical for autonomous driving and robotics applications due to their ability to provide accurate range measurements and their robustness to lighting conditions. However, airborne particles such as fog, rain, snow, and dust degrade their performance, and such inclement conditions are unavoidable outdoors. A straightforward approach is to remove these particles with supervised semantic segmentation, but annotating them point-wise is too laborious. To address this problem and enhance perception under inclement conditions, we develop two dynamic filtering methods, Dynamic Multi-threshold Noise Removal (DMNR) and DMNR-H, based on an analysis of the position distribution and intensity characteristics of noisy points and clean points in the publicly available WADS and DENSE datasets. Both DMNR and DMNR-H outperform state-of-the-art unsupervised methods by a significant margin on the two datasets and are slightly better than supervised deep learning-based methods. Furthermore, our methods are more robust to different LiDAR sensors and to different airborne particles, such as snow and fog.
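To make the idea of filtering on position and intensity concrete, the following is a minimal sketch of a generic range-adaptive outlier filter. It is not the DMNR or DMNR-H algorithm, whose details the abstract does not give. The function name `dynamic_threshold_filter` and the parameters `base_radius`, `min_neighbors`, and `intensity_max` are hypothetical, and their default values are placeholders rather than tuned settings.

```python
import numpy as np


def dynamic_threshold_filter(points, intensity, base_radius=0.05,
                             min_neighbors=3, intensity_max=0.1):
    """Return a boolean mask that is True for points to keep.

    Illustrative only (not DMNR): a point is treated as airborne-particle
    noise when it is both spatially isolated and low in intensity.
    """
    points = np.asarray(points, dtype=float)        # shape (N, 3)
    intensity = np.asarray(intensity, dtype=float)  # shape (N,)

    # Distance of each point from the sensor, assumed to sit at the origin.
    rng = np.linalg.norm(points, axis=1)

    # The search radius grows with range because LiDAR returns become
    # sparser with distance; this is the "dynamic" threshold.
    radius = np.maximum(base_radius * rng, base_radius)

    # Brute-force pairwise distances, O(N^2): fine for a sketch,
    # a k-d tree would be used for full scans.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=2)

    # Count neighbors inside each point's own radius, excluding itself.
    neighbors = (dist <= radius[:, None]).sum(axis=1) - 1

    isolated = neighbors < min_neighbors
    weak = intensity <= intensity_max
    return ~(isolated & weak)
```

Requiring both conditions is a design choice in this sketch: an isolated but strongly reflecting point (for example, a distant sign) is kept, while sparse, weak returns typical of snow or fog are dropped.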