Topic: Dense Object Detection
What is Dense Object Detection? Dense object detection refers to detecting and localizing objects by making predictions over a dense set of candidate locations, for example at every position of a feature map, as in one-stage detectors such as RetinaNet and FCOS. The term is also used for detection in densely packed scenes, where many small, overlapping object instances must be separated.
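To make the first sense concrete, here is a minimal PyTorch sketch of a dense detection head that emits class scores and box offsets at every feature-map location; the channel sizes and anchor count are arbitrary illustrative choices, not tied to any paper below.

import torch
import torch.nn as nn

class DenseHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=80, num_anchors=9):
        super().__init__()
        # every spatial position predicts num_anchors class vectors and boxes
        self.cls = nn.Conv2d(in_channels, num_anchors * num_classes, 3, padding=1)
        self.reg = nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1)

    def forward(self, feat):                   # feat: (B, C, H, W)
        return self.cls(feat), self.reg(feat)  # per-location predictions

head = DenseHead()
scores, boxes = head(torch.randn(1, 256, 64, 64))
print(scores.shape, boxes.shape)  # torch.Size([1, 720, 64, 64]) torch.Size([1, 36, 64, 64])
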
Papers and Code
Apr 21, 2025
Abstract:Aerial object detection using unmanned aerial vehicles (UAVs) faces critical challenges including sub-10px targets, dense occlusions, and stringent computational constraints. Existing detectors struggle to balance accuracy and efficiency due to rigid receptive fields and redundant architectures. To address these limitations, we propose Variable Receptive Field DETR (VRF-DETR), a transformer-based detector incorporating three key components: 1) Multi-Scale Context Fusion (MSCF) module that dynamically recalibrates features through adaptive spatial attention and gated multi-scale fusion, 2) Gated Convolution (GConv) layer enabling parameter-efficient local-context modeling via depthwise separable operations and dynamic gating, and 3) Gated Multi-scale Fusion (GMCF) Bottleneck that hierarchically disentangles occluded objects through cascaded global-local interactions. Experiments on VisDrone2019 demonstrate VRF-DETR achieves 51.4% mAP50 and 31.8% mAP50:95 with only 13.5M parameters. This work establishes a new efficiency-accuracy Pareto frontier for UAV-based detection tasks.
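As a rough illustration of the kind of gated, depthwise-separable convolution the GConv layer describes, here is a hedged PyTorch sketch; it is not the authors' implementation, and the sigmoid gate, residual connection, and channel sizes are assumptions.

import torch
import torch.nn as nn

class GatedDWConv(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # depthwise + pointwise convs keep the parameter count low
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)
        # per-channel, per-location dynamic gate
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        local = self.pointwise(self.depthwise(x))  # local-context features
        return x + self.gate(x) * local            # dynamic gating + residual

x = torch.randn(1, 256, 40, 40)
print(GatedDWConv()(x).shape)  # torch.Size([1, 256, 40, 40])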

Apr 18, 2025
Abstract:Vision Transformer (ViT) has achieved remarkable results in object detection for synthetic aperture radar (SAR) images, owing to its exceptional ability to extract global features. However, it struggles with the extraction of multi-scale local features, leading to limited performance in detecting small targets, especially when they are densely arranged. Therefore, we propose the Density-Sensitive Vision Transformer with Adaptive Tokens (DenSe-AdViT) for dense SAR target detection. We design a Density-Aware Module (DAM) as a preliminary component that generates a density tensor based on the target distribution. It is guided by a carefully crafted objective metric, enabling precise and effective capture of the spatial distribution and density of objects. To integrate the multi-scale information enhanced by convolutional neural networks (CNNs) with the global features derived from the Transformer, the Density-Enhanced Fusion Module (DEFM) is proposed. It refines attention toward target regions with the assistance of the density mask and the multi-source features. Notably, our DenSe-AdViT achieves 79.8% mAP on the RSDD dataset and 92.5% on the SIVED dataset, both of which feature a large number of densely distributed vehicle targets.
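The density-tensor idea can be pictured with a short sketch: accumulate annotated target centers into a grid, smooth it, and use the result to bias attention toward crowded regions. This is only an interpretation of the DAM/DEFM description, not the paper's code; the Gaussian radius and the way the bias would be applied are assumptions.

import torch
import torch.nn.functional as F

def density_map(centers, hw, sigma=2.0):
    """centers: (N, 2) pixel coords (x, y); hw: (H, W). Returns (1, 1, H, W)."""
    H, W = hw
    m = torch.zeros(1, 1, H, W)
    for x, y in centers.long():                      # splat one count per center
        m[0, 0, y.clamp(0, H - 1), x.clamp(0, W - 1)] += 1.0
    k = int(6 * sigma) | 1                           # odd kernel size
    coords = torch.arange(k) - k // 2
    g = torch.exp(-coords**2 / (2 * sigma**2))
    g /= g.sum()
    m = F.conv2d(m, g.view(1, 1, 1, k), padding=(0, k // 2))  # separable blur
    m = F.conv2d(m, g.view(1, 1, k, 1), padding=(k // 2, 0))
    return m / (m.max() + 1e-6)

dens = density_map(torch.tensor([[10., 12.], [11., 14.], [50., 30.]]), (64, 64))
attn_bias = dens  # e.g., scale attention weights by (1 + attn_bias) in dense regions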

Apr 17, 2025
Abstract:We introduce Perception Encoder (PE), a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods: language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together with the core contrastive checkpoint, our PE family of models achieves state-of-the-art performance on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video Q&A; and spatial tasks such as detection, depth estimation, and tracking. To foster further research, we are releasing our models, code, and a novel dataset of synthetically and human-annotated videos.
* Initial Submission
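The contrastive vision-language objective the abstract relies on is the standard CLIP-style InfoNCE loss; a generic sketch follows (not the PE training code; the temperature value and embedding size are assumptions).

import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))        # matching pairs lie on the diagonal
    # symmetric cross-entropy over image-to-text and text-to-image directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))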

Apr 15, 2025
Abstract:Although fully-supervised oriented object detection has made significant progress in multimodal remote sensing image understanding, it comes at the cost of labor-intensive annotation. Recent studies have explored weakly and semi-supervised learning to alleviate this burden. However, these methods overlook the difficulties posed by dense annotations in complex remote sensing scenes. In this paper, we introduce a novel setting called sparsely annotated oriented object detection (SAOOD), which labels only a subset of instances, and propose a solution to its challenges. Specifically, we focus on two key issues in this setting: (1) sparse labeling leading to overfitting on limited foreground representations, and (2) unlabeled objects (false negatives) confusing feature learning. To this end, we propose S²Teacher, a novel method that progressively mines pseudo-labels for unlabeled objects, from easy to hard, to enhance foreground representations. Additionally, it reweights the loss of unlabeled objects to mitigate their impact during training. Extensive experiments demonstrate that S²Teacher not only significantly improves detector performance across different sparse annotation levels but also achieves near-fully-supervised performance on the DOTA dataset with only 10% of annotation instances, effectively balancing detection accuracy with annotation efficiency. The code will be made public.
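A toy sketch of easy-to-hard pseudo-label mining with confidence-based reweighting, in the spirit of the S²Teacher description; the linear threshold schedule and the weighting scheme are assumptions, not the released method.

import torch

def mine_pseudo_labels(teacher_scores, step, total_steps,
                       start_thr=0.9, end_thr=0.5):
    """teacher_scores: (N,) per-prediction confidences from the teacher model."""
    # threshold decays over training: only easy (high-confidence) pseudo-labels early,
    # progressively harder ones later
    thr = start_thr + (end_thr - start_thr) * (step / total_steps)
    keep = teacher_scores >= thr
    weights = teacher_scores[keep]   # down-weight less confident pseudo boxes in the loss
    return keep, weights

scores = torch.tensor([0.95, 0.7, 0.4, 0.85])
keep, w = mine_pseudo_labels(scores, step=0, total_steps=10000)
# early in training only the most confident prediction (0.95) becomes a pseudo-label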

Apr 09, 2025
Abstract:Robust grasping in cluttered environments remains an open challenge in robotics. While benchmark datasets have significantly advanced deep learning methods, they mainly focus on simplistic scenes with light occlusion and insufficient diversity, limiting their applicability to practical scenarios. We present GraspClutter6D, a large-scale real-world grasping dataset featuring: (1) 1,000 highly cluttered scenes with dense arrangements (14.1 objects/scene, 62.6% occlusion), (2) comprehensive coverage across 200 objects in 75 environment configurations (bins, shelves, and tables) captured using four RGB-D cameras from multiple viewpoints, and (3) rich annotations including 736K 6D object poses and 9.3B feasible robotic grasps for 52K RGB-D images. We benchmark state-of-the-art segmentation, object pose estimation, and grasp detection methods to provide key insights into the challenges of cluttered environments. Additionally, we validate the dataset's effectiveness as a training resource, demonstrating that grasping networks trained on GraspClutter6D significantly outperform those trained on existing datasets in both simulation and real-world experiments. The dataset, toolkit, and annotation tools are publicly available on our project website: https://sites.google.com/view/graspclutter6d.

Apr 07, 2025
Abstract:Reliable 3D object perception is essential in autonomous driving. Owing to its sensing capabilities in all weather conditions, 4D radar has recently received much attention. However, compared to LiDAR, 4D radar provides a much sparser point cloud. In this paper, we propose a 3D object detection method, termed ZFusion, which fuses the 4D radar and vision modalities. As the core of ZFusion, our proposed FP-DDCA (Feature Pyramid-Double Deformable Cross Attention) fuser effectively complements the (sparse) radar information with the (dense) vision information. Specifically, with a feature-pyramid structure, the FP-DDCA fuser packs Transformer blocks to interactively fuse multi-modal features at different scales, thus enhancing perception accuracy. In addition, we utilize the Depth-Context-Split view transformation module owing to the physical properties of 4D radar. Considering that 4D radar has a much lower cost than LiDAR, ZFusion is an attractive alternative to LiDAR-based methods. On typical traffic scenarios such as the VoD (View-of-Delft) dataset, experiments show that, with reasonable inference speed, ZFusion achieves state-of-the-art mAP (mean average precision) in the region of interest while remaining competitive over the entire area compared to the baseline methods, demonstrating performance close to LiDAR and greatly outperforming camera-only methods.
* CVPR 2025 WDFM-AD
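A simplified, single-scale version of the cross-modal fusion can be written with standard attention; the sketch below substitutes ordinary multi-head attention for the deformable cross attention used in FP-DDCA, and the token shapes are invented for illustration.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, radar_tokens, image_tokens):
        # radar queries attend to image keys/values, then residual + norm
        fused, _ = self.attn(radar_tokens, image_tokens, image_tokens)
        return self.norm(radar_tokens + fused)

radar = torch.randn(2, 1024, 128)   # (B, H*W, C) flattened BEV grid
image = torch.randn(2, 1024, 128)
print(CrossModalFusion()(radar, image).shape)  # torch.Size([2, 1024, 128])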

Apr 07, 2025
Abstract:Multimodal 3D object detection based on deep neural networks has made significant progress. However, it still faces challenges due to the misalignment of scale and spatial information between features extracted from 2D images and those derived from 3D point clouds. Existing methods usually aggregate multimodal features at a single stage. However, leveraging multi-stage cross-modal features is crucial for detecting objects of various scales. Therefore, these methods often struggle to integrate features across different scales and modalities effectively, thereby restricting detection accuracy. Additionally, the time-consuming Query-Key-Value-based (QKV-based) cross-attention operations often utilized in existing methods aid in reasoning about the location and existence of objects by capturing non-local contexts, but they tend to increase computational complexity. To address these challenges, we present SSLFusion, a novel Scale & Space Aligned Latent Fusion Model, consisting of a scale-aligned fusion strategy (SAF), a 3D-to-2D space alignment module (SAM), and a latent cross-modal fusion module (LFM). SAF mitigates scale misalignment between modalities by aggregating features from both images and point clouds across multiple levels. SAM reduces the inter-modal gap between features from images and point clouds by incorporating 3D coordinate information into 2D image features. LFM captures cross-modal non-local contexts in the latent space without QKV-based attention operations, thus mitigating computational complexity. Experiments on the KITTI and DENSE datasets demonstrate that SSLFusion outperforms state-of-the-art methods, obtaining an absolute gain of 2.15% in 3D AP over the state-of-the-art method GraphAlign on the moderate level of the KITTI test set.
* Accepted by AAAI 2025
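The "incorporate 3D coordinates into 2D image features" step can be pictured by projecting points with a pinhole camera model, sampling the image feature map at the projected pixels, and concatenating the 3D coordinates; this toy sketch is not the SSLFusion code, and the intrinsics and feature shapes are made up.

import torch
import torch.nn.functional as F

def sample_image_feats(points_xyz, K, feat, img_hw):
    """points_xyz: (N, 3) in camera frame; K: (3, 3) intrinsics;
    feat: (1, C, Hf, Wf) image features; img_hw: original image (H, W)."""
    uvw = points_xyz @ K.t()                         # pinhole projection
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)    # pixel coordinates
    H, W = img_hw
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(feat, grid.view(1, -1, 1, 2), align_corners=True)
    sampled = sampled.squeeze(0).squeeze(-1).t()     # (N, C) image features per point
    return torch.cat([sampled, points_xyz], dim=-1)  # 2D features + 3D coords

feat = torch.randn(1, 64, 48, 160)
K = torch.tensor([[720., 0., 640.], [0., 720., 180.], [0., 0., 1.]])
pts = torch.tensor([[2.0, 0.5, 10.0], [-1.0, 0.2, 15.0]])
print(sample_image_feats(pts, K, feat, (360, 1280)).shape)  # torch.Size([2, 67])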

Mar 30, 2025
Abstract:Semi-supervised semantic segmentation (SS-SS) aims to mitigate the heavy annotation burden of dense pixel labeling by leveraging abundant unlabeled images alongside a small labeled set. While current teacher-student consistency regularization methods achieve strong results, they often overlook a critical challenge: the precise delineation of object boundaries. In this paper, we propose BoundMatch, a novel multi-task SS-SS framework that explicitly integrates semantic boundary detection into the consistency regularization pipeline. Our core mechanism, Boundary Consistency Regularized Multi-Task Learning (BCRM), enforces prediction agreement between teacher and student models on both segmentation masks and detailed semantic boundaries. To further enhance performance and sharpen contours, BoundMatch incorporates two lightweight fusion modules: Boundary-Semantic Fusion (BSF) injects learned boundary cues into the segmentation decoder, while Spatial Gradient Fusion (SGF) refines boundary predictions using mask gradients, leading to higher-quality boundary pseudo-labels. This framework is built upon SAMTH, a strong teacher-student baseline featuring a Harmonious Batch Normalization (HBN) update strategy for improved stability. Extensive experiments on diverse datasets including Cityscapes, BDD100K, SYNTHIA, ADE20K, and Pascal VOC show that BoundMatch achieves competitive performance against state-of-the-art methods while significantly improving boundary-specific evaluation metrics. We also demonstrate its effectiveness in realistic large-scale unlabeled data scenarios and on lightweight architectures designed for mobile deployment.
* 15 pages, 7 figures
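A small sketch of the ingredient SGF builds on: derive a boundary target from a segmentation mask with a Sobel-like spatial gradient and enforce teacher-student agreement on it. This is illustrative only, not the BoundMatch implementation, and the MSE consistency term is an assumed choice.

import torch
import torch.nn.functional as F

def boundary_from_mask(mask):
    """mask: (B, 1, H, W) hard labels or soft probabilities."""
    mask = mask.float()
    kx = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]]])  # Sobel-x
    ky = kx.transpose(2, 3)                                               # Sobel-y
    gx = F.conv2d(mask, kx, padding=1)
    gy = F.conv2d(mask, ky, padding=1)
    return (gx.abs() + gy.abs()).clamp(max=1.0)   # close to 1 near label transitions

def boundary_consistency(student_mask, teacher_mask):
    # make the student agree with the teacher's boundary map
    return F.mse_loss(boundary_from_mask(student_mask),
                      boundary_from_mask(teacher_mask))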

Apr 01, 2025
Abstract:Background. Infectious diseases, particularly COVID-19, continue to be a significant global health issue. Although many countries have reduced or stopped large-scale testing measures, the detection of such diseases remains a priority. Objective. This study aims to develop a novel, lightweight deep neural network for efficient, accurate, and cost-effective detection of COVID-19 using nasal breathing audio data collected via smartphones. Methodology. Nasal breathing audio from 128 patients diagnosed with the Omicron variant was collected. Mel-Frequency Cepstral Coefficients (MFCCs), widely used features in speech and sound analysis, were employed to extract important characteristics from the audio signals. Additional feature selection was performed using Random Forest (RF) and Principal Component Analysis (PCA) for dimensionality reduction. A Dense-ReLU-Dropout model was trained with K-fold cross-validation (K=3), and performance metrics such as accuracy, precision, recall, and F1-score were used to evaluate the model. Results. The proposed model achieved 97% accuracy in detecting COVID-19 from nasal breathing sounds, outperforming state-of-the-art methods such as those by [23] and [13]. Our Dense-ReLU-Dropout model, using RF and PCA for feature selection, achieves high accuracy with greater computational efficiency compared to existing methods that require more complex models or larger datasets. Conclusion. The findings suggest that the proposed method holds significant potential for clinical implementation, advancing smartphone-based diagnostics of infectious diseases. The Dense-ReLU-Dropout model, combined with innovative feature processing techniques, offers a promising approach for efficient and accurate COVID-19 detection, showcasing the capabilities of mobile device-based diagnostics.
* 14 pages, 5 figures, 6 tables
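A rough end-to-end sketch of the described pipeline: MFCC features per recording, PCA for dimensionality reduction, and a small Dense-ReLU-Dropout classifier. The layer sizes, dropout rate, and PCA dimension are guesses rather than the paper's configuration, and the Random-Forest-based feature selection step is omitted here.

import librosa
import torch.nn as nn
from sklearn.decomposition import PCA

def extract_mfcc(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)          # one fixed-length feature vector per recording

# X: (num_recordings, n_mfcc) matrix built by applying extract_mfcc over the dataset
pca = PCA(n_components=8)             # reduce to the most informative directions

model = nn.Sequential(                # Dense -> ReLU -> Dropout -> Dense
    nn.Linear(8, 64), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(64, 2),                 # COVID-positive vs. negative
)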

Mar 31, 2025
Abstract:Wheat plays a critical role in global food security, making it one of the most extensively studied crops. Accurate identification and measurement of key characteristics of wheat heads are essential for breeders to select varieties for cross-breeding, with the goal of developing nutrient-dense, resilient, and sustainable cultivars. Traditionally, these measurements are performed manually, which is both time-consuming and inefficient. Advances in digital technologies have paved the way for automating this process. However, field conditions pose significant challenges, such as occlusion by leaves, overlapping wheat heads, varying lighting conditions, and motion blur. In this paper, we propose a novel data augmentation technique, BBoxCut, which uses random localized masking to simulate occlusions caused by leaves and neighboring wheat heads. We evaluated our approach using three state-of-the-art object detectors and observed mean average precision (mAP) gains of 2.76, 3.26, and 1.9 points for Faster R-CNN, FCOS, and DETR, respectively. Our augmentation technique led to significant improvements both qualitatively and quantitatively. The improvements were particularly evident in scenarios involving occluded wheat heads, demonstrating the robustness of our method in challenging field conditions.
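A quick sketch of random localized masking in the spirit of BBoxCut: zero out a patch inside a randomly chosen ground-truth box to simulate occlusion by leaves or neighboring heads. The patch-size range and placement are assumptions, not the paper's exact recipe.

import random
import numpy as np

def bbox_cut(image, boxes, max_frac=0.4):
    """image: (H, W, 3) uint8 array; boxes: list of (x1, y1, x2, y2)."""
    img = image.copy()
    x1, y1, x2, y2 = random.choice(boxes)         # pick one wheat-head box
    bw, bh = x2 - x1, y2 - y1
    mw = int(bw * random.uniform(0.1, max_frac))  # mask width as a fraction of the box
    mh = int(bh * random.uniform(0.1, max_frac))
    mx = random.randint(int(x1), max(int(x1), int(x2 - mw)))
    my = random.randint(int(y1), max(int(y1), int(y2 - mh)))
    img[my:my + mh, mx:mx + mw] = 0               # occlude part of the wheat head
    return img

aug = bbox_cut(np.zeros((512, 512, 3), dtype=np.uint8), [(100, 120, 180, 200)])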
