Object detection is a computer vision task in which the goal is to detect and locate objects of interest in an image or video. The task involves identifying the position and boundaries of objects in an image, and classifying the objects into different categories. It forms a crucial part of vision recognition, alongside image classification and retrieval.
Deep learning-based object detectors have achieved remarkable success across numerous computer vision applications, yet they continue to struggle with small object detection in high-resolution aerial and satellite imagery, where dense object distributions, variable shooting angles, diminutive target sizes, and substantial inter-class variability pose formidable challenges. Existing slicing strategies that partition high-resolution images into manageable patches have demonstrated promising results for enlarging the effective receptive field of small targets; however, their reliance on fixed slice dimensions introduces significant redundant computation, inflating inference cost and undermining detection speed. In this paper, we propose \textbf{Adaptive Slicing-Assisted Hyper Inference (ASAHI)}, a novel slicing framework that shifts the paradigm from prescribing a fixed slice size to adaptively determining the optimal number of slices according to image resolution, thereby substantially mitigating redundant computation while preserving beneficial overlap between adjacent patches. ASAHI integrates three synergistic components: (1)an adaptive resolution-aware slicing algorithm that dynamically generates 6 or 12 overlapping patches based on a learned threshold, (2)a slicing-assisted fine-tuning (SAF) strategy that constructs augmented training data comprising both full-resolution and sliced image patches, and (3)a Cluster-DIoU-NMS (CDN) post-processing module that combines the geometric merging efficiency of Cluster-NMS with the center-distance-aware suppression of DIoU-NMS to achieve robust duplicate elimination in crowded scenes. Extensive experiments on VisDrone2019 and xView, demonstrate that ASAHI achieves state-of-the-art performance with 56.8% on VisDrone2019-DET-val and 22.7% on xView-test, while reducing inference time by 20-25% compared to the baseline SAHI method.
Large vision-language models (LVLMs) have demonstrated impressive performance in various multimodal understanding and reasoning tasks. However, they still struggle with object hallucinations, i.e., the claim of nonexistent objects in the visual input. To address this challenge, we propose Region-aware Chain-of-Verification (R-CoV), a visual chain-of-verification method to alleviate object hallucinations in LVLMs in a post-hoc manner. Motivated by how humans comprehend intricate visual information -- often focusing on specific image regions or details within a given sample -- we elicit such region-level processing from LVLMs themselves and use it as a chaining cue to detect and alleviate their own object hallucinations. Specifically, our R-CoV consists of six steps: initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation. As a simple yet effective method, R-CoV can be seamlessly integrated into various LVLMs in a training-free manner and without relying on external detection models. Extensive experiments on several widely used hallucination benchmarks across multiple LVLMs demonstrate that R-CoV can significantly alleviate object hallucinations in LVLMs. Project page: https://github.com/Jiahao000/R-CoV.
Deep learning has emerged as a powerful approach for malware detection, demonstrating impressive accuracy across various data representations. However, these models face critical limitations in real-world, non-stationary environments where both malware characteristics and detection systems continuously evolve. Our research investigates a fundamental security question: Can an attacker generate adversarial malware samples that simultaneously evade classification and remain inconspicuous to drift monitoring mechanisms? We propose a novel approach that generates targeted adversarial examples in the classifier's standardized feature space, augmented with sophisticated similarity regularizers. By carefully constraining perturbations to maintain distributional similarity with clean malware, we create an optimization objective that balances targeted misclassification with drift signal minimization. We quantify the effectiveness of this approach by comprehensively comparing classifier output probabilities using multiple drift metrics. Our experiments demonstrate that similarity constraints can reduce output drift signals, with $\ell_2$ regularization showing the most promising results. We observe that perturbation budget significantly influences the evasion-detectability trade-off, with increased budget leading to higher attack success rates and more substantial drift indicators.
Mixture-of-Experts (MoE) models provide a structured approach to combining specialized neural networks and offer greater interpretability than conventional ensembles. While MoEs have been successfully applied to image classification and semantic segmentation, their use in object detection remains limited due to challenges in merging dense and structured predictions. In this work, we investigate model-level mixtures of object detectors and analyze their suitability for improving performance and interpretability in object detection. We propose an MoE architecture that combines YOLO-based detectors trained on semantically disjoint data subsets, with a learned gating network that dynamically weights expert contributions. We study different strategies for fusing detection outputs and for training the gating mechanism, including balancing losses to prevent expert collapse. Experiments on the BDD100K dataset demonstrate that the proposed MoE consistently outperforms standard ensemble approaches and provides insights into expert specialization across domains, highlighting model-level MoEs as a viable alternative to traditional ensembling for object detection. Our code is available at https://github.com/KASTEL-MobilityLab/mixtures-of-experts/.
In natural images, object skeletons are used to represent geometric shapes. However, even slight variations in pose or movement can cause noticeable changes in skeleton structure, increasing the difficulty of detecting the skeleton and often resulting in discontinuous skeletons. Existing methods primarily focus on point-level skeleton point detection and overlook the importance of structural continuity in recovering complete skeletons. To address this issue, we propose Lighthouse-Skel, a topology-aware skeleton detection method via lighthouse-guided structured inference. Specifically, we introduce a dual-branch collaborative detection framework that jointly learns skeleton confidence field and structural anchors, including endpoints and junction points. The spatial distributions learned by the point branch guide the network to focus on topologically vulnerable regions, which improves the accuracy of skeleton detection. Based on the learned skeleton confidence field, we further propose a lighthouse-guided topology completion strategy, which uses detected junction points and breakpoints as lighthouses to reconnect discontinuous skeleton segments along low-cost paths, thereby improving skeleton continuity and structural integrity. Experimental results on four public datasets demonstrate that the proposed method achieves competitive detection accuracy while substantially improving skeleton connectivity and structural integrity.
Referring detection refers to locate the target referred by natural languages, which has recently attracted growing research interests. However, existing datasets are limited to ground images with large object centered in relative small scenes. This paper introduces a large-scale challenging dataset for referring detection in aerial images, termed as RefAerial. It distinguishes from conventional ground referring detection datasets by 4 characteristics: (1) low but diverse object-to-scene ratios, (2) numerous targets and distractors, (3)complex and fine-grained referring descriptions, (4) diverse and broad scenes in the aerial view. We also develop a human-in-the-loop referring expansion and annotation engine (REA-Engine) for efficient semi-automated referring pair annotation. Besides, we observe that existing ground referring detection approaches exhibiting serious performance degradation on our aerial dataset since the intrinsic scale variety issue within or across aerial images. Therefore, we further propose a novel scale-comprehensive and sensitive (SCS) framework for referring detection in aerial images. It consists of a mixture-of-granularity (MoG) attention and a two-stage comprehensive-to-sensitive (CtS) decoding strategy. Specifically, the mixture-of-granularity attention is developed for scale-comprehensive target understanding. In addition, the two-stage comprehensive-to-sensitive decoding strategy is designed for coarse-to-fine referring target decoding. Eventually, the proposed SCS framework achieves remarkable performance on our aerial referring detection dataset and even promising performance boost on conventional ground referring detection datasets.
Camouflaged Object Detection is challenging due to the high degree of similarity between camouflaged objects and their surrounding backgrounds. Current COD methods mainly rely on edge extraction in the spatial domain and local pixel-level information, neglecting the importance of global structural features. Additionally, they fail to effectively leverage the importance of phase spectrum information within frequency domain features. To this end, we propose a COD framework BASFNet based on boundary-aware frequency domain and spatial domain fusion.This method uses dual guided integration of frequency domain and spatial domain features. A phase-spectrum-based frequency-enhanced edge exploration module (FEEM) and a spatial core segmentation module (SCSM) are introduced to jointly capture the boundary and object features of camouflaged objects. These features are then effectively integrated through a spatial-frequency fusion interaction module (SFFIM). Furthermore, the boundary detection is further optimized through an boundary-aware training strategy. BASFNet outperforms existing state-of-the-art methods on three benchmark datasets, validating the effectiveness of the fusion of frequency and spatial domain information in COD tasks.
Vision Transformers $(\texttt{ViT})$ have become the architecture of choice for many computer vision tasks, yet their performance in computer-aided diagnostics remains limited. Focusing on breast cancer detection from mammograms, we identify two main causes for this shortfall. First, medical images are high-resolution with small abnormalities, leading to an excessive number of tokens and making it difficult for the softmax-based attention to localize and attend to relevant regions. Second, medical image classification is inherently fine-grained, with low inter-class and high intra-class variability, where standard cross-entropy training is insufficient. To overcome these challenges, we propose a framework with three key components: (1) Region of interest $(\texttt{RoI})$ based token reduction using an object detection model to guide attention; (2) contrastive learning between selected $\texttt{RoI}$ to enhance fine-grained discrimination through hard-negative based training; and (3) a $\texttt{DINOv2}$ pretrained $\texttt{ViT}$ that captures localization-aware, fine-grained features instead of global $\texttt{CLIP}$ representations. Experiments on public mammography datasets demonstrate that our method achieves superior performance over existing baselines, establishing its effectiveness and potential clinical utility for large-scale breast cancer screening. Our code is available for reproducibility here: https://aih-iitd.github.io/publications/attend-what-matters
Despite advances in object detection, aerial imagery remains a challenging domain, as models often fail to generalize across variations in spatial resolution, scene composition, and semantic label coverage. Differences in geographic context, sensor characteristics, and object distributions across datasets limit the capacity of conventional models to learn consistent and transferable representations. Shared methods trained on such data tend to impose a unified representation across fundamentally different domains, resulting in poor performance on region-specific content and less flexibility when dealing with novel object categories. To address this, we propose a novel modular learning framework that enables structured specialization in aerial detection. Our method introduces a hierarchical routing mechanism with two levels of modularity: a global expert assignment layer that uses latent geographic embeddings to route datasets to specialized processing modules, and a local scene decomposition mechanism that allocates image subregions to region-specific sub-modules. This allows our method to specialize across datasets and within complex scenes. Additionally, the framework contains a conditional expert module that uses external semantic information (e.g., category names or textual descriptions) to enable detection of novel object categories during inference, without the need for retraining or fine-tuning. By moving beyond monolithic representations, our method offers an adaptive framework for remote sensing object detection. Comprehensive evaluations on four datasets highlight improvements in multi-dataset generalization, regional specialization, and open-category detection.
Diffusion-based image synthesis has emerged as a promising source of synthetic training data for AI-based object detection and classification. In this work, we investigate whether images generated with diffusion can improve military vehicle detection under low-data conditions. We fine-tuned the text-to-image diffusion model FLUX.1 [dev] using LoRA with only 8 or 24 real images per class across 15 vehicle categories, resulting in class-specific diffusion models, which were used to generate new samples from automatically generated text prompts. The same real images were used to fine-tune the RF-DETR detector for a 15-class object detection task. Synthetic datasets generated by the diffusion models were then used to further improve detector performance. Importantly, no additional real data was required, as the generative models leveraged the same limited training samples. FLUX-generated images improved detection performance, particularly in the low-data regime (up to +8.0% mAP$_{50}$ with 8 real samples). To address the limited geometric control of text prompt-based diffusion, we additionally generated structurally guided synthetic data using ControlNet with Canny edge-map conditioning, yielding a FLUX-ControlNet (FLUX-CN) dataset with explicit control over viewpoint and pose. Structural guidance further enhanced performance when data is scarce (+4.1% mAP$_{50}$ with 8 real samples), but no additional benefit was observed when more real data is available. This study demonstrates that object-specific diffusion models are effective for improving military object detection in a low-data domain, and that structural guidance is most beneficial when real data is highly limited. These results highlight generative image data as an alternative to traditional simulation pipelines for the training of military AI systems.