Abstract:With the rapid development of large vision language models (LVLMs), these models have shown excellent results in various multimodal tasks. Since LVLMs are prone to hallucinations and there are currently few datasets and evaluation methods specifically designed for remote sensing, their performance is typically poor when applied to remote sensing tasks. To address these issues, this paper introduces a high quality remote sensing LVLMs dataset, DDFAV, created using data augmentation and data mixing strategies. Next, a training instruction set is produced based on some high-quality remote sensing images selected from the proposed dataset. Finally, we develop a remote sensing LVLMs hallucination evaluation method RSPOPE based on the proposed dataset and evaluate the zero-shot capabilities of different LVLMs. Our proposed dataset, instruction set, and evaluation method files are available at https://github.com/HaodongLi2024/rspope.
Abstract:The detection of small objects in aerial images is a fundamental task in the field of computer vision. Moving objects in aerial photography have problems such as different shapes and sizes, dense overlap, occlusion by the background, and object blur, however, the original YOLO algorithm has low overall detection accuracy due to its weak ability to perceive targets of different scales. In order to improve the detection accuracy of densely overlapping small targets and fuzzy targets, this paper proposes a dynamic-attention scale-sequence fusion algorithm (DASSF) for small target detection in aerial images. First, we propose a dynamic scale sequence feature fusion (DSSFF) module that improves the up-sampling mechanism and reduces computational load. Secondly, a x-small object detection head is specially added to enhance the detection capability of small targets. Finally, in order to improve the expressive ability of targets of different types and sizes, we use the dynamic head (DyHead). The model we proposed solves the problem of small target detection in aerial images and can be applied to multiple different versions of the YOLO algorithm, which is universal. Experimental results show that when the DASSF method is applied to YOLOv8, compared to YOLOv8n, on the VisDrone-2019 and DIOR datasets, the model shows an increase of 9.2% and 2.4% in the mean average precision (mAP), respectively, and outperforms the current mainstream methods.
Abstract:Recent advancements in neuroscience research have propelled the development of Spiking Neural Networks (SNNs), which not only have the potential to further advance neuroscience research but also serve as an energy-efficient alternative to Artificial Neural Networks (ANNs) due to their spike-driven characteristics. However, previous studies often neglected the multiscale information and its spatiotemporal correlation between event data, leading SNN models to approximate each frame of input events as static images. We hypothesize that this oversimplification significantly contributes to the performance gap between SNNs and traditional ANNs. To address this issue, we have designed a Spiking Multiscale Attention (SMA) module that captures multiscale spatiotemporal interaction information. Furthermore, we developed a regularization method named Attention ZoneOut (AZO), which utilizes spatiotemporal attention weights to reduce the model's generalization error through pseudo-ensemble training. Our approach has achieved state-of-the-art results on mainstream neural morphology datasets. Additionally, we have reached a performance of 77.1% on the Imagenet-1K dataset using a 104-layer ResNet architecture enhanced with SMA and AZO. This achievement confirms the state-of-the-art performance of SNNs with non-transformer architectures and underscores the effectiveness of our method in bridging the performance gap between SNN models and traditional ANN models.
Abstract:Spiking Neural Networks (SNNs) have garnered substantial attention in brain-like computing for their biological fidelity and the capacity to execute energy-efficient spike-driven operations. As the demand for heightened performance in SNNs surges, the trend towards training deeper networks becomes imperative, while residual learning stands as a pivotal method for training deep neural networks. In our investigation, we identified that the SEW-ResNet, a prominent representative of deep residual spiking neural networks, incorporates non-event-driven operations. To rectify this, we introduce the OR Residual connection (ORRC) to the architecture. Additionally, we propose the Synergistic Attention (SynA) module, an amalgamation of the Inhibitory Attention (IA) module and the Multi-dimensional Attention (MA) module, to offset energy loss stemming from high quantization. When integrating SynA into the network, we observed the phenomenon of "natural pruning", where after training, some or all of the shortcuts in the network naturally drop out without affecting the model's classification accuracy. This significantly reduces computational overhead and makes it more suitable for deployment on edge devices. Experimental results on various public datasets confirmed that the SynA enhanced OR-Spiking ResNet achieved single-sample classification with as little as 0.8 spikes per neuron. Moreover, when compared to other spike residual models, it exhibited higher accuracy and lower power consumption. Codes are available at https://github.com/Ym-Shan/ORRC-SynA-natural-pruning.