Abstract:Unmanned aerial vehicle (UAV) remote sensing is widely applied in fields such as emergency response, owing to its advantages of rapid information acquisition and low cost. However, due to the effects of shooting distance and imaging mechanisms, the objects in the images present challenges such as small size, dense distribution, and low inter-class differentiation. To this end, we propose a multimodal remote sensing detection network that employs a quad-directional selective scanning fusion strategy called RemoteDet-Mamba. RemoteDet-Mamba simultaneously facilitates the learning of single-modal local features and the integration of patch-level global features across modalities, enhancing the distinguishability for small objects and utilizing local information to improve discrimination between different classes. Additionally, the use of Mamba's serial processing significantly increases detection speed. Experimental results on the DroneVehicle dataset demonstrate the effectiveness of RemoteDet-Mamba, which achieves superior detection accuracy compared to state-of-the-art methods while maintaining computational efficiency and parameter count.