Abstract: Video anomaly detection (VAD) has been extensively researched due to its potential for intelligent video systems. However, most existing methods based on CNNs and transformers still carry substantial computational burdens and leave room for improvement in learning spatial-temporal normality. Recently, Mamba has shown great potential for modeling long-range dependencies with linear complexity, offering a way to address both limitations. To this end, we propose a lightweight and effective Mamba-based network named STNMamba, which incorporates carefully designed Mamba modules to enhance the learning of spatial-temporal normality. First, we develop a dual-encoder architecture, in which the spatial encoder, equipped with Multi-Scale Vision Space State Blocks (MS-VSSB), extracts multi-scale appearance features, and the temporal encoder employs Channel-Aware Vision Space State Blocks (CA-VSSB) to capture significant motion patterns. Second, a Spatial-Temporal Interaction Module (STIM) is introduced to integrate spatial and temporal information across multiple levels, enabling effective modeling of intrinsic spatial-temporal consistency. Within this module, the Spatial-Temporal Fusion Block (STFB) fuses the spatial and temporal features into a unified feature space, and a memory bank stores spatial-temporal prototypes of normal patterns, restricting the model's ability to represent anomalies. Extensive experiments on three benchmark datasets demonstrate that STNMamba achieves competitive performance with fewer parameters and lower computational costs than existing methods.
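The memory-bank idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the STNMamba implementation: the function names (`cosine`, `memory_read`, `l2_distance`), the softmax temperature, and the two-dimensional toy prototypes are all assumptions chosen for illustration. The sketch shows how reading a feature back as a weighted mix of stored normal prototypes tends to reproduce normal features well and anomalous ones poorly.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def memory_read(query, prototypes, temperature=0.1):
    """Return (read_vector, weights) for a query against stored prototypes.

    Hypothetical read rule: softmax over cosine similarities, then a
    weighted sum of prototypes. The temperature value is an assumption.
    """
    sims = [cosine(query, p) / temperature for p in prototypes]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(prototypes[0])
    read = [sum(w * p[i] for w, p in zip(weights, prototypes)) for i in range(dim)]
    return read, weights


def l2_distance(a, b):
    """Euclidean distance, used here as a toy reconstruction error."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy prototypes standing in for learned normal patterns (made-up values).
prototypes = [[1.0, 0.0], [0.0, 1.0]]
normal_query = [0.9, 0.1]       # close to the first prototype
anomalous_query = [-1.0, -1.0]  # unlike any stored prototype

normal_read, normal_weights = memory_read(normal_query, prototypes)
anomalous_read, _ = memory_read(anomalous_query, prototypes)
normal_error = l2_distance(normal_query, normal_read)
anomalous_error = l2_distance(anomalous_query, anomalous_read)
```

Because the read vector is always a convex combination of normal prototypes, a query far from every prototype cannot be reproduced closely, which is one way to read the abstract's phrase "restricting the model's ability to represent anomalies".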
Abstract: Video anomaly detection (VAD) is an essential yet challenging task in signal processing. Since certain anomalies cannot be detected by analyzing temporal or spatial information alone, the interaction between the two types of information is considered crucial for VAD. However, current dual-stream architectures either limit this interaction to the bottleneck of the autoencoder or incorporate background pixels irrelevant to anomalies into the interaction. To this end, we propose a multi-scale spatial-temporal interaction network (MSTI-Net) for VAD. First, to pay particular attention to objects and reconcile the significant semantic differences between the two types of information, we propose an attention-based spatial-temporal fusion module (ASTM) as a substitute for conventional direct fusion. Furthermore, we inject multiple ASTM-based connections between the appearance and motion pathways of a dual-stream network to facilitate spatial-temporal interaction at all possible scales. Finally, the regular information learned from multiple scales is recorded in memory to enhance the differentiation between anomalies and normal events during the testing phase. Experimental results on three standard datasets validate the effectiveness of our approach, which achieves AUCs of 96.8% on UCSD Ped2, 87.6% on CUHK Avenue, and 73.9% on ShanghaiTech.
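For readers unfamiliar with the AUC metric quoted in the abstract above, the following is a minimal sketch of how an area under the ROC curve can be computed from per-frame labels and anomaly scores. The function name `roc_auc` and the toy labels and scores are assumptions for illustration only; they are not data from MSTI-Net, and the paper's exact evaluation protocol is not specified here.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for anomalous frames, 0 for normal frames.
    scores: higher means more anomalous.
    Equals the probability that a random anomalous frame scores higher
    than a random normal frame, with ties counted as one half.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one frame of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example with made-up values: two normal and two anomalous frames.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)  # 0.75
```

The pairwise form is quadratic in the number of frames, so it suits small illustrations; a sort-based computation is the usual choice for full benchmark videos.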