Abstract: In video sequences, the disparity of consecutive frames is highly similar, so the region where disparity changes can be defined and computed as a residual map. Based on this observation, we propose RecSM, a residual-estimation network with a flexible recursive structure for video stereo matching. RecSM accelerates stereo matching with a Multi-scale Residual Estimation Module (MREM), which takes the temporal context as a reference and rapidly estimates the disparity of the current frame by computing only the residual between the current and previous frames. To further reduce the error of the estimated disparities, we use a Disparity Optimization Module (DOM) and a Temporal Attention Module (TAM) to enforce constraints between modules; together with MREM, these form a flexible Stackable Computation Structure (SCS), whose stack count can be chosen to suit the practical scenario. Experimental results demonstrate that with a stack count of 3, RecSM runs 4x faster than ACVNet, taking 0.054 seconds per frame on a single NVIDIA RTX 2080Ti GPU, with an accuracy decrease of only 0.7%. Code is available at https://github.com/Y0uchenZ/RecSM.
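To make the residual idea concrete, below is a minimal PyTorch sketch of residual-based disparity estimation across frames. The module name, layer sizes, and feature dimensions are illustrative assumptions, not the actual RecSM/MREM architecture: the point is only that, given the previous frame's disparity as temporal context, the network regresses a residual map and adds it back, rather than matching from scratch.

```python
import torch
import torch.nn as nn

class ResidualDisparityHead(nn.Module):
    """Hypothetical sketch of residual disparity estimation (not RecSM's
    actual MREM): predict only the change in disparity between frames."""

    def __init__(self, feat_channels: int = 32):
        super().__init__()
        # Small conv stack that regresses the disparity residual from the
        # current-frame features concatenated with the previous disparity.
        self.residual_net = nn.Sequential(
            nn.Conv2d(feat_channels + 1, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),  # 1-channel residual map
        )

    def forward(self, cur_feat: torch.Tensor, prev_disp: torch.Tensor) -> torch.Tensor:
        # prev_disp: (B, 1, H, W) disparity of the previous frame, reused
        # as temporal context; only the residual is estimated here.
        residual = self.residual_net(torch.cat([cur_feat, prev_disp], dim=1))
        return prev_disp + residual


if __name__ == "__main__":
    head = ResidualDisparityHead(feat_channels=32)
    cur_feat = torch.randn(1, 32, 64, 128)  # current-frame features
    prev_disp = torch.rand(1, 1, 64, 128)   # previous-frame disparity
    print(head(cur_feat, prev_disp).shape)  # torch.Size([1, 1, 64, 128])
```

Because the residual head is far shallower than a full cost-volume pipeline, per-frame cost drops sharply; stacking several such units (as SCS does with DOM and TAM constraints) trades compute for accuracy.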
Abstract: The human visual system (HVS) is believed to perform salient object detection (SOD) through a pre-attentive process followed by an attentive process. Based on this fact, we propose a four-stage framework for SOD, named \textbf{PAANet}: the first two stages match the \textbf{P}re-\textbf{A}ttentive process, consisting of general feature extraction (GFE) and feature preprocessing (FP), and the last two stages correspond to the \textbf{A}ttention process, containing saliency feature extraction (SFE) and feature aggregation (FA). Mirroring the pre-attentive process, the GFE stage applies a fully trained backbone and needs no further fine-tuning for different datasets, which greatly increases training speed. The FP stage plays the fine-tuning role but works more efficiently owing to its simpler structure and fewer parameters. Moreover, in the SFE stage we design a novel contrast operator for saliency feature extraction, which captures the interaction between the foreground and its surroundings more semantically than the traditional convolution operator. Interestingly, this contrast operator can be cascaded to form a deeper structure that extracts higher-order saliency more effectively in complex scenes. Comparative experiments against state-of-the-art methods on five datasets demonstrate the effectiveness of our framework.
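As a rough illustration of what a cascadable contrast operator could look like, here is a minimal PyTorch sketch under stated assumptions. It is not PAANet's actual operator: it simply contrasts each location's feature (center) against a local average of its surroundings, and stacking the block yields higher-order contrast. The class name, surround size, and 1x1 projection are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastBlock(nn.Module):
    """Hypothetical center-surround contrast operator (not PAANet's exact
    definition): foreground/surround interaction as a feature difference."""

    def __init__(self, channels: int, surround_size: int = 7):
        super().__init__()
        self.surround_size = surround_size
        # 1x1 conv to remix the contrast response before the next block.
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Surround estimate: local average pooling at stride 1, same size.
        surround = F.avg_pool2d(
            x, self.surround_size, stride=1, padding=self.surround_size // 2
        )
        # Contrast = center feature minus its surrounding context.
        contrast = x - surround
        return F.relu(self.proj(contrast))


if __name__ == "__main__":
    # Two cascaded blocks extract second-order contrast, as the abstract
    # suggests deeper stacks do for complex scenes.
    blocks = nn.Sequential(ContrastBlock(32), ContrastBlock(32))
    feats = torch.randn(1, 32, 56, 56)
    print(blocks(feats).shape)  # torch.Size([1, 32, 56, 56])
```

Unlike a plain convolution, which mixes center and surround into one weighted sum, an explicit difference keeps the foreground-versus-context relation separable, which is the intuition behind contrast-style saliency operators.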