Abstract: Current LiDAR point cloud-based 3D single object tracking (SOT) methods typically rely on point-based representation networks. Despite their demonstrated success, such networks suffer from several fundamental problems: 1) they contain pooling operations to cope with inherently disordered point clouds, hindering the capture of the 3D spatial information that is useful for tracking, a regression task; 2) the adopted set abstraction operation struggles to handle density-inconsistent point clouds, further preventing 3D spatial information from being modeled. To solve these problems, we introduce a novel tracking framework, termed VoxelTrack. By voxelizing inherently disordered point clouds into 3D voxels and extracting their features via sparse convolution blocks, VoxelTrack effectively models precise and robust 3D spatial information, thereby guiding accurate position prediction for tracked objects. Moreover, VoxelTrack incorporates a dual-stream encoder with a cross-iterative feature fusion module to further exploit fine-grained 3D spatial information for tracking. Benefiting from the accurately modeled 3D spatial information, VoxelTrack simplifies the tracking pipeline to a single regression loss. Extensive experiments are conducted on three widely adopted datasets: KITTI, NuScenes, and the Waymo Open Dataset. The experimental results confirm that VoxelTrack achieves state-of-the-art performance (88.3%, 71.4%, and 63.6% mean precision on the three datasets, respectively) and outperforms existing trackers while running in real time at 36 FPS on a single TITAN RTX GPU. The source code and model will be released.
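To make the voxel-based pipeline concrete, the following is a minimal sketch (not the authors' code) of the idea the abstract describes: points are scattered into a 3D grid and features are extracted with 3D convolutions before a single regression head. The grid resolution, occupancy features, and the dense Conv3d stand-in for the paper's sparse convolution blocks are all illustrative assumptions.

```python
# Hedged sketch of voxelization + 3D conv feature extraction + single regression head.
# Grid size, occupancy features, and the dense Conv3d backbone are assumptions; the
# paper uses sparse convolution blocks and a dual-stream encoder not reproduced here.
import torch
import torch.nn as nn


def voxelize(points, grid_size=32, pc_range=2.0):
    """Scatter an (N, 3) point cloud into a binary occupancy grid.

    Points are assumed to lie in [-pc_range, pc_range] along each axis.
    """
    occ = torch.zeros(1, grid_size, grid_size, grid_size)
    idx = ((points + pc_range) / (2 * pc_range) * grid_size).long()
    idx = idx.clamp(0, grid_size - 1)
    occ[0, idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return occ  # (1, D, H, W)


class VoxelRegressor(nn.Module):
    """Toy stand-in: 3D conv backbone + one regression head for (x, y, z, yaw)."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(32, 4)  # single regression output, as in the abstract

    def forward(self, voxels):
        feat = self.backbone(voxels).flatten(1)
        return self.head(feat)


points = torch.rand(2048, 3) * 4 - 2           # fake point cloud in [-2, 2]^3
voxels = voxelize(points).unsqueeze(0)         # (B=1, 1, D, H, W)
pred = VoxelRegressor()(voxels)                # (1, 4) box offset / yaw
loss = nn.functional.smooth_l1_loss(pred, torch.zeros_like(pred))
```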
Abstract: Current 3D single object tracking methods primarily rely on the Siamese matching-based paradigm, which struggles with textureless and incomplete LiDAR point clouds. Conversely, the motion-centric paradigm avoids appearance matching, thus overcoming these issues. However, its complex multi-stage pipeline and the limited temporal modeling capability of a single-stream architecture constrain its potential. In this paper, we introduce SiamMo, a novel and simple Siamese motion-centric tracking approach. Unlike the traditional single-stream architecture, we employ Siamese feature extraction for motion-centric tracking. This decouples feature extraction from temporal fusion, significantly enhancing tracking performance. Additionally, we design a Spatio-Temporal Feature Aggregation module to integrate Siamese features at multiple scales, capturing motion information effectively. We also introduce a Box-aware Feature Encoding module to encode object size priors into motion estimation. SiamMo is a purely motion-centric tracker that eliminates the need for additional processes such as segmentation and box refinement. Without bells and whistles, SiamMo not only surpasses state-of-the-art methods across multiple benchmarks but also demonstrates exceptional robustness in challenging scenarios. SiamMo sets a new record on the KITTI tracking benchmark with 90.1% precision while maintaining a high inference speed of 108 FPS. The code will be released at https://github.com/HDU-VRLab/SiamMo.
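As a rough illustration, the sketch below shows the Siamese motion-centric idea from the abstract: a shared encoder processes two consecutive frames, their features are fused, a size prior is injected, and relative motion is regressed. This is not the SiamMo implementation; the BEV-style 2D encoder, the concatenation-based fusion standing in for Spatio-Temporal Feature Aggregation, and the linear size embedding standing in for Box-aware Feature Encoding are assumptions for illustration only.

```python
# Hedged sketch of Siamese motion-centric tracking: shared encoder on two frames,
# fused features + size prior -> regressed relative motion. All module designs here
# are illustrative stand-ins, not the paper's architecture.
import torch
import torch.nn as nn


class SiameseMotionTracker(nn.Module):
    def __init__(self, in_ch=1, feat_ch=32):
        super().__init__()
        # Shared (Siamese) feature extractor applied to both frames.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Stand-in for box-aware encoding: embed the target's (w, l, h) prior.
        self.size_embed = nn.Linear(3, feat_ch)
        # Stand-in for spatio-temporal aggregation: fuse the two frames' features.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.motion_head = nn.Linear(2 * feat_ch, 4)  # (dx, dy, dz, dyaw)

    def forward(self, prev_bev, curr_bev, box_size):
        f_prev = self.encoder(prev_bev)                     # weights shared across frames
        f_curr = self.encoder(curr_bev)
        fused = self.fuse(torch.cat([f_prev, f_curr], dim=1)).flatten(1)
        size_feat = self.size_embed(box_size)
        return self.motion_head(torch.cat([fused, size_feat], dim=1))


prev_bev = torch.rand(1, 1, 64, 64)          # fake BEV occupancy of the previous frame
curr_bev = torch.rand(1, 1, 64, 64)          # fake BEV occupancy of the current frame
box_size = torch.tensor([[1.8, 4.5, 1.6]])   # target (w, l, h) prior in metres
motion = SiameseMotionTracker()(prev_bev, curr_bev, box_size)  # (1, 4) relative motion
```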