Abstract: Vision cameras and sonar are naturally complementary in the underwater environment. Combining the information from these two modalities promotes better observation of underwater targets. However, this problem has not received sufficient attention in previous research. Therefore, this paper introduces a new and challenging RGB-Sonar (RGB-S) tracking task and investigates how to achieve efficient tracking of an underwater target through the interaction of the RGB and sonar modalities. Specifically, we first propose the RGBS50 benchmark dataset, containing 50 sequences and more than 87,000 high-quality annotated bounding boxes. Experimental results show that the RGBS50 benchmark poses a challenge to currently popular SOT trackers. Second, we propose an RGB-S tracker called SCANet, which includes a spatial cross-attention module (SCAM) consisting of a novel spatial cross-attention layer and two independent global integration modules. The spatial cross-attention is used to overcome the spatial misalignment between RGB and sonar images. Third, we propose a SOT-data-based RGB-S simulation training method (SRST) to overcome the lack of RGB-S training datasets. It converts RGB images into sonar-like saliency images to construct pseudo-data pairs, enabling the model to learn the semantic structure of RGB-S-like data. Comprehensive experiments show that the proposed spatial cross-attention effectively achieves interaction between the RGB and sonar modalities, and that SCANet achieves state-of-the-art performance on the proposed benchmark. The code is available at https://github.com/LiYunfengLYF/RGBS50.
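The abstract gives no implementation details of the spatial cross-attention layer, so the following is only a minimal sketch of cross-attention between two spatially misaligned modalities, assuming flattened spatial tokens per modality and PyTorch's nn.MultiheadAttention; all module and parameter names are illustrative and not taken from SCANet's actual code.

```python
# Minimal sketch: each modality queries the other so that features can be
# matched across spatially misaligned RGB and sonar views.
# Names (SpatialCrossAttention, attn_rgb, attn_sonar) are hypothetical.
import torch
import torch.nn as nn

class SpatialCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_sonar = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, rgb: torch.Tensor, sonar: torch.Tensor):
        # rgb, sonar: (B, N, C) flattened spatial tokens from each modality.
        # RGB tokens attend to sonar tokens and vice versa; residual
        # connections keep each modality's original information.
        rgb_out, _ = self.attn_rgb(query=rgb, key=sonar, value=sonar)
        sonar_out, _ = self.attn_sonar(query=sonar, key=rgb, value=rgb)
        return rgb + rgb_out, sonar + sonar_out

# Usage with 16x16 feature maps of 256 channels, flattened to 256 tokens:
rgb_feat = torch.randn(2, 256, 256)    # (B, H*W, C)
sonar_feat = torch.randn(2, 256, 256)
f_rgb, f_sonar = SpatialCrossAttention(256)(rgb_feat, sonar_feat)
```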
Abstract: Bipartite graph hashing (BGH) is extensively used for Top-K search in Hamming space at low storage and inference costs. Recent research adopts graph convolutional hashing for BGH and has achieved state-of-the-art performance. However, the contributions of its various influencing factors to hashing performance have not been explored in depth, including the same/different sign count between two binary embeddings during Hamming space search (sign property), the contribution of the sub-embeddings at each layer (model property), the contribution of the different node types in the bipartite graph (node property), and the combination of augmentation methods. In this work, we build a lightweight graph convolutional hashing model named LightGCH, mainly by removing the augmentation methods of the state-of-the-art model BGCH. By analyzing the contributions of each layer and node type to performance, as well as the Hamming similarity statistics at each layer, we find that in LightGCH the actual neighbors in the bipartite graph tend to have low Hamming similarity at the shallow layer, whereas all nodes tend to have high Hamming similarity at the deep layers. To tackle these problems, we propose a novel sign-guided framework, SGBGH, which uses sign-guided negative sampling to improve the Hamming similarity of neighbors and sign-aware contrastive learning to help nodes learn more uniform representations. Experimental results show that SGBGH significantly outperforms both BGCH and LightGCH in embedding quality.
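To make the "sign property" concrete, here is a minimal sketch of Hamming similarity computed from the matching-sign count between two binary embeddings, which is the quantity the analysis above tracks per layer; function and variable names are hypothetical and not taken from the BGCH or SGBGH codebases.

```python
# Minimal sketch of the sign property: Hamming similarity between two binary
# embeddings is determined by how many coordinates share the same sign.
import torch

def hamming_similarity(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a, b: (N, D) real-valued embeddings, binarized to {-1, +1} via sign()
    a_bin, b_bin = torch.sign(a), torch.sign(b)
    same_sign = (a_bin * b_bin > 0).float().sum(dim=-1)  # matching coordinates
    return same_sign / a.shape[-1]                       # normalized to [0, 1]

# Actual neighbors in the bipartite graph should score high here; a
# sign-guided negative sampler would instead prefer non-neighbors whose
# similarity to the anchor is high (hard negatives).
u = torch.randn(4, 64)   # e.g., user-side embeddings
v = torch.randn(4, 64)   # e.g., item-side embeddings
print(hamming_similarity(u, v))
```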
Abstract: Although single object trackers have achieved advanced performance, their large-scale models make it difficult to deploy them on platforms with limited resources. Moreover, existing lightweight trackers achieve a balance among only two or three of the following: parameters, performance, FLOPs, and FPS. To achieve the optimal balance among all of these points, this paper proposes a lightweight fully convolutional Siamese tracker called LightFC. LightFC employs a novel efficient cross-correlation module (ECM) and a novel efficient rep-center head (ERH) to enhance the nonlinear expressiveness of the convolutional tracking pipeline. The ECM adopts an attention-like module design, which performs spatial and channel linear fusion of the fused features and enhances their nonlinearity. Additionally, it draws on successful design factors of current lightweight trackers and introduces skip-connections and the reuse of search-area features. The ERH reparameterizes the feature-dimension stage of the standard center head and introduces channel attention to optimize the bottleneck of key feature flows. Comprehensive experiments show that LightFC achieves the optimal balance among performance, parameters, FLOPs, and FPS. The precision score of LightFC outperforms MixFormerV2-S by 3.7% and 6.5% on LaSOT and TNL2K, respectively, while using 5x fewer parameters and 4.6x fewer FLOPs. Besides, LightFC runs 2x faster than MixFormerV2-S on CPUs. Our code and raw results can be found at https://github.com/LiYunfengLYF/LightFC
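As a rough illustration of cross-correlation fusion with a skip-connection and the reuse of search-area features mentioned above, the sketch below uses pixel-wise correlation between template and search features followed by a 1x1 convolutional fusion; the layer choices and names are assumptions for illustration only, not LightFC's actual ECM.

```python
# Minimal sketch: pixel-wise cross-correlation between template (z) and
# search (x) features, concatenated with x itself (feature reuse) and fused.
# CrossCorrelationFusion is a hypothetical name, not a LightFC class.
import torch
import torch.nn as nn

class CrossCorrelationFusion(nn.Module):
    def __init__(self, channels: int, template_tokens: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(channels + template_tokens, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # z: template features (B, C, Hz, Wz); x: search features (B, C, Hx, Wx)
        kernel = z.flatten(2)                             # (B, C, Nz)
        corr = torch.einsum('bcn,bchw->bnhw', kernel, x)  # pixel-wise correlation
        out = torch.cat([corr, x], dim=1)                 # reuse search features
        return self.fuse(out) + x                         # skip-connection

z = torch.randn(2, 96, 8, 8)     # template: 8x8 = 64 tokens
x = torch.randn(2, 96, 16, 16)
ecm = CrossCorrelationFusion(channels=96, template_tokens=64)
print(ecm(z, x).shape)           # torch.Size([2, 96, 16, 16])
```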