Wu Wei

EffLoc: Lightweight Vision Transformer for Efficient 6-DOF Camera Relocalization

Feb 21, 2024

Zhendong Xiao, Changhao Chen, Shan Yang, Wu Wei

Abstract: Camera relocalization is pivotal in computer vision, with applications in AR, drones, robotics, and autonomous driving. It estimates the 3D position and orientation of a camera (its 6-DoF pose) from images. Unlike traditional methods such as SLAM, recent approaches use deep learning for direct end-to-end pose estimation. We propose EffLoc, a novel efficient Vision Transformer for single-image camera relocalization. EffLoc's hierarchical layout, memory-bound self-attention, and feed-forward layers boost memory efficiency and inter-channel communication. Our sequential group attention (SGA) module enhances computational efficiency by diversifying input features, reducing redundancy, and expanding model capacity. EffLoc excels in efficiency and accuracy, outperforming prior methods such as AtLoc and MapNet. It performs well in large-scale outdoor car-driving scenarios while remaining simple and end-to-end trainable, and it eliminates the need for handcrafted loss functions.

* 8 pages, 6 figures, ICRA 2024 accepted
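
As a rough illustration of what a 6-DoF pose estimate involves, the sketch below scores a predicted pose against a reference pose with a translation error and a rotation error. The abstract does not describe an evaluation protocol, so the unit-quaternion representation in (w, x, y, z) order and the function names are assumptions made for this example, not EffLoc's implementation.

```python
import math


def normalize(q):
    """Scale a quaternion (w, x, y, z) to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    if norm == 0.0:
        raise ValueError("zero-length quaternion")
    return tuple(c / norm for c in q)


def translation_error(t_pred, t_true):
    """Euclidean distance between predicted and reference positions."""
    return math.dist(t_pred, t_true)


def rotation_error_deg(q_pred, q_true):
    """Angle in degrees between two orientations given as quaternions.

    The absolute value of the dot product is used because q and -q
    describe the same rotation.
    """
    qp, qt = normalize(q_pred), normalize(q_true)
    dot = abs(sum(a * b for a, b in zip(qp, qt)))
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return math.degrees(2.0 * math.acos(dot))
```

Here the position part of the pose is a 3-vector and the orientation part is a quaternion, which together cover the six degrees of freedom mentioned in the abstract.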

CI-Net: Contextual Information for Joint Semantic Segmentation and Depth Estimation

Jul 29, 2021

Tianxiao Gao, Wu Wei, Zhongbin Cai, Zhun Fan, Shane Xie, Xinmei Wang, Qiuda Yu

Abstract: Monocular depth estimation and semantic segmentation are two fundamental goals of scene understanding. Because the two tasks can benefit from each other, many works study joint task learning. However, most existing methods fail to fully leverage the semantic labels, ignoring the context structures they provide and using them only to supervise the prediction of the segmentation split. In this paper, we propose a network injected with contextual information (CI-Net) to solve the problem. Specifically, we introduce a self-attention block in the encoder to generate an attention map. With supervision from ground truth created from the semantic labels, the network is embedded with contextual information so that it can understand the scene better, utilizing dependent features to make accurate predictions. In addition, a feature sharing module is constructed to deeply fuse the task-specific features, and a consistency loss is devised so that the features guide each other. We evaluate the proposed CI-Net on the NYU-Depth-v2 and SUN-RGBD datasets. The experimental results show that CI-Net is competitive with the state of the art.

* 10 pages, 9 figures
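
The abstract says the attention map is supervised by ground truth created from semantic labels but does not say how that ground truth is built. One plausible construction, shown here purely as an assumption, is a pairwise same-class affinity map: position i should attend to position j when both carry the same label. The function name `affinity_target` is hypothetical.

```python
def affinity_target(labels):
    """Build a row-normalized same-class affinity map from flat labels.

    labels: class id per spatial position, flattened to one list.
    Returns an n x n list of lists in which each row sums to 1, so it
    can be compared with a softmax-normalized attention map.
    NOTE: an assumed construction, not the one confirmed by the paper.
    """
    n = len(labels)
    rows = []
    for i in range(n):
        same = [1.0 if labels[j] == labels[i] else 0.0 for j in range(n)]
        total = sum(same)  # at least 1, since position i matches itself
        rows.append([s / total for s in same])
    return rows
```

For the labels `[0, 0, 1]`, the first two positions split their attention evenly between each other and the third attends only to itself.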

Video Affective Effects Prediction with Multi-modal Fusion and Shot-Long Temporal Context

Sep 01, 2019

Jie Zhang, Yin Zhao, Longjun Cai, Chaoping Tu, Wu Wei

Abstract: Predicting the emotional impact of videos using machine learning is a challenging task, given the variety of modalities, the complicated temporal context of the video, and the time dependency of the emotional states. Feature extraction, multi-modal fusion, and temporal context fusion are crucial stages for predicting valence and arousal values in the emotional impact, but have not been successfully exploited. In this paper, we propose a comprehensive framework with novel designs of modal structure and multi-modal fusion strategy. We select the most suitable modalities for the valence and arousal tasks respectively, and each modal feature is extracted using a modality-specific deep model pre-trained on a large generic dataset. Two time-scale structures, one intra-clip and the other inter-clip, are proposed to capture the temporal dependency of video content and emotional states. To combine the complementary information from multiple modalities, an effective and efficient residual-based progressive training strategy is proposed. Each modality is combined into the multi-modal model step by step and is responsible for completing the missing parts of the features. With all of these improvements, our proposed prediction framework outperforms the state of the art on the LIRIS-ACCEDE dataset by a large margin.
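
The residual-based progressive idea can be sketched in a toy form: modalities are added one at a time, and each new one is fit only to what the previous ones left unexplained. The abstract gives no implementation details, so the one-dimensional least-squares regressors below are stand-ins for the deep models, and the names `fit_line` and `progressive_fit` are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = cov_xy / var_x if var_x else 0.0
    return slope, mean_y - slope * mean_x


def progressive_fit(modalities, targets):
    """Fit one model per modality, each on the residual left so far.

    modalities: list of feature lists, one scalar feature per sample.
    targets: valence or arousal values, one per sample.
    Returns the fitted (slope, intercept) pairs and the final residual.
    """
    residual = list(targets)
    models = []
    for feats in modalities:
        slope, intercept = fit_line(feats, residual)
        models.append((slope, intercept))
        # The next modality only has to explain what is still missing.
        residual = [r - (slope * f + intercept) for r, f in zip(residual, feats)]
    return models, residual
```

In this toy setting the second modality sees only the error of the first, which mirrors the abstract's description of each modality completing the missing parts.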
