Joshua Marshall

Learning Sequential Contexts using Transformer for 3D Hand Pose Estimation

Jun 01, 2022

Leyla Khaleghi, Joshua Marshall, Ali Etemad

Abstract: 3D hand pose estimation (HPE) is the process of locating the joints of the hand in 3D from any visual input. HPE has recently received an increased amount of attention due to its key role in a variety of human-computer interaction applications. Recent HPE methods have demonstrated the advantages of employing videos or multi-view images, allowing for more robust HPE systems. Accordingly, in this study, we propose a new method to perform Sequential learning with Transformer for Hand Pose (SeTHPose) estimation. Our SeTHPose pipeline begins by extracting visual embeddings from individual hand images. We then use a transformer encoder to learn the sequential context along time or viewing angles and generate accurate 2D hand joint locations. Then, a graph convolutional neural network with a U-Net configuration is used to convert the 2D hand joint locations to 3D poses. Our experiments show that SeTHPose performs well on both hand sequence varieties, temporal and angular. SeTHPose also outperforms other methods in the field, achieving new state-of-the-art results on two publicly available sequential datasets, STB and MuViHand.
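
The middle stage described in this abstract, where a sequence of per-image embeddings is turned into per-image 2D joints using sequential context, can be illustrated with a rough sketch. This is not the authors' implementation: it uses a single self-attention layer in plain Python rather than a full transformer encoder, random weights instead of trained ones, and random vectors in place of real image embeddings. The joint count of 21, the sequence length, the embedding size, and every function name are assumptions made for illustration. The graph U-Net that lifts 2D joints to 3D is left out.

```python
import math
import random

NUM_JOINTS = 21  # assumption: a common hand-skeleton size, not stated in the abstract


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def self_attention(seq, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence.

    seq has one row per frame (or per camera view). Each output row is a
    weighted mix of every position in the sequence.
    """
    q, k, v = matmul(seq, w_q), matmul(seq, w_k), matmul(seq, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def predict_2d_joints(seq, params):
    """Map a sequence of embeddings to one set of 2D joints per position."""
    context, weights = self_attention(seq, params["w_q"], params["w_k"], params["w_v"])
    flat = matmul(context, params["w_out"])  # one row of NUM_JOINTS * 2 values per position
    joints = [[(row[2 * j], row[2 * j + 1]) for j in range(NUM_JOINTS)] for row in flat]
    return joints, weights


def make_params(dim, rng):
    """Hypothetical parameter set for the sketch above."""
    return {
        "w_q": random_matrix(dim, dim, rng),
        "w_k": random_matrix(dim, dim, rng),
        "w_v": random_matrix(dim, dim, rng),
        "w_out": random_matrix(dim, NUM_JOINTS * 2, rng),
    }


rng = random.Random(0)
SEQ_LEN, EMBED_DIM = 4, 8  # assumed sizes, chosen only to keep the example small
embeddings = random_matrix(SEQ_LEN, EMBED_DIM, rng)  # stand-in for image-encoder output
params = make_params(EMBED_DIM, rng)
joints_2d, attn = predict_2d_joints(embeddings, params)
print(len(joints_2d), len(joints_2d[0]))
```

The same function applies whether the sequence axis is time (video frames) or viewing angle (camera views), which is the sense in which the abstract treats temporal and angular sequences uniformly.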

Multi-View Video-Based 3D Hand Pose Estimation

Sep 24, 2021

Leyla Khaleghi, Alireza Sepas Moghaddam, Joshua Marshall, Ali Etemad

Abstract: Hand pose estimation (HPE) can be used for a variety of human-computer interaction applications such as gesture-based control for physical or virtual/augmented reality devices. Recent works have shown that videos or multi-view images carry rich information regarding the hand, allowing for the development of more robust HPE systems. In this paper, we present the Multi-View Video-Based 3D Hand (MuViHand) dataset, consisting of multi-view videos of the hand along with ground-truth 3D pose labels. Our dataset includes more than 402,000 synthetic hand images available in 4,560 videos. The videos have been simultaneously captured from six different angles with complex backgrounds and random levels of dynamic lighting. The data has been captured from 10 distinct animated subjects using 12 cameras in a semi-circle topology where six tracking cameras only focus on the hand and the other six fixed cameras capture the entire body. Next, we implement MuViHandNet, a neural pipeline consisting of image encoders for obtaining visual embeddings of the hand, recurrent learners to learn both temporal and angular sequential information, and graph networks with U-Net architectures to estimate the final 3D pose information. We perform extensive experiments and show the challenging nature of this new dataset as well as the effectiveness of our proposed method. Ablation studies show the added value of each component in MuViHandNet, as well as the benefit of having temporal and sequential information in the dataset.

* 14 pages, 15 figures
