
Title: ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

Apr 26, 2022

Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao



Abstract: Recently, customized vision transformers have been adapted for human pose estimation and have achieved superior performance with elaborate structures. However, it is still unclear whether plain vision transformers can facilitate pose estimation. In this paper, we take the first step toward answering the question by employing a plain and non-hierarchical vision transformer together with simple deconvolution decoders termed ViTPose for human pose estimation. We demonstrate that a plain vision transformer with MAE pretraining can obtain superior performance after finetuning on human pose estimation datasets. ViTPose has good scalability with respect to model size and flexibility regarding input resolution and token number. Moreover, it can be easily pretrained using the unlabeled pose data without the need for large-scale upstream ImageNet data. Our biggest ViTPose model based on the ViTAE-G backbone with 1 billion parameters obtains the best 80.9 mAP on the MS COCO test-dev set, while the ensemble models further set a new state-of-the-art for human pose estimation, i.e., 81.1 mAP. The source code and models will be released at https://github.com/ViTAE-Transformer/ViTPose.
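To make the abstract's remark about input resolution and token number concrete, here is a minimal shape-arithmetic sketch in Python. It is not taken from the paper's code. The patch size of 16, the two stride-2 deconvolution layers, the 17 keypoints, and the function and class names are all illustrative assumptions. It only shows how the token count of a plain, non-hierarchical encoder and the size of the decoder's heatmaps would follow from the input resolution under those assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoseShapes:
    """Tensor shapes for a plain-ViT encoder plus deconvolution decoder (hypothetical helper)."""
    tokens_h: int       # token grid height after patch embedding
    tokens_w: int       # token grid width after patch embedding
    num_tokens: int     # sequence length seen by the transformer
    heatmap_h: int      # decoder output height
    heatmap_w: int      # decoder output width
    num_keypoints: int  # one heatmap channel per keypoint


def vitpose_like_shapes(img_h: int, img_w: int, patch: int = 16,
                        num_deconv: int = 2, num_keypoints: int = 17) -> PoseShapes:
    """Compute token and heatmap shapes.

    Assumptions (not stated in the abstract): non-overlapping square patches,
    each deconvolution layer upsamples by exactly 2x, and 17 keypoints.
    """
    if img_h % patch or img_w % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens_h, tokens_w = img_h // patch, img_w // patch
    # A non-hierarchical encoder keeps the token grid fixed through all blocks,
    # so only the decoder changes the spatial resolution.
    scale = 2 ** num_deconv
    return PoseShapes(tokens_h, tokens_w, tokens_h * tokens_w,
                      tokens_h * scale, tokens_w * scale, num_keypoints)


if __name__ == "__main__":
    for h, w in [(256, 192), (384, 288)]:
        print((h, w), vitpose_like_shapes(h, w))
```

Under these assumptions, a larger input changes only the token count and the heatmap size. The encoder itself is unchanged, which is one way to read the flexibility the abstract describes.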

* Tech report. 81.1 mAP on MS COCO Keypoint Detection test-dev set

