Abstract: Multi-person pose estimation is one of the mainstream tasks in computer vision. Existing approaches are either top-down methods, which require an additional human detector, or bottom-up methods, which must perform heuristic grouping after predicting all human keypoints. Both handle keypoint detection and grouping as separate steps, which leads to low efficiency. In this work, we propose an end-to-end network framework for multi-person pose regression that predicts instance-aware keypoints directly. The framework is cascaded: the first stage provides a basic estimate, and we then apply the proposed OKSFilter to remove low-quality predictions so that the second stage can focus on refining the better results. In addition, to quantify the quality of the predicted poses, we propose a pose scoring module (PSM), so that when non-maximum suppression (NMS) is applied at inference, poses of the correct type and high quality are preserved. We evaluate the method on the COCO keypoint benchmark. The experiments show that our multi-person pose regression network is feasible and effective, and that the two newly proposed modules help improve the model's performance.
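
To make the filtering and suppression steps concrete, the following is a minimal sketch rather than the implementation used in this work. It assumes that the OKSFilter thresholds the COCO-style object keypoint similarity (OKS) between first-stage predictions and a reference pose, and that NMS ranks poses by a scalar quality score such as the one a pose scoring module might output. The function names `oks`, `oks_filter`, and `oks_nms`, the threshold values, and the use of a single shared object area are all hypothetical choices made for illustration.

```python
import math


def oks(pred, ref, visible, area, sigmas):
    """COCO-style object keypoint similarity between two poses.

    pred, ref: lists of (x, y) keypoints in the same order.
    visible:   per-keypoint flags; only visible keypoints are counted.
    area:      object area (the squared object scale).
    sigmas:    per-keypoint falloff constants.
    """
    total, count = 0.0, 0
    for (px, py), (rx, ry), v, sigma in zip(pred, ref, visible, sigmas):
        if not v:
            continue
        d2 = (px - rx) ** 2 + (py - ry) ** 2
        k2 = (2.0 * sigma) ** 2
        total += math.exp(-d2 / (2.0 * area * k2))
        count += 1
    return total / count if count else 0.0


def oks_filter(preds, ref, visible, area, sigmas, threshold=0.5):
    """Keep predictions whose OKS to the reference pose reaches the threshold.

    Assumption: this mimics removing low-quality first-stage predictions;
    the threshold value is illustrative only.
    """
    return [p for p in preds if oks(p, ref, visible, area, sigmas) >= threshold]


def oks_nms(poses, scores, area, sigmas, threshold=0.9):
    """Greedy NMS: visit poses by descending score and drop near-duplicates.

    Assumption: all poses share one object area, a simplification made to
    keep the sketch short. Returns the indices of the poses that are kept.
    """
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    all_visible = [True] * len(sigmas)
    keep = []
    for i in order:
        if all(oks(poses[i], poses[k], all_visible, area, sigmas) < threshold
               for k in keep):
            keep.append(i)
    return keep
```

In this sketch the quality score decides which of several overlapping poses survives suppression, which is why a score that reflects pose quality, rather than only classification confidence, matters at inference.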