Abstract:Existing Object Pose Estimation (OPE) methods for stacked scenarios are not robust to changes in object scale. This paper proposes a new 6DoF OPE network (NormNet) for different scale objects in stacked scenarios. Specifically, each object's scale is first learned with point-wise regression. Then, all objects in the stacked scenario are normalized into the same scale through semantic segmentation and affine transformation. Finally, they are fed into a shared pose estimator to recover their 6D poses. In addition, we introduce a new Sim-to-Real transfer pipeline, combining style transfer and domain randomization. This improves the NormNet's performance on real data even if we only train it on synthetic data. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on public benchmarks and the MultiScale dataset we constructed. The real-world experiments show that our method can robustly estimate the 6D pose of objects at different scales.