Abstract: In imitation learning for robotic manipulation, decomposing object manipulation tasks into multiple semantic actions is essential. This decomposition enables learned skills to be reused in varying contexts and combined to perform novel tasks, rather than merely replicating demonstrated motions. Gaze, an evolutionary tool for understanding ongoing events, plays a critical role in human object manipulation, where it strongly correlates with motion planning. In this study, we propose a simple yet robust task decomposition method based on gaze transitions. We hypothesize that an imitation agent's gaze control, which fixates on specific landmarks and transitions between them, naturally segments demonstrated manipulations into sub-tasks. Notably, our method decomposes tasks consistently across all demonstrations, a property that is desirable for downstream uses such as machine learning. Using teleoperation, a common modality in imitation learning for robotic manipulation, we collected demonstration data for various tasks, applied our segmentation method, and evaluated the characteristics and consistency of the resulting sub-tasks. Furthermore, through extensive testing across a wide range of hyperparameter settings, we demonstrated that the proposed method is robust enough to be applied to different robotic systems.
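The idea of segmenting a demonstration at gaze transitions can be illustrated with a minimal sketch. This is not the method evaluated in this study; it only assumes that gaze is recorded as one 2D position per time step and that a transition between landmarks appears as a large jump between consecutive samples. The function name `segment_by_gaze` and the parameters `jump_threshold` and `min_length` are hypothetical and chosen for illustration.

```python
import math


def segment_by_gaze(gaze, jump_threshold=0.1, min_length=3):
    """Split a demonstration into sub-tasks at large gaze jumps.

    gaze: list of (x, y) gaze positions, one per time step (assumed format).
    jump_threshold: hypothetical distance above which a move between
        consecutive samples counts as a gaze transition.
    min_length: hypothetical minimum number of steps in a segment before
        the next boundary may be placed, to suppress spurious splits.
    Returns a list of (start, end) index pairs with end exclusive.
    """
    if not gaze:
        return []
    boundaries = [0]
    for t in range(1, len(gaze)):
        jump = math.dist(gaze[t - 1], gaze[t])
        # Treat a large jump as a transition from one landmark to the next,
        # but only if the current segment is already long enough.
        if jump > jump_threshold and t - boundaries[-1] >= min_length:
            boundaries.append(t)
    boundaries.append(len(gaze))
    return list(zip(boundaries[:-1], boundaries[1:]))


# Toy demonstration: three fixations on three different landmarks.
demo = [(0.0, 0.0)] * 5 + [(1.0, 1.0)] * 5 + [(0.0, 1.0)] * 4
segments = segment_by_gaze(demo)
```

In this toy example each returned index pair covers one fixation, so every sub-task corresponds to the interval during which gaze rests on a single landmark. Varying `jump_threshold` and `min_length` on such data is one simple way to probe how sensitive a segmentation of this kind is to its hyperparameters.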