Title: GPT-4V for Robotics: Multimodal Task Planning from Human Demonstration

Nov 20, 2023

Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

Abstract: We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision), by integrating observations of human actions to facilitate robotic manipulation. The system analyzes videos of humans performing tasks and creates executable robot programs that incorporate affordance insights. The computation starts by analyzing the videos with GPT-4V to convert environmental and action details into text, which is then passed to a GPT-4-empowered task planner. Next, vision systems reanalyze the video using the task plan. Object names are grounded using an open-vocabulary object detector, while a focus on the hand-object relation helps to detect the moments of grasping and releasing. This spatiotemporal grounding allows the vision systems to gather further affordance data (e.g., grasp type, waypoints, and body postures). Experiments across various scenarios demonstrate the method's efficacy in operating real robots from human demonstrations in a zero-shot manner. The GPT-4V/GPT-4 prompts are available at the project page: https://microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts/

* 8 pages, 10 figures, 1 table. Last updated on November 20th, 2023
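
The abstract says that attention to the hand-object relation helps detect the moments of grasping and releasing, but it does not say how. The sketch below is one simple way such timing could be recovered, not the authors' implementation. It assumes a hypothetical per-frame boolean signal `contact` (true when some upstream detector reports the hand touching the object) and treats each sufficiently long run of contact as one grasp-release pair. The names `GraspReleaseEvent`, `find_grasp_release`, and `min_run` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GraspReleaseEvent:
    grasp_frame: int              # first frame of a sustained contact run
    release_frame: Optional[int]  # first frame after the run; None if contact lasts to the end


def find_grasp_release(contact: List[bool], min_run: int = 3) -> List[GraspReleaseEvent]:
    """Turn a per-frame hand-object contact signal into grasp/release events.

    `contact` is an assumed input (one boolean per video frame). Runs of
    contact shorter than `min_run` frames are treated as detector noise
    and ignored.
    """
    events: List[GraspReleaseEvent] = []
    start: Optional[int] = None  # index where the current contact run began

    for i, touching in enumerate(contact):
        if touching and start is None:
            start = i  # contact begins: candidate grasp moment
        elif not touching and start is not None:
            if i - start >= min_run:
                events.append(GraspReleaseEvent(start, i))
            start = None  # contact ended: run is closed either way

    # A run still open at the end of the video has a grasp but no release.
    if start is not None and len(contact) - start >= min_run:
        events.append(GraspReleaseEvent(start, None))

    return events


if __name__ == "__main__":
    signal = [False, False, True, True, True, True, False, False, True, False]
    for event in find_grasp_release(signal):
        print(event)
```

In a fuller pipeline, the frame indices returned here would mark where the later vision steps look for affordance data such as grasp type and waypoints. The run-length threshold is a deliberately crude noise filter chosen for illustration.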
