Title: ROCKET-2: Steering Visuomotor Policy via Cross-View Goal Alignment

Mar 04, 2025

Shaofei Cai, Zhancun Mu, Anji Liu, Yitao Liang

Abstract: We aim to develop a goal specification method that is semantically clear, spatially sensitive, and intuitive for human users to guide agent interactions in embodied environments. Specifically, we propose a novel cross-view goal alignment framework that allows users to specify target objects using segmentation masks from their own camera views rather than the agent's observations. We highlight that behavior cloning alone fails to align the agent's behavior with human intent when the human and agent camera views differ significantly. To address this, we introduce two auxiliary objectives, a cross-view consistency loss and a target visibility loss, which explicitly enhance the agent's spatial reasoning ability. Building on this, we develop ROCKET-2, a state-of-the-art agent trained in Minecraft that achieves a 3x to 6x improvement in inference efficiency. We show that ROCKET-2 can, for the first time, directly interpret goals from human camera views, paving the way for better human-agent interaction.
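The abstract names three training signals (behavior cloning, a cross-view consistency loss, and a target visibility loss) but does not give their form. The sketch below is only an illustration of how such terms might be combined into one objective. Every function name, the choice of squared distance for consistency, the binary cross-entropy for visibility, and the weights `w_consistency` and `w_visibility` are assumptions, not the paper's definitions.

```python
import math


def cross_entropy(action_probs, expert_action):
    """Behavior-cloning term: negative log-likelihood of the expert action."""
    return -math.log(max(action_probs[expert_action], 1e-12))


def binary_cross_entropy(p, label):
    """Binary cross-entropy for one predicted probability p and a 0/1 label."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def squared_distance(pred_xy, true_xy):
    """Squared Euclidean distance between two 2D points."""
    return sum((a - b) ** 2 for a, b in zip(pred_xy, true_xy))


def total_loss(action_probs, expert_action,
               pred_center, true_center,
               pred_visible, is_visible,
               w_consistency=1.0, w_visibility=1.0):
    """Hypothetical combined objective: BC + consistency + visibility.

    pred_center / true_center: assumed stand-in for cross-view consistency,
        the target's location in the agent's view (predicted vs. labelled).
    pred_visible / is_visible: assumed stand-in for target visibility,
        the predicted probability vs. a 0/1 label that the target is in view.
    """
    bc = cross_entropy(action_probs, expert_action)
    # Assumption: the location term only applies when the target is visible.
    consistency = squared_distance(pred_center, true_center) if is_visible else 0.0
    visibility = binary_cross_entropy(pred_visible, is_visible)
    return bc + w_consistency * consistency + w_visibility * visibility


loss = total_loss(
    action_probs=[0.7, 0.2, 0.1], expert_action=0,
    pred_center=(0.4, 0.5), true_center=(0.5, 0.5),
    pred_visible=0.9, is_visible=1,
)
print(f"combined loss: {loss:.4f}")
```

The point of the sketch is the structure: the auxiliary terms give the policy a direct training signal about where the target is and whether it can be seen, which the behavior-cloning term alone does not provide when the human and agent views differ.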
