Title: Run-time Observation Interventions Make Vision-Language-Action Models More Visually Robust

Oct 02, 2024

Asher J. Hancock, Allen Z. Ren, Anirudha Majumdar

Abstract: Vision-language-action (VLA) models trained on large-scale internet data and robot demonstrations have the potential to serve as generalist robot policies. However, despite their large-scale training, VLAs are often brittle to task-irrelevant visual details such as distractor objects or background colors. We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme that (1) dynamically identifies regions of the input image that the model is sensitive to, and (2) minimally alters task-irrelevant regions to reduce the model's sensitivity using automated image editing tools. Our approach is compatible with any off-the-shelf VLA and requires neither model fine-tuning nor access to the model's weights. Hardware experiments on language-instructed manipulation tasks demonstrate that BYOVLA enables state-of-the-art VLA models to nearly retain their nominal performance in the presence of distractor objects and backgrounds, which otherwise degrade task success rates by up to 40%. Website with additional information, videos, and code: https://aasherh.github.io/byovla/

* Website: https://aasherh.github.io/byovla/
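The two steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the website above for their code). It assumes the task-irrelevant regions are already known and given as boxes, it measures sensitivity by overwriting a region with probe values and comparing the policy's actions, and it uses a constant fill as a stand-in for the automated image editing tools. Every name here (`fill_box`, `sensitivity`, `intervene`, `toy_policy`, the threshold value) is hypothetical. The policy is treated as a black box, matching the claim that no weights or fine-tuning are needed.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Image = List[List[float]]          # grayscale image as rows of pixel values in [0, 1]
Box = Tuple[int, int, int, int]    # (top, left, bottom, right), end-exclusive
Policy = Callable[[Image], Sequence[float]]  # black-box policy: image -> action


def fill_box(image: Image, box: Box, value: float) -> Image:
    """Return a copy of `image` with every pixel inside `box` set to `value`."""
    top, left, bottom, right = box
    out = [row[:] for row in image]
    for r in range(top, bottom):
        for c in range(left, right):
            out[r][c] = value
    return out


def action_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two action vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def sensitivity(policy: Policy, image: Image, box: Box,
                probe_values: Sequence[float] = (0.0, 1.0)) -> float:
    """Largest action change observed when `box` is overwritten by a probe value.

    Assumption: a simple perturbation test, not the paper's sensitivity measure.
    """
    base = policy(image)
    return max(action_distance(base, policy(fill_box(image, box, v)))
               for v in probe_values)


def intervene(policy: Policy, image: Image, irrelevant: Dict[str, Box],
              threshold: float, background: float = 0.5):
    """Edit only the task-irrelevant regions the policy is sensitive to."""
    edited, changed = image, []
    for name, box in irrelevant.items():
        if sensitivity(policy, edited, box) > threshold:
            # Constant fill stands in for an inpainting / image-editing tool.
            edited = fill_box(edited, box, background)
            changed.append(name)
    return edited, changed


def toy_policy(image: Image) -> List[float]:
    """Hypothetical policy whose action depends only on the top-left 2x2 patch."""
    patch = [image[r][c] for r in range(2) for c in range(2)]
    return [sum(patch) / len(patch)]


# 4x4 scene: uniform background with a bright distractor in the top-left corner.
image = fill_box([[0.5] * 4 for _ in range(4)], (0, 0, 2, 2), 1.0)
regions = {"distractor": (0, 0, 2, 2), "background_patch": (2, 2, 4, 4)}
```

Calling `intervene(toy_policy, image, regions, threshold=0.1)` edits only the region that the toy policy actually reacts to and leaves the other region and the original image untouched, which mirrors the "minimally alters" wording in the abstract.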
