Title: ViSTa Dataset: Do vision-language models understand sequential tasks?

Nov 21, 2024

Evžen Wybitul, Evan Ryan Gunter, Mikhail Seleznyov, David Lindner

Abstract: Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety. So far, VLM reward models have only been used for goal-oriented tasks, where the agent must reach a particular final outcome. We explore VLMs' potential to supervise tasks that cannot be scored by the final state alone. To this end, we introduce ViSTa, a dataset for evaluating Vision-based understanding of Sequential Tasks. ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual home, Minecraft, and real-world environments. Its novel hierarchical structure -- basic single-step tasks composed into more and more complex sequential tasks -- allows a fine-grained understanding of how well VLMs can judge tasks with varying complexity. To illustrate this, we use ViSTa to evaluate state-of-the-art VLMs, including CLIP, ViCLIP, and GPT-4o. We find that, while they are all good at object recognition, they fail to understand sequential tasks, with only GPT-4o achieving non-trivial performance.
