Hao Yuan Bai

iWISDM: Assessing instruction following in multimodal models at scale

Jun 20, 2024

Xiaoxuan Lei, Lucas Gomez, Hao Yuan Bai, Pouya Bashivan

Abstract: The ability to perform complex tasks from detailed instructions is key to many remarkable achievements of our species. As humans, we are capable of performing not only a wide variety of tasks but also very complex ones that may entail hundreds or thousands of steps to complete. Large language models and their more recent multimodal counterparts that integrate textual and visual inputs have achieved unprecedented success in performing complex tasks. Yet most existing benchmarks are largely confined to single-modality inputs (either text or vision), narrowing the scope of multimodal assessments, particularly for instruction following in multimodal contexts. To bridge this gap, we introduce the instructed-Virtual VISual Decision Making (iWISDM) environment, engineered to generate a limitless array of vision-language tasks of varying complexity. Using iWISDM, we compiled three distinct benchmarks of instruction-following visual tasks across varying complexity levels and evaluated several newly developed multimodal models on these benchmarks. Our findings establish iWISDM as a robust benchmark for assessing the instructional adherence of both existing and emergent multimodal models and highlight a large gap between these models' ability to precisely follow instructions and that of humans.
