Title: Aria-UI: Visual Grounding for GUI Instructions

Dec 20, 2024

Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, Junnan Li

Abstract: Digital agents that automate tasks across different platforms by directly manipulating GUIs are increasingly important. For these agents, grounding language instructions to target elements remains a significant challenge due to reliance on HTML or AXTree inputs. In this paper, we introduce Aria-UI, a large multimodal model specifically designed for GUI grounding. Aria-UI adopts a pure-vision approach, eschewing reliance on auxiliary inputs. To adapt to heterogeneous planning instructions, we propose a scalable data pipeline that synthesizes diverse and high-quality instruction samples for grounding. To handle dynamic contexts during task execution, Aria-UI incorporates textual and text-image interleaved action histories, enabling robust context-aware reasoning for grounding. Aria-UI sets new state-of-the-art results across offline and online agent benchmarks, outperforming both vision-only and AXTree-reliant baselines. We release all training data and model checkpoints to foster further research at https://ariaui.github.io.
