Title: Comprehensive Cognitive LLM Agent for Smartphone GUI Automation

Feb 19, 2024

Xinbei Ma, Zhuosheng Zhang, Hai Zhao

Abstract: Large language models (LLMs) have shown remarkable potential as human-like autonomous language agents that interact with real-world environments, especially for graphical user interface (GUI) automation. However, such GUI agents require comprehensive cognitive abilities, including exhaustive perception and reliable action response. We propose the Comprehensive Cognitive LLM Agent (CoCo-Agent), with two novel approaches, comprehensive environment perception (CEP) and conditional action prediction (CAP), to systematically improve GUI automation performance. First, CEP facilitates GUI perception through different aspects and granularities, including screenshots and complementary detailed layouts for the visual channel and historical actions for the textual channel. Second, CAP decomposes action prediction into two sub-problems: predicting the action type, and predicting the action target conditioned on that type. With this technical design, our agent achieves new state-of-the-art performance on the AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
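The abstract describes CEP and CAP only at a high level. The Python sketch below illustrates the two ideas under stated assumptions: the names `Observation`, `build_context`, and `predict_action`, the action types, and the use of score dictionaries in place of the LLM are all hypothetical and are not taken from the paper's implementation. CEP is reduced to assembling one textual context from a screen summary, indexed layout elements, and the action history; CAP is shown as greedy two-stage decoding following the factorization p(type, target | context) = p(type | context) · p(target | type, context).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


# Hypothetical observation record; field names are illustrative, not from the paper.
@dataclass
class Observation:
    screen_summary: str                 # stand-in for the screenshot (visual channel)
    layout: List[str]                   # detailed layout elements, e.g. on-screen text or icons
    history: List[str] = field(default_factory=list)  # previous actions (textual channel)


def build_context(goal: str, obs: Observation) -> str:
    """CEP-style sketch: merge goal, screen, indexed layout, and action history."""
    layout = "\n".join(f"[{i}] {element}" for i, element in enumerate(obs.layout))
    history = "; ".join(obs.history) if obs.history else "none"
    return (
        f"Goal: {goal}\n"
        f"Screen: {obs.screen_summary}\n"
        f"Layout:\n{layout}\n"
        f"Previous actions: {history}"
    )


# Scoring callables stand in for the LLM; both are assumptions of this sketch.
TypeScorer = Callable[[str], Dict[str, float]]
TargetScorer = Callable[[str, str], Dict[str, float]]


def predict_action(
    context: str, score_types: TypeScorer, score_targets: TargetScorer
) -> Tuple[str, Optional[str]]:
    """CAP-style sketch: pick the action type first, then the target given that type."""
    type_scores = score_types(context)
    action_type = max(type_scores, key=type_scores.get)
    # The target scores are conditioned on the chosen type; types such as
    # "press_back" return no candidates and therefore get no target.
    target_scores = score_targets(context, action_type)
    target = max(target_scores, key=target_scores.get) if target_scores else None
    return action_type, target


if __name__ == "__main__":
    obs = Observation("home screen", ["Search", "Settings"], ["open app drawer"])
    ctx = build_context("open the settings app", obs)

    def toy_types(context: str) -> Dict[str, float]:
        return {"click": 0.7, "scroll": 0.2, "press_back": 0.1}

    def toy_targets(context: str, action_type: str) -> Dict[str, float]:
        if action_type == "click":
            return {"0": 0.1, "1": 0.9}  # layout indices produced by build_context
        if action_type == "scroll":
            return {"up": 0.5, "down": 0.5}
        return {}

    print(predict_action(ctx, toy_types, toy_targets))  # ('click', '1')
```

Conditioning the target on the type means, for example, that a scroll direction is never scored as if it were a click target, which is the intuition behind splitting the problem. An LLM-based agent may realize this ordering within a single generated sequence rather than through two separate calls as in this sketch.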