Title: AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents

Jan 24, 2024

Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, Junxian He



Abstract: Evaluating large language models (LLMs) as general-purpose agents is essential for understanding their capabilities and facilitating their integration into practical applications. However, the evaluation process presents substantial challenges. A primary obstacle is benchmarking agent performance across diverse scenarios within a unified framework, especially while maintaining partially observable environments and ensuring multi-round interactions. Moreover, current evaluation frameworks mostly focus on the final success rate, which reveals little about the intermediate process and fails to provide a deep understanding of model abilities. To address these challenges, we introduce AgentBoard, a pioneering comprehensive benchmark and accompanying open-source evaluation framework tailored to the analytical evaluation of LLM agents. AgentBoard offers a fine-grained progress rate metric that captures incremental advancements, as well as a comprehensive evaluation toolkit that supports easy, multi-faceted analysis of agents through interactive visualization. This not only sheds light on the capabilities and limitations of LLM agents but also brings the interpretability of their performance to the forefront. Ultimately, AgentBoard serves as a significant step towards demystifying agent behaviors and accelerating the development of stronger LLM agents.
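To make the contrast between a final success rate and a progress rate concrete, here is a minimal Python sketch. It is not AgentBoard's actual definition or API: it assumes, purely for illustration, that a task can be split into a fixed number of subgoals and that we know how many are satisfied after each interaction step. The names `progress_rate`, `is_success`, and `subgoals_met_per_step` are hypothetical.

```python
from typing import Sequence


def progress_rate(subgoals_met_per_step: Sequence[int], total_subgoals: int) -> float:
    """Highest fraction of subgoals satisfied at any step of one episode.

    Assumption (not taken from the abstract): progress is measured as
    matched subgoals divided by the total number of subgoals.
    """
    if total_subgoals <= 0:
        raise ValueError("total_subgoals must be positive")
    if not subgoals_met_per_step:
        return 0.0  # the agent took no steps, so it made no progress
    return max(subgoals_met_per_step) / total_subgoals


def is_success(subgoals_met_per_step: Sequence[int], total_subgoals: int) -> bool:
    """Binary outcome: the episode succeeds only if every subgoal was met."""
    return progress_rate(subgoals_met_per_step, total_subgoals) == 1.0


# Hypothetical episode: the agent satisfies 3 of 4 subgoals within 5 steps.
episode = [0, 1, 1, 2, 3]
print(progress_rate(episode, 4))  # fraction of the task completed
print(is_success(episode, 4))     # binary view of the same episode
```

Under these assumptions, an agent that completes most of a task and an agent that completes none of it both count as failures under a binary success measure, while the progress-style score still tells them apart.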

* Preprint
