Title: Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions

May 30, 2024

Ruochen Zhao, Wenxuan Zhang, Yew Ken Chia, Deli Zhao, Lidong Bing

Abstract: As LLMs evolve on a daily basis, there is an urgent need for a trustworthy evaluation method that can provide robust results in a timely fashion. Because static benchmarks are prone to contamination, users currently tend to trust human voting platforms such as Chatbot Arena. However, human annotation requires extensive manual effort. To provide an automatic, robust, and trustworthy evaluation framework, we propose the Auto-Arena of LLMs, which automates the entire evaluation process with LLM agents. First, an examiner LLM devises queries. Then, a pair of candidate LLMs engage in a multi-round peer-battle around the query, during which the LLMs' true performance gaps become visible. Finally, a committee of LLM judges collectively discusses and determines the winner, which alleviates bias and promotes fairness. In our extensive experiments on the 17 newest LLMs, Auto-Arena shows the highest correlation with human preferences, providing a promising alternative to human evaluation platforms.
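
The three stages named in the abstract (examiner query, multi-round peer-battle, committee discussion) can be illustrated with a minimal sketch. This is not the authors' implementation: every function name here is hypothetical, an "LLM" is modeled as a plain callable from prompt text to response text, and the single discussion round followed by a majority vote is an assumption about how a committee might settle on a winner.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical stand-in: an "LLM" is any callable mapping a prompt to a reply.
LLM = Callable[[str], str]


def peer_battle(query: str, cand_a: LLM, cand_b: LLM, rounds: int = 2) -> List[str]:
    """Let two candidates alternate turns; each sees the transcript so far."""
    transcript = [f"Query: {query}"]
    for _ in range(rounds):
        transcript.append("A: " + cand_a("\n".join(transcript)))
        transcript.append("B: " + cand_b("\n".join(transcript)))
    return transcript


def committee_verdict(transcript: List[str], judges: List[LLM]) -> str:
    """Collect initial votes, share them with all judges, then re-vote.

    Assumption: judges answer with the label "A" or "B", and the final
    decision is a simple majority of the second-round votes.
    """
    battle = "\n".join(transcript)
    initial = [judge(battle) for judge in judges]
    # Discussion step (assumed form): each judge sees the others' initial votes.
    discussion = battle + "\nInitial votes: " + ", ".join(initial)
    final = [judge(discussion) for judge in judges]
    tally = Counter(vote for vote in final if vote in ("A", "B"))
    if tally["A"] == tally["B"]:
        return "tie"
    return "A" if tally["A"] > tally["B"] else "B"


def auto_arena_match(examiner: LLM, cand_a: LLM, cand_b: LLM,
                     judges: List[LLM], topic: str, rounds: int = 2) -> str:
    """Run one match: examiner query, peer-battle, committee decision."""
    query = examiner(topic)
    transcript = peer_battle(query, cand_a, cand_b, rounds)
    return committee_verdict(transcript, judges)
```

In practice each callable would wrap a call to a real model; keeping them as plain functions makes the control flow easy to test with stubs before any model is attached.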
