Aochi Zhang

AgentSims: An Open-Source Sandbox for Large Language Model Evaluation

Aug 08, 2023

Jiaju Lin, Haoran Zhao, Aochi Zhang, Yiting Wu, Huqiuyue Ping, Qin Chen

Abstract: With ChatGPT-like large language models (LLMs) prevailing in the community, how to evaluate the abilities of LLMs remains an open question. Existing evaluation methods suffer from the following shortcomings: (1) constrained evaluation abilities, (2) vulnerable benchmarks, and (3) unobjective metrics. We suggest that task-based evaluation, where LLM agents complete tasks in a simulated environment, is a one-for-all solution to the above problems. We present AgentSims, an easy-to-use infrastructure for researchers from all disciplines to test the specific capacities they are interested in. Researchers can build their evaluation tasks by adding agents and buildings on an interactive GUI, or deploy and test new support mechanisms, i.e. memory, planning, and tool-use systems, with a few lines of code. Our demo is available at https://agentsims.com.
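To make the idea of task-based evaluation with swappable support mechanisms concrete, here is a minimal Python sketch. It is not the AgentSims API: every name in it (`Memory`, `ListMemory`, `Agent`, `Task`, `evaluate`, `echo_policy`) is hypothetical, and the LLM is replaced by a plain function so the example runs offline. It only illustrates the pattern the abstract describes, where an agent with a pluggable memory system acts in an environment and is scored on whether the task was completed.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Protocol


class Memory(Protocol):
    """Hypothetical interface for a pluggable memory system."""

    def store(self, observation: str) -> None: ...
    def recall(self) -> List[str]: ...


@dataclass
class ListMemory:
    """Simplest possible memory: keeps every observation in order."""

    items: List[str] = field(default_factory=list)

    def store(self, observation: str) -> None:
        self.items.append(observation)

    def recall(self) -> List[str]:
        return list(self.items)


@dataclass
class Agent:
    name: str
    # Stand-in for an LLM call: maps (observation, recalled memories) to an action.
    policy: Callable[[str, List[str]], str]
    memory: Memory

    def act(self, observation: str) -> str:
        action = self.policy(observation, self.memory.recall())
        self.memory.store(observation)  # remember what was seen after acting
        return action


@dataclass
class Task:
    observations: List[str]                  # what the environment shows, step by step
    is_success: Callable[[List[str]], bool]  # objective check on the agent's actions


def evaluate(agent: Agent, task: Task) -> bool:
    """Run the agent through the task and report whether it succeeded."""
    actions = [agent.act(obs) for obs in task.observations]
    return task.is_success(actions)


def echo_policy(observation: str, recalled: List[str]) -> str:
    # Toy policy: repeat the observation and report how much was remembered.
    return f"{observation}|seen={len(recalled)}"


agent = Agent("tester", echo_policy, ListMemory())
task = Task(
    observations=["enter cafe", "order coffee"],
    # Succeeds only if the agent remembered one earlier observation at the last step.
    is_success=lambda actions: actions[-1].endswith("seen=1"),
)
result = evaluate(agent, task)
```

Under these assumptions, testing a different memory mechanism means passing another object with `store` and `recall` methods to `Agent`, while the task and its success check stay unchanged.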

* Submitted to the EMNLP 2023 demo track
