Title: On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation

Jul 04, 2024

John Mendonça, Alon Lavie, Isabel Trancoso

Abstract: Large Language Models (LLMs) have showcased remarkable capabilities in various Natural Language Processing tasks. For automatic open-domain dialogue evaluation in particular, LLMs have been seamlessly integrated into evaluation frameworks and, together with human evaluation, compose the backbone of most evaluations. However, existing evaluation benchmarks often rely on outdated datasets and evaluate aspects like Fluency and Relevance, which fail to adequately capture the capabilities and limitations of state-of-the-art chatbot models. This paper critically examines current evaluation benchmarks, highlighting that the use of older response generators and quality aspects fails to accurately reflect modern chatbot capabilities. A small annotation experiment on a recent LLM-generated dataset (SODA) reveals that LLM evaluators such as GPT-4 struggle to detect actual deficiencies in dialogues generated by current LLM chatbots.

* Accepted to the 6th NLP for Conversational AI workshop at ACL
