Abstract: Evaluating the output of generative large language models (LLMs) is challenging and difficult to scale. Most LLM evaluations focus on tasks such as single-choice question answering or text classification. These tasks are not suitable for assessing open-ended question-answering capabilities, which are critical in domains that require expertise, such as health, where misleading or incorrect answers can seriously affect a user's well-being. Human expert evaluation of LLM answers is generally considered the gold standard, but expert annotation is costly and slow. We present a method for evaluating LLM answers that uses ranking signals as a substitute for explicit relevance judgements. Our scoring method correlates with the preferences of human experts. We validate it by showing that it reproduces the well-established finding that the quality of generated answers improves with model size as well as with more sophisticated prompting strategies.