Title: JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment

Dec 17, 2024

Hossein A. Rahmani, Emine Yilmaz, Nick Craswell, Bhaskar Mitra

Abstract: The effective training and evaluation of retrieval systems require a substantial number of relevance judgments, which are traditionally collected from human assessors, a process that is both costly and time-consuming. Large Language Models (LLMs) have shown promise in generating relevance labels for search tasks, offering a potential alternative to manual assessments. Current approaches often rely on a single LLM, such as GPT-4, which, despite being effective, is expensive and prone to intra-model biases that can favour systems leveraging similar models. In this work, we introduce JudgeBlender, a framework that employs smaller, open-source models to provide relevance judgments by combining evaluations across multiple LLMs (LLMBlender) or multiple prompts (PromptBlender). Using the LLMJudge benchmark [18], we compare JudgeBlender with state-of-the-art methods and the top performers in the LLMJudge challenge. Our results show that JudgeBlender achieves competitive performance, demonstrating that very large models are often unnecessary for reliable relevance assessments.
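To make the idea of combining judgments concrete, here is a minimal sketch of blending relevance labels from several judges, where a judge may be a different LLM or the same LLM under a different prompt. The abstract does not say how JudgeBlender aggregates labels, so the two strategies shown (majority vote and rounded average), the 0 to 3 integer label scale, the tie-breaking rule, and every function and variable name are assumptions made for illustration, not the authors' method.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the most frequent label.

    Ties are broken toward the lower label, an arbitrary conservative
    choice made for this sketch.
    """
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


def average_vote(labels):
    """Return the mean label rounded half up to an integer grade."""
    return int(mean(labels) + 0.5)


def blend(judgments, aggregate=majority_vote):
    """Combine per-judge labels into one label per (query, passage) pair.

    judgments maps (query_id, passage_id) to a list of integer labels,
    one from each judge.
    """
    return {pair: aggregate(labels) for pair, labels in judgments.items()}


# Hypothetical labels from three judges on an assumed 0-3 relevance scale.
judgments = {
    ("q1", "p1"): [3, 2, 3],
    ("q1", "p2"): [0, 1, 0],
    ("q2", "p7"): [1, 2, 2],
}

blended_majority = blend(judgments)
blended_average = blend(judgments, aggregate=average_vote)
```

Under these assumptions the same `blend` function covers both settings named in the abstract: the label lists can come from multiple models or from multiple prompts, and only the source of the labels changes.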

* 14 pages
