Title: Can Many-Shot In-Context Learning Help Long-Context LLM Judges? See More, Judge Better!

Jun 17, 2024

Mingyang Song, Mao Zheng, Xuan Luo

Figures 1-4 for Can Many-Shot In-Context Learning Help Long-Context LLM Judges? See More, Judge Better!

Abstract: Using Large Language Models (LLMs) as judges to evaluate the performance of other LLMs has recently garnered attention. However, this approach also introduces potential biases from the judging LLM, raising concerns about the reliability of the evaluation results. To mitigate this issue, we propose and study two versions of many-shot in-context prompts, Reinforced ICL and Unsupervised ICL, to help GPT-4o-as-a-Judge in single-answer grading. Using these prompts, we investigate how scaling the number of in-context examples affects the agreement and quality of the evaluation. Furthermore, we first reveal the symbol bias in GPT-4o-as-a-Judge for pairwise comparison and then propose a simple yet effective approach to mitigate it. Experimental results show that advanced long-context LLMs, such as GPT-4o, perform better in the many-shot regime than in the zero-shot regime. The results also verify the effectiveness of the symbol bias mitigation approach.

* work in progress
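
The abstract describes scaling the number of in-context examples in a single-answer grading prompt. A minimal sketch of what such prompt assembly might look like is given below. It is an illustration only and does not reproduce the authors' prompts: the `Example` fields, the `build_judge_prompt` helper, the 1-to-10 scale, and the instruction wording are all hypothetical. The `with_labels` switch reflects one assumed reading of the two variants, in which Reinforced ICL examples carry model-generated rationales and scores while Unsupervised ICL examples carry only the question and answer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """One in-context example (hypothetical structure, not from the paper)."""
    question: str
    answer: str
    rationale: Optional[str] = None  # assumed: model-written explanation
    score: Optional[int] = None      # assumed: model-assigned grade


def build_judge_prompt(
    examples: List[Example],
    question: str,
    answer: str,
    k: int,
    with_labels: bool = True,
) -> str:
    """Assemble a grading prompt that includes at most k in-context examples.

    with_labels=True keeps each example's rationale and score (assumed
    Reinforced-style); with_labels=False drops them (assumed
    Unsupervised-style).
    """
    # Instruction wording and the 1-to-10 scale are assumptions.
    parts = ["Rate the answer to the question on a scale of 1 to 10."]
    for i, ex in enumerate(examples[:k], start=1):
        block = [
            f"### Example {i}",
            f"Question: {ex.question}",
            f"Answer: {ex.answer}",
        ]
        if with_labels and ex.rationale is not None and ex.score is not None:
            block.append(f"Rationale: {ex.rationale}")
            block.append(f"Score: {ex.score}")
        parts.append("\n".join(block))
    # The item to be graded always comes last, with the score left open.
    parts.append(
        "\n".join(
            ["### To grade", f"Question: {question}", f"Answer: {answer}", "Score:"]
        )
    )
    return "\n\n".join(parts)
```

Varying `k` from 0 (zero-shot) upward is one way to compare the zero-shot and many-shot regimes the abstract mentions; the judge model call itself is left out because the source does not specify it.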

