In an era where large language models (LLMs) are increasingly integrated into a wide range of everyday applications, research into these models' behavior has surged. However, due to the novelty of the field, clear methodological guidelines are lacking. This raises concerns about the replicability and generalizability of insights gained from research on LLM behavior. In this study, we discuss the potential risk of a replication crisis and support our concerns with a series of replication experiments focused on prompt engineering techniques purported to influence reasoning abilities in LLMs. We tested GPT-3.5, GPT-4o, Gemini 1.5 Pro, Claude 3 Opus, Llama 3-8B, and Llama 3-70B on the chain-of-thought, EmotionPrompting, ExpertPrompting, Sandbagging, and Re-Reading prompt engineering techniques, using manually double-checked subsets of reasoning benchmarks including CommonsenseQA, CRT, NumGLUE, ScienceQA, and StrategyQA. Our findings reveal a general lack of statistically significant differences across nearly all techniques tested and highlight, among other issues, several methodological weaknesses in previous research. We propose a forward-looking approach that includes developing robust methodologies for evaluating LLMs, establishing sound benchmarks, and designing rigorous experimental frameworks to ensure accurate and reliable assessments of model outputs.
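As an illustration of the kind of paired comparison described above, the following minimal Python sketch contrasts a plain prompt with a chain-of-thought prompt on a toy item and applies an exact McNemar test to the paired correct/incorrect outcomes. This is not the paper's actual evaluation harness: `query_model` (here a dummy lambda), the prompt templates, the scoring rule, and the toy item are hypothetical placeholders standing in for a real API client and a manually checked benchmark subset.

```python
# Minimal sketch (not the paper's actual harness) of a single replication cell:
# comparing a plain prompt against a chain-of-thought prompt on paired items and
# testing the accuracy difference with an exact McNemar test.
from scipy.stats import binomtest

def is_correct(response: str, gold: str) -> bool:
    """Naive containment scoring; real evaluation typically needs answer extraction."""
    return gold.strip().lower() in response.strip().lower()

def run_condition(items, template, query_model):
    """Score every (question, gold) pair under one prompt template."""
    return [is_correct(query_model(template.format(question=q)), gold) for q, gold in items]

def mcnemar_exact(baseline, treatment):
    """Exact McNemar test on paired correct/incorrect outcomes."""
    b = sum(1 for x, y in zip(baseline, treatment) if x and not y)  # only baseline correct
    c = sum(1 for x, y in zip(baseline, treatment) if not x and y)  # only treatment correct
    if b + c == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    return binomtest(min(b, c), b + c, 0.5).pvalue

if __name__ == "__main__":
    # Placeholder model and toy item; swap in a real API client and a benchmark subset.
    dummy_model = lambda prompt: "yes"
    items = [("Is the sky blue on a clear day? Answer yes or no.", "yes")]
    baseline = run_condition(items, "{question}", dummy_model)
    cot = run_condition(items, "{question}\nLet's think step by step.", dummy_model)
    print("baseline acc:", sum(baseline) / len(baseline), "CoT acc:", sum(cot) / len(cot))
    print("McNemar exact p-value:", mcnemar_exact(baseline, cot))
```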