Title: FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation

Oct 29, 2024

Farima Fatahi Bayat, Lechen Zhang, Sheza Munir, Lu Wang

Abstract: Language models (LMs) are used by a growing number of users, underscoring the challenge of maintaining factuality across a broad range of topics. We first present VERIFY (Verification and Evidence RetrIeval for FactualitY evaluation), a pipeline to evaluate LMs' factuality in real-world user interactions. VERIFY considers the verifiability of LM-generated content and categorizes content units as supported, unsupported, or undecidable based on evidence retrieved from the Web. Importantly, factuality judgments by VERIFY correlate better with human evaluations than those of existing methods. Using VERIFY, we identify "hallucination prompts" across diverse topics, i.e., those eliciting the highest rates of incorrect and inconclusive LM responses. These prompts form FactBench, a dataset of 1K prompts across 150 fine-grained topics. Our dataset captures emerging factuality challenges in real-world LM interactions and can be regularly updated with new prompts. We benchmark widely used LMs from the GPT, Gemini, and Llama3.1 families on FactBench, yielding the following key findings: (i) Proprietary models exhibit better factuality, with performance declining from Easy to Hard hallucination prompts. (ii) Llama3.1-405B-Instruct shows comparable or lower factual accuracy than Llama3.1-70B-Instruct across all evaluation methods due to its higher subjectivity, which leads to more content being labeled as undecidable. (iii) Gemini1.5-Pro shows a significantly higher refusal rate, with over-refusal in 25% of cases. Our code and data are publicly available at https://huggingface.co/spaces/launch/factbench.

* 25 pages, 10 figures
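
The abstract describes selecting "hallucination prompts" as those eliciting the highest rates of incorrect and inconclusive responses. A minimal Python sketch of that selection step follows, assuming each prompt already has a list of per-unit labels. The function names, the input layout, and the unweighted rate are illustrative assumptions, not the paper's actual scoring or API.

```python
from collections import Counter

# The three categories named in the abstract; the string values are assumed.
LABELS = ("supported", "unsupported", "undecidable")


def hallucination_rate(unit_labels):
    """Fraction of content units labeled unsupported or undecidable.

    Assumption: a plain unweighted ratio; the paper may score differently.
    """
    if not unit_labels:
        return 0.0
    counts = Counter(unit_labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return (counts["unsupported"] + counts["undecidable"]) / len(unit_labels)


def rank_prompts(labels_by_prompt, top_k):
    """Return the top_k prompt ids with the highest rate (ties broken by id)."""
    scored = {p: hallucination_rate(labels) for p, labels in labels_by_prompt.items()}
    return sorted(scored, key=lambda p: (-scored[p], p))[:top_k]


if __name__ == "__main__":
    # Hypothetical toy data: prompt id -> labels of the response's content units.
    toy = {
        "a": ["supported"],
        "b": ["unsupported"],
        "c": ["supported", "undecidable"],
    }
    print(rank_prompts(toy, top_k=2))
```

Treating undecidable units the same as unsupported ones mirrors the abstract's phrase "incorrect and inconclusive"; a real pipeline might weight the two categories differently.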
