Title: OR-Bench: An Over-Refusal Benchmark for Large Language Models

May 31, 2024

Justin Cui, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh

Figures 1-4 for OR-Bench: An Over-Refusal Benchmark for Large Language Models

Abstract: Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often comes with the side effect of over-refusal, where LLMs reject innocuous prompts and become less helpful. Although over-refusal has been empirically observed, systematic measurement is challenging because it is difficult to craft prompts that appear harmful but are benign. This study proposes a novel method for automatically generating large-scale sets of "seemingly toxic prompts" (benign prompts likely to be rejected by LLMs). Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 seemingly toxic prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 25 popular LLMs across 8 model families. Our datasets are available at https://huggingface.co/datasets/bench-llm/OR-Bench and the corresponding demo can be found at https://huggingface.co/spaces/bench-llm/or-bench. We hope this benchmark can help the community develop better safety-aligned models.
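To make the two quantities in the abstract concrete, here is a minimal sketch of how an over-refusal rate (on seemingly toxic but benign prompts) and a refusal rate on the toxic control prompts might be computed from model responses. Everything in it is an assumption for illustration: the `Record` type, the `REFUSAL_MARKERS` phrase list, and the substring-based `is_refusal` check are hypothetical and are not the paper's evaluation procedure.

```python
from dataclasses import dataclass

# Hypothetical refusal phrases (an assumption); a real evaluation would
# likely need a more robust judge than substring matching.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "i am unable")


@dataclass
class Record:
    prompt: str
    response: str
    is_toxic: bool  # True for toxic control prompts, False for seemingly toxic ones


def is_refusal(response: str) -> bool:
    """Return True if the response contains any of the assumed refusal phrases."""
    text = response.strip().lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rates(records):
    """Compute refusal rates separately for benign and toxic prompts."""
    benign = [r for r in records if not r.is_toxic]
    toxic = [r for r in records if r.is_toxic]

    def rate(group):
        # Fraction of responses in the group flagged as refusals.
        if not group:
            return 0.0
        return sum(is_refusal(r.response) for r in group) / len(group)

    return {
        "over_refusal_rate": rate(benign),   # lower is better
        "toxic_refusal_rate": rate(toxic),   # higher is better
    }
```

Reporting both numbers together reflects the role of the toxic prompts described in the abstract: a model that answers everything would score a low over-refusal rate, but it would also score a low refusal rate on the toxic set.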

* version 1
