Riley Goodside

LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet

Aug 27, 2024

Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, Summer Yue

Figures 1-4 for LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet

Abstract: Recent large language model (LLM) defenses have greatly improved models' ability to refuse harmful queries, even when adversarially attacked. However, LLM defenses are primarily evaluated against automated adversarial attacks in a single turn of conversation, an insufficient threat model for real-world malicious use. We demonstrate that multi-turn human jailbreaks uncover significant vulnerabilities, exceeding 70% attack success rate (ASR) on HarmBench against defenses that report single-digit ASRs with automated single-turn attacks. Human jailbreaks also reveal vulnerabilities in machine unlearning defenses, successfully recovering dual-use biosecurity knowledge from unlearned models. We compile these results into Multi-Turn Human Jailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn jailbreaks. We publicly release MHJ alongside a compendium of jailbreak tactics developed across dozens of commercial red teaming engagements, supporting research towards stronger LLM defenses.
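
As a rough illustration of the attack success rate (ASR) metric mentioned in the abstract, the sketch below computes ASR as the fraction of distinct behaviors for which at least one attempt succeeded. This per-behavior definition, the `Attempt` record, and its field names are assumptions made for illustration; they are not taken from the paper or from the MHJ release format.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One jailbreak attempt (hypothetical record layout, not the MHJ schema)."""
    behavior_id: str  # assumed identifier of the targeted harmful behavior
    num_turns: int    # number of user prompts in the conversation
    success: bool     # assumed judge verdict: True if the model complied


def attack_success_rate(attempts):
    """Return the share of distinct behaviors with at least one successful attempt."""
    if not attempts:
        raise ValueError("at least one attempt is required")
    all_behaviors = {a.behavior_id for a in attempts}
    broken_behaviors = {a.behavior_id for a in attempts if a.success}
    return len(broken_behaviors) / len(all_behaviors)


# Toy data: four behaviors, three of which are broken by some multi-turn attempt.
toy_attempts = [
    Attempt("b1", 3, False),
    Attempt("b1", 6, True),
    Attempt("b2", 5, True),
    Attempt("b3", 4, False),
    Attempt("b4", 7, True),
]
toy_asr = attack_success_rate(toy_attempts)
```

On this toy data the function returns 0.75. Separately, the dataset totals quoted in the abstract (2,912 prompts across 537 jailbreaks) work out to roughly 5.4 prompts per jailbreak on average.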
