Title: LIAR: Leveraging Alignment (Best-of-N) to Jailbreak LLMs in Seconds

Dec 06, 2024

James Beetham, Souradip Chakraborty, Mengdi Wang, Furong Huang, Amrit Singh Bedi, Mubarak Shah

Abstract: Many existing jailbreak techniques rely on solving discrete combinatorial optimization, while more recent approaches involve training LLMs to generate multiple adversarial prompts. However, both approaches require significant computational resources to produce even a single adversarial prompt. We hypothesize that the inefficiency of current approaches stems from an inadequate characterization of the jailbreak problem. To address this gap, we formulate the jailbreak problem in terms of alignment. By starting from an available safety-aligned model, we leverage an unsafe reward to guide the safe model towards generating unsafe outputs using alignment techniques (e.g., reinforcement learning from human feedback), effectively performing jailbreaking via alignment. We propose a novel jailbreak method called LIAR (LeveragIng Alignment to jailbReak). To demonstrate the simplicity and effectiveness of our approach, we employ a best-of-N method to solve the alignment problem. LIAR offers significant advantages: lower computational requirements without additional training, fully black-box operation, competitive attack success rates, and more human-readable prompts. We provide theoretical insights into the possibility of jailbreaking a safety-aligned model, revealing inherent vulnerabilities in current alignment strategies for LLMs. We also provide sub-optimality guarantees for the proposed LIAR method. Experimentally, we achieve ASR comparable to the SoTA with a 10x improvement in perplexity and a Time-to-Attack measured in seconds rather than tens of hours.
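For readers unfamiliar with best-of-N, the generic selection step the abstract refers to can be sketched as follows: draw N candidates from a sampler, score each with a reward function, and keep the highest-scoring one. This is only an illustrative sketch of generic best-of-N selection, not the authors' implementation; `best_of_n`, `toy_sample`, `toy_reward`, and the word list are hypothetical stand-ins chosen for the example, and no language model is involved.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    sample: Callable[[random.Random], str],
    reward: Callable[[str], float],
    n: int,
    seed: int = 0,
) -> Tuple[str, float]:
    """Draw n candidates from `sample` and return the highest-reward one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    # No training happens here: the sampler is only queried, never updated.
    candidates: List[str] = [sample(rng) for _ in range(n)]
    scored = [(reward(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Hypothetical toy stand-ins (assumptions for illustration only).
WORDS = ["alpha", "beta", "gamma", "delta"]


def toy_sample(rng: random.Random) -> str:
    """Return three random words joined by spaces."""
    return " ".join(rng.choice(WORDS) for _ in range(3))


def toy_reward(text: str) -> float:
    """Score a string by how many times the letter 'a' appears."""
    return float(text.count("a"))


if __name__ == "__main__":
    text, score = best_of_n(toy_sample, toy_reward, n=8)
    print(text, score)
```

Because the candidates are independent draws, the cost of this step scales linearly with N and the loop parallelizes trivially; with a fixed seed, the selected reward can only stay the same or increase as N grows.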
