Title: Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models

Dec 22, 2024

Lang Gao, Xiangliang Zhang, Preslav Nakov, Xiuying Chen

Abstract: Jailbreaking in Large Language Models (LLMs) is a major security concern, as it can deceive LLMs into generating harmful text. Yet there is still insufficient understanding of how jailbreaking works, which makes it hard to develop effective defense strategies. We aim to shed more light on this issue: we conduct a detailed large-scale analysis of seven different jailbreak methods and find that disagreements about how jailbreaking works stem from insufficient observation samples. In particular, we introduce the notion of a safety boundary, and we find that jailbreaks shift harmful activations outside that safety boundary, where LLMs are less sensitive to harmful information. We also find that the low and middle layers are critical in such shifts, while deeper layers have less impact. Leveraging these insights, we propose a novel defense called Activation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary. We further use Bayesian optimization to selectively apply the defense to the low and middle layers. Our experiments on several benchmarks show that ABD achieves an average defense success rate (DSR) of over 98% against various forms of jailbreak attacks, with less than 2% impact on the model's general capabilities.

* 17 pages, 9 figures
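
The idea of constraining activations within a safety boundary can be illustrated with a toy sketch. This is not the authors' ABD implementation: the abstract does not specify how the boundary is defined or how the constraint is applied, so the example below assumes a simple spherical boundary (a centroid and radius estimated from reference activations) and a hard projection back onto it. The function names `fit_boundary` and `constrain` and the `quantile` parameter are hypothetical, and the Bayesian layer selection is not covered.

```python
import numpy as np


def fit_boundary(reference_acts, quantile=0.95):
    """Estimate a toy boundary (assumption: a ball around the centroid).

    reference_acts: array of shape (n_samples, hidden_dim) holding
    activations from one layer for a set of reference prompts.
    Returns the centroid and a radius covering `quantile` of the samples.
    """
    center = reference_acts.mean(axis=0)
    dists = np.linalg.norm(reference_acts - center, axis=1)
    radius = float(np.quantile(dists, quantile))
    return center, radius


def constrain(acts, center, radius):
    """Project activations that fall outside the ball back onto its surface.

    Activations already inside the ball are returned unchanged.
    """
    offsets = acts - center
    dists = np.linalg.norm(offsets, axis=1, keepdims=True)
    # Scale is 1 inside the ball and radius / distance outside it.
    scale = np.minimum(1.0, radius / np.maximum(dists, 1e-12))
    return center + offsets * scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(200, 8))      # stand-in for real activations
    center, radius = fit_boundary(reference)
    shifted = reference[:5] + 10.0             # stand-in for a jailbreak shift
    pulled_back = constrain(shifted, center, radius)
    print(np.linalg.norm(pulled_back - center, axis=1) <= radius + 1e-9)
```

In a real model, a function like `constrain` would presumably be applied to hidden states at selected layers during the forward pass; which layers to use and how strongly to constrain them are exactly the choices the abstract says are tuned with Bayesian optimization.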
