Title: Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!

Feb 21, 2024

Zhanhui Zhou, Jie Liu, Zhichen Dong, Jiaheng Liu, Chao Yang, Wanli Ouyang, Yu Qiao

Abstract: Large language models (LLMs) need to undergo safety alignment to ensure safe conversations with humans. However, in this work, we introduce an inference-time attack framework, demonstrating that safety alignment can also unintentionally facilitate harmful outcomes under adversarial manipulation. This framework, named Emulated Disalignment (ED), adversely combines a pair of open-source pre-trained and safety-aligned language models in the output space to produce a harmful language model without additional training. Our experiments with ED across three datasets and four model families (Llama-1, Llama-2, Mistral, and Alpaca) show that ED doubles the harmfulness of pre-trained models and outperforms strong baselines, achieving the highest harmful rate in 43 out of 48 evaluation subsets by a large margin. Crucially, our findings highlight the importance of reevaluating the practice of open-sourcing language models even after safety alignment.

* Project web page: https://zhziszz.github.io/emulated-disalignment
