Meitan Wang

Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting

Oct 14, 2024

Yifan Luo, Zhennan Zhou, Meitan Wang, Bin Dong

Abstract: In this paper, we investigate the safety mechanisms of instruction fine-tuned large language models (LLMs). We discover that re-weighting MLP neurons can significantly compromise a model's safety, especially for MLPs in end-of-sentence inferences. We hypothesize that LLMs evaluate the harmfulness of prompts during end-of-sentence inferences, and that MLP layers play a critical role in this process. Based on this hypothesis, we develop two novel white-box jailbreak methods: a prompt-specific method and a prompt-general method. The prompt-specific method targets individual prompts and optimizes the attack on the fly, while the prompt-general method is pre-trained offline and can generalize to unseen harmful prompts. Our methods demonstrate robust performance across seven popular open-source LLMs, ranging in size from 2B to 72B parameters. Furthermore, our study provides insights into the safety vulnerabilities of instruction-tuned LLMs and deepens the understanding of the internal mechanisms of LLMs.
