Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models

Jul 17, 2023

Huachuan Qiu, Shuai Zhang, Anqi Li, Hongliang He, Zhenzhong Lan

Figure 1 for Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models

Figure 2 for Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models

Figure 3 for Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models

Figure 4 for Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models

Share this with someone who'll enjoy it:

Abstract:Researchers have invested considerable effort into ensuring that large language models (LLMs) align with human values, using various training techniques, such as instruction tuning and Reinforcement Learning from Human or AI Feedback (RLHF/RLAIF), to guard against text unsafety. However, these defenses remain incredibly vulnerable to some jailbreak attacks, which can cause the model to become overly defensive to sensitive topics or still generate harmful content, leaving the model performance particularly fragile. Therefore, to comprehensively study text safety and output robustness, we propose a latent jailbreak prompt dataset, each involving malicious instruction embedding. Specifically, we instruct the model to complete a regular task, such as translation, where the text to be translated contains malicious instructions. To further analyze the safety and robustness, we design a hierarchical annotation framework. We present a systematic analysis of the safety and robustness of LLMs concerning the position of explicit normal instructions, word replacement (verbs in explicit normal instructions, target groups in malicious instructions, cue words in malicious instructions), and instruction replacement (different explicit normal instructions). Our results show that current LLMs not only have a preference for certain instruction verbs, but also exhibit different jailbreak rates for different instruction verbs in explicit normal instructions. In other words, the probability of generating unsafe content by the model will be reinforced to varying degrees depending on the instruction verb in explicit normal instructions. Code and data are available at https://github.com/qiuhuachuan/latent-jailbreak.

* Code and data are available at https://github.com/qiuhuachuan/latent-jailbreak

View paper on

Share this with someone who'll enjoy it:

Title:Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models

Paper and Code