Title: SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack

Jul 02, 2024

Yan Yang, Zeguan Xiao, Xin Lu, Hongru Wang, Hailiang Huang, Guanhua Chen, Yun Chen

Abstract: The widespread application of large language models (LLMs) has raised concerns about their potential misuse. Although aligned with human preference data before release, LLMs remain vulnerable to various malicious attacks. In this paper, we adopt a red-teaming strategy to enhance LLM safety and introduce SoP, a simple yet effective framework for designing jailbreak prompts automatically. Inspired by the concept of social facilitation, SoP generates and optimizes multiple jailbreak characters to bypass the guardrails of the target LLM. Unlike previous work, which relies on proprietary LLMs or seed jailbreak templates crafted by human experts, SoP can generate and optimize jailbreak prompts in a cold-start scenario using open-source LLMs, without any seed jailbreak templates. Experimental results show that SoP achieves attack success rates of 88% and 60% in bypassing the safety alignment of GPT-3.5-1106 and GPT-4, respectively. Furthermore, we extensively evaluate the transferability of the generated templates across different LLMs and held-out malicious requests, and we explore defense strategies against the jailbreak attacks designed by SoP. Code is available at https://github.com/Yang-Yan-Yang-Yan/SoP.
