Title: COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability

Feb 13, 2024

Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, Bin Hu

Figures 1-4 for COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability

Abstract: Jailbreaks on large language models (LLMs) have recently received increasing attention. For a comprehensive assessment of LLM safety, it is essential to consider jailbreaks with diverse attributes, such as contextual coherence and sentiment/stylistic variations, and hence it is beneficial to study controllable jailbreaking, i.e., how to enforce control on LLM attacks. In this paper, we formalize the controllable attack generation problem and build a novel connection between this problem and controllable text generation, a well-explored topic in natural language processing. Based on this connection, we adapt Energy-based Constrained Decoding with Langevin Dynamics (COLD), a state-of-the-art, highly efficient algorithm for controllable text generation, and introduce the COLD-Attack framework, which unifies and automates the search for adversarial LLM attacks under a variety of control requirements such as fluency, stealthiness, sentiment, and left-right-coherence. The controllability enabled by COLD-Attack leads to diverse new jailbreak scenarios, which not only cover the standard setting of generating fluent suffix attacks but also allow us to address new controllable attack settings, such as revising a user query adversarially with minimal paraphrasing and inserting stealthy attacks in context with left-right-coherence. Our extensive experiments on various LLMs (Llama-2, Mistral, Vicuna, Guanaco, GPT-3.5) show COLD-Attack's broad applicability, strong controllability, high success rate, and attack transferability. Our code is available at https://github.com/Yu-Fangxu/COLD-Attack.
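
For readers unfamiliar with the sampling idea behind energy-based constrained decoding, the following is a minimal, generic sketch of Langevin dynamics on a toy energy. It is not the COLD-Attack implementation and involves no language model: the quadratic energy terms, the weights, the noise schedule, and the function names `energy_grad` and `langevin_sample` are all assumptions chosen for illustration. It only shows how several weighted constraint energies can be combined and then minimized with noisy gradient steps whose noise is annealed toward zero.

```python
import random


def energy_grad(y, targets, weights):
    """Gradient of the toy energy E(y) = sum_k w_k * 0.5 * ||y - t_k||^2.

    Each (target, weight) pair stands in for one soft constraint.
    These quadratic terms are illustrative assumptions only.
    """
    grad = [0.0] * len(y)
    for target, weight in zip(targets, weights):
        for i in range(len(y)):
            grad[i] += weight * (y[i] - target[i])
    return grad


def langevin_sample(targets, weights, steps=500, step_size=0.1,
                    noise0=1.0, seed=0):
    """Run noisy gradient descent on the combined energy.

    Update rule: y <- y - step_size * grad E(y) + noise,
    where the noise scale decays to (almost) zero over the run.
    The quadratic decay schedule is a hypothetical choice.
    """
    rng = random.Random(seed)
    dim = len(targets[0])
    y = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for n in range(steps):
        noise_scale = noise0 * (1.0 - n / steps) ** 2
        grad = energy_grad(y, targets, weights)
        y = [yi - step_size * gi + noise_scale * rng.gauss(0.0, 1.0)
             for yi, gi in zip(y, grad)]
    return y


if __name__ == "__main__":
    # Two toy constraints pulling toward [0, 0] (weight 1) and [4, 4] (weight 3).
    result = langevin_sample(targets=[[0.0, 0.0], [4.0, 4.0]],
                             weights=[1.0, 3.0])
    print(result)
```

For this toy energy the minimizer is the weighted mean of the targets, so the run should settle near [3, 3]; changing the weights shifts the balance between the constraints, which is the role the weighting plays in a combined energy.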
