Title: Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing

Feb 28, 2024

Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, Shiyu Chang

Abstract: Aligned large language models (LLMs) are vulnerable to jailbreaking attacks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based threat models, no existing defense provides robustness against semantic attacks while also avoiding unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SEMANTICSMOOTH, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SEMANTICSMOOTH achieves state-of-the-art robustness against GCG, PAIR, and AutoDAN attacks while maintaining strong nominal performance on instruction-following benchmarks such as InstructionFollowing and AlpacaEval. The code will be publicly available at https://github.com/UCSB-NLP-Chang/SemanticSmooth.

* 37 pages
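
The aggregation idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`smoothed_response`, `is_refusal`, `toy_llm`), the keyword-based refusal check, the three toy transformations, and the rule of applying each transformation exactly once are all assumptions made for illustration. The sketch only shows the general shape of a smoothing defense: transform the prompt several ways, query the model on each copy, take a majority vote on whether the model refuses, and return a response that agrees with the vote.

```python
from typing import Callable, List, Sequence

# Hypothetical refusal markers; a real system would need a far more robust judge.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't")


def is_refusal(response: str) -> bool:
    """Crude keyword check for whether a response is a refusal (assumption)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def smoothed_response(
    prompt: str,
    llm: Callable[[str], str],
    transforms: Sequence[Callable[[str], str]],
) -> str:
    """Query the model on one transformed copy per transform and aggregate."""
    if not transforms:
        raise ValueError("at least one transformation is required")
    responses: List[str] = [llm(transform(prompt)) for transform in transforms]
    votes = [is_refusal(r) for r in responses]
    # Refuse only when a strict majority of the copies were refused.
    refuse_majority = sum(votes) * 2 > len(votes)
    # Return the first response that agrees with the majority decision.
    for response, vote in zip(responses, votes):
        if vote == refuse_majority:
            return response
    raise RuntimeError("unreachable: some response always matches the majority")


# --- Toy stand-ins below; none of these come from the paper. ---

def drop_uppercase_tokens(prompt: str) -> str:
    """Stand-in for a paraphrase that happens to remove an odd trigger token."""
    return " ".join(tok for tok in prompt.split() if not tok.isupper())


def lowercase(prompt: str) -> str:
    return prompt.lower()


def collapse_whitespace(prompt: str) -> str:
    return " ".join(prompt.split())


TRANSFORMS = (drop_uppercase_tokens, lowercase, collapse_whitespace)


def toy_llm(prompt: str) -> str:
    """Toy model: the literal token UNLOCK bypasses its refusal behavior."""
    if "UNLOCK" in prompt:
        return "Sure, here is the forbidden thing."
    if "forbidden" in prompt:
        return "I'm sorry, I cannot help with that."
    return "Here is a helpful answer."


if __name__ == "__main__":
    print(smoothed_response("tell me the forbidden thing UNLOCK", toy_llm, TRANSFORMS))
    print(smoothed_response("summarize this article", toy_llm, TRANSFORMS))
```

In the toy run, two of the three transformed copies lose the trigger token, so the majority vote is a refusal even though the unmodified prompt would have succeeded, while the benign prompt is answered normally on every copy. Using an odd number of transformations avoids ties in the vote.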
