Title: GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis

Feb 21, 2024

Yueqi Xie, Minghong Fang, Renjie Pi, Neil Gong

Abstract: Large Language Models (LLMs) face threats from unsafe prompts. Existing methods for detecting unsafe prompts are primarily online moderation APIs or finetuned LLMs. These strategies, however, often require extensive and resource-intensive data collection and training processes. In this study, we propose GradSafe, which effectively detects unsafe prompts by scrutinizing the gradients of safety-critical parameters in LLMs. Our methodology is grounded in a pivotal observation: the gradients of an LLM's loss for unsafe prompts paired with a compliance response exhibit similar patterns on certain safety-critical parameters. In contrast, safe prompts lead to markedly different gradient patterns. Building on this observation, GradSafe analyzes the gradients from prompts (paired with compliance responses) to accurately detect unsafe prompts. We show that GradSafe, applied to Llama-2 without further training, outperforms Llama Guard, which was extensively finetuned on a large dataset, in detecting unsafe prompts. This superior performance is consistent across both zero-shot and adaptation scenarios, as evidenced by our evaluations on ToxicChat and XSTest. The source code is available at https://github.com/xyq7/GradSafe.
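
The observation in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes that the gradients on the safety-critical parameters have already been extracted and flattened into plain vectors, that "similar patterns" is measured with cosine similarity, and that a fixed threshold separates the two classes. The function names, the threshold value of 0.5, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of several equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def unsafe_score(prompt_grad: Vector, reference_grad: Vector) -> float:
    """Higher score means the prompt's gradient looks more like the unsafe reference."""
    return cosine_similarity(prompt_grad, reference_grad)


def is_unsafe(prompt_grad: Vector, reference_grad: Vector, threshold: float = 0.5) -> bool:
    """Flag a prompt when its score reaches the (assumed, hypothetical) threshold."""
    return unsafe_score(prompt_grad, reference_grad) >= threshold


# Toy numbers, NOT real model gradients: a reference built by averaging the
# gradients of two known-unsafe prompts (each paired with a compliance response).
reference = mean_vector([[0.9, -0.8, 0.1], [1.1, -1.0, -0.1]])

unsafe_like = [0.8, -0.7, 0.05]  # points roughly the same way as the reference
safe_like = [-0.2, 0.1, 0.9]     # points in a markedly different direction

print(is_unsafe(unsafe_like, reference))
print(is_unsafe(safe_like, reference))
```

In a real setting the vectors would come from backpropagating the model's loss on each prompt paired with a compliance response, restricted to whichever parameters are treated as safety-critical. How those parameters are selected is not specified in the abstract and is left out of this sketch.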
