Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models

Oct 30, 2024

Hao Li, Xiaogeng Liu, Chaowei Xiao

Figure 1 for InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models

Figure 2 for InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models

Figure 3 for InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models

Figure 4 for InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models

Share this with someone who'll enjoy it:

Abstract:Prompt injection attacks pose a critical threat to large language models (LLMs), enabling goal hijacking and data leakage. Prompt guard models, though effective in defense, suffer from over-defense -- falsely flagging benign inputs as malicious due to trigger word bias. To address this issue, we introduce NotInject, an evaluation dataset that systematically measures over-defense across various prompt guard models. NotInject contains 339 benign samples enriched with trigger words common in prompt injection attacks, enabling fine-grained evaluation. Our results show that state-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). To mitigate this, we propose InjecGuard, a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free (MOF), which significantly reduces the bias on trigger words. InjecGuard demonstrates state-of-the-art performance on diverse benchmarks including NotInject, surpassing the existing best model by 30.8%, offering a robust and open-source solution for detecting prompt injection attacks. The code and datasets are released at https://github.com/SaFoLab-WISC/InjecGuard.

View paper on

Share this with someone who'll enjoy it:

Title:InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models

Paper and Code