Title: Robust LLM safeguarding via refusal feature adversarial training

Sep 30, 2024

Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda

Abstract: Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. Defending against such attacks remains challenging due to the opacity of jailbreaking mechanisms and the high computational cost of training LLMs robustly. We demonstrate that adversarial attacks share a universal mechanism for circumventing LLM safeguards that works by ablating a dimension in the residual stream embedding space called the refusal feature. We further show that the operation of refusal feature ablation (RFA) approximates the worst-case perturbation for offsetting model safety. Based on these findings, we propose Refusal Feature Adversarial Training (ReFAT), a novel algorithm that efficiently performs LLM adversarial training by simulating the effect of input-level attacks via RFA. Experimental results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks, with considerably less computational overhead compared to existing adversarial training methods.
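
To make the idea of ablating a direction in the residual stream concrete, here is a minimal sketch in plain Python. The abstract does not say how the refusal feature is computed or how the ablation is applied, so two assumptions are made here: the feature is taken to be the normalized difference between mean activations on harmful and harmless prompts, and ablation is taken to be projecting that direction out of a hidden state. The function names and toy vectors are hypothetical and are not taken from the paper.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def refusal_direction(harmful_acts, harmless_acts):
    """ASSUMPTION: refusal feature = unit-normalized difference of mean activations."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("harmful and harmless means coincide; no direction to ablate")
    return [d / norm for d in diff]

def ablate(hidden, direction):
    """ASSUMPTION: ablation = removing the component of `hidden` along `direction`."""
    coeff = sum(h * r for h, r in zip(hidden, direction))
    return [h - coeff * r for h, r in zip(hidden, direction)]

# Toy 3-dimensional "residual stream" activations (made up for illustration).
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r = refusal_direction(harmful, harmless)
h = [5.0, 2.0, -1.0]
h_ablated = ablate(h, r)

# After ablation the hidden state has no component along the refusal direction.
residual = sum(a * b for a, b in zip(h_ablated, r))
```

In an adversarial training loop of the kind the abstract describes, an operation like `ablate` would be applied to the model's hidden states during fine-tuning on harmful prompts, so that the model learns to refuse even when the feature is removed. How ReFAT schedules or applies this is not stated in the abstract.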
