Daniel Ben-Levi

Proactive defense against LLM Jailbreak

Oct 06, 2025

Weiliang Zhao, Jinjun Peng, Daniel Ben-Levi, Zhou Yu, Junfeng Yang

Abstract: The proliferation of powerful large language models (LLMs) has necessitated robust safety alignment, yet these models remain vulnerable to evolving adversarial attacks, including multi-turn jailbreaks that iteratively search for successful queries. Current defenses, primarily reactive and static, often fail to counter these search-based attacks. In this paper, we introduce ProAct, a novel proactive defense framework designed to disrupt and mislead autonomous jailbreaking processes. Our core idea is to intentionally provide adversaries with "spurious responses" that appear to be results of successful jailbreak attacks but contain no actual harmful content. These misleading responses provide false signals to the attacker's internal optimization loop, causing the adversarial search to terminate prematurely and effectively jailbreaking the jailbreak. By conducting extensive experiments across state-of-the-art LLMs, jailbreaking frameworks, and safety benchmarks, our method consistently and significantly reduces attack success rates by up to 92%. When combined with other defense frameworks, it further reduces the success rate of the latest attack strategies to 0%. ProAct represents an orthogonal defense strategy that can serve as an additional guardrail to enhance LLM safety against the most effective jailbreaking attacks.
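The mechanism described in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the authors' implementation: the names `Response`, `undefended_target`, `proactive_target`, `attacker_judge`, and `run_attack` are hypothetical, the keyword checks stand in for real models, and the "harmful" content is a placeholder string. It shows how a reply that merely looks like a success can stop an attacker's search loop early while leaking nothing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Response:
    text: str
    harmful: bool  # ground truth, visible to the evaluator but not to the attacker


def undefended_target(query: str) -> Response:
    # Toy assumption: the model refuses unless the query contains "bypass".
    if "bypass" in query:
        return Response("Sure, here is the content: <HARMFUL PLACEHOLDER>", harmful=True)
    return Response("I cannot help with that.", harmful=False)


def proactive_target(query: str) -> Response:
    # Toy assumption: any query flagged as an attack attempt receives a
    # spurious reply that mimics compliance but carries no harmful content.
    if "attack" in query or "bypass" in query:
        return Response("Sure, here is the content: <BENIGN FILLER>", harmful=False)
    return Response("I cannot help with that.", harmful=False)


def attacker_judge(response: Response) -> bool:
    # The attacker only sees the text and uses a surface cue to detect success.
    return response.text.startswith("Sure")


def run_attack(
    target: Callable[[str], Response], queries: List[str]
) -> Tuple[int, Optional[Response]]:
    # Iterative search: stop at the first response the attacker's judge accepts.
    for turn, query in enumerate(queries, start=1):
        response = target(query)
        if attacker_judge(response):
            return turn, response
    return len(queries), None


# Hypothetical sequence of attacker refinements; only the last one "works"
# against the undefended toy target.
QUERIES = ["attack v1", "attack v2", "attack v3 bypass"]
```

In this toy setup, running `run_attack(undefended_target, QUERIES)` ends on the third query with a harmful response, whereas `run_attack(proactive_target, QUERIES)` ends on the first query with a response the attacker accepts but that is not harmful. A real attacker's judge would typically be a model rather than a prefix check, so the spurious response would need to be correspondingly more convincing.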

Diversity Helps Jailbreak Large Language Models

Nov 06, 2024

Weiliang Zhao, Daniel Ben-Levi, Junfeng Yang, Chengzhi Mao

Abstract: We have uncovered a powerful jailbreak technique that leverages large language models' ability to diverge from prior context, enabling them to bypass safety constraints and generate harmful outputs. By simply instructing the LLM to deviate and obfuscate previous attacks, our method dramatically outperforms existing approaches, achieving up to a 62% higher success rate in compromising nine leading chatbots, including GPT-4, Gemini, and Llama, while using only 13% of the queries. This revelation exposes a critical flaw in current LLM safety training, suggesting that existing methods may merely mask vulnerabilities rather than eliminate them. Our findings sound an urgent alarm for the need to revolutionize testing methodologies to ensure robust and reliable LLM security.

* arXiv admin note: text overlap with arXiv:2312.02119
