Weiliang Zhao

Diversity Helps Jailbreak Large Language Models

Nov 06, 2024

Weiliang Zhao, Daniel Ben-Levi, Junfeng Yang, Chengzhi Mao

Abstract: We have uncovered a powerful jailbreak technique that leverages large language models' ability to diverge from prior context, enabling them to bypass safety constraints and generate harmful outputs. By simply instructing the LLM to deviate and obfuscate previous attacks, our method dramatically outperforms existing approaches, achieving up to a 62% higher success rate in compromising nine leading chatbots, including GPT-4, Gemini, and Llama, while using only 13% of the queries. This revelation exposes a critical flaw in current LLM safety training, suggesting that existing methods may merely mask vulnerabilities rather than eliminate them. Our findings sound an urgent alarm for the need to revolutionize testing methodologies to ensure robust and reliable LLM security.

* arXiv admin note: text overlap with arXiv:2312.02119

Learning to Rewrite: Generalized LLM-Generated Text Detection

Aug 08, 2024

Wei Hao, Ran Li, Weiliang Zhao, Junfeng Yang, Chengzhi Mao

Abstract: Large language models (LLMs) can be abused at scale to create non-factual content and spread disinformation. Detecting LLM-generated content is essential to mitigate these risks, but current classifiers often fail to generalize in open-world contexts. Prior work shows that LLMs tend to rewrite LLM-generated content less frequently, which can be used for detection and naturally generalizes to unforeseen data. However, we find that the rewriting edit distance between human and LLM content can be indistinguishable across domains, leading to detection failures. We propose training an LLM to rewrite input text, producing minimal edits for LLM-generated content and more edits for human-written text, deriving a distinguishable and generalizable edit distance difference across different domains. Experiments on text from 21 independent domains and three popular LLMs (e.g., GPT-4o, Gemini, and Llama-3) show that our classifier outperforms the state-of-the-art zero-shot classifier by up to 20.6% on AUROC score and the rewriting classifier by 9.2% on F1 score. Our work suggests that LLMs can effectively detect machine-generated text if they are trained properly.
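
The rewriting-based detection idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `rewrite` callable stands in for whatever rewriting LLM is used, and the function names, the character-level Levenshtein distance, the length normalization, and the 0.2 threshold are all assumptions chosen for illustration. The sketch only shows the decision rule the abstract describes, where a small edit distance between a text and its rewrite points toward LLM-generated input.

```python
from typing import Callable


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance using a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # drop ch_a
                               current[j - 1] + 1,       # drop ch_b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def rewrite_score(text: str, rewrite: Callable[[str], str]) -> float:
    """Edit distance between text and its rewrite, normalized to [0, 1].

    `rewrite` is a hypothetical stand-in for a call to a rewriting model.
    """
    rewritten = rewrite(text)
    longest = max(len(text), len(rewritten))
    if longest == 0:
        return 0.0
    return levenshtein(text, rewritten) / longest


def looks_llm_generated(text: str,
                        rewrite: Callable[[str], str],
                        threshold: float = 0.2) -> bool:
    """Flag text as LLM-generated when the rewriter changes little of it.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return rewrite_score(text, rewrite) < threshold
```

In practice the threshold would be calibrated on labeled data. The abstract's point is that an off-the-shelf rewriter may not separate the two classes across domains, which is why the authors train the rewriter itself.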
