Minh Hieu Le

Exploring the Adversarial Capabilities of Large Language Models

Feb 15, 2024

Lukas Struppek, Minh Hieu Le, Dominik Hintersdorf, Kristian Kersting

Abstract: The proliferation of large language models (LLMs) has sparked widespread interest due to their strong language generation capabilities, offering great potential for both industry and research. While previous research has delved into the security and privacy issues of LLMs, the extent to which these models can exhibit adversarial behavior remains largely unexplored. Addressing this gap, we investigate whether common publicly available LLMs have inherent capabilities to perturb text samples to fool safety measures, so-called adversarial examples or attacks. More specifically, we investigate whether LLMs are inherently able to craft adversarial examples out of benign samples to fool existing safety guardrails. Our experiments, which focus on hate speech detection, reveal that LLMs succeed in finding adversarial perturbations, effectively undermining hate speech detection systems. Our findings carry significant implications for (semi-)autonomous systems relying on LLMs, highlighting potential challenges in their interaction with existing systems and safety measures.
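
To make the evaluation idea in the abstract concrete, here is a minimal sketch of how an evasion rate against a detector could be measured. It is not the authors' code or protocol: `toy_detector`, `toy_perturb`, `attack_success_rate`, the 0.5 threshold, and the placeholder token are all hypothetical stand-ins. In the paper's setting the perturbation step would be performed by an LLM and the detector would be a real hate speech classifier.

```python
from typing import Callable, Iterable


def toy_detector(text: str) -> float:
    """Hypothetical detector: score 1.0 if a blocked token appears verbatim."""
    blocked = {"blockedword"}  # neutral placeholder, not a real lexicon
    return 1.0 if any(tok in blocked for tok in text.lower().split()) else 0.0


def toy_perturb(text: str) -> str:
    """Hypothetical stand-in for the LLM step: a simple character substitution."""
    return text.replace("o", "0")


def attack_success_rate(
    samples: Iterable[str],
    detector: Callable[[str], float],
    perturb: Callable[[str], str],
    threshold: float = 0.5,  # assumed decision threshold
) -> float:
    """Fraction of originally flagged samples that evade the detector after perturbation."""
    flagged = [s for s in samples if detector(s) >= threshold]
    if not flagged:
        return 0.0
    evaded = sum(detector(perturb(s)) < threshold for s in flagged)
    return evaded / len(flagged)


samples = ["this contains blockedword here", "harmless text", "blockedword again"]
rate = attack_success_rate(samples, toy_detector, toy_perturb)
```

Only samples the detector flags before perturbation count toward the denominator, so the rate reflects detections that were lost rather than samples that were never flagged.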
