Text-to-image generative models such as Stable Diffusion and DALL$\cdot$E 2 have attracted much attention since their publication due to their wide range of real-world applications. One challenging problem with text-to-image generative models is the generation of Not-Safe-for-Work (NSFW) content, e.g., content related to violence or adult material. Therefore, a common practice is to deploy a so-called safety filter, which blocks NSFW content based on either text or image features. Prior work has studied possible bypasses of such safety filters. However, existing attacks are largely manual and specific to Stable Diffusion's official safety filter; moreover, in our evaluation they bypass Stable Diffusion's safety filter at a rate of only 23.51%. In this paper, we propose SneakyPrompt, the first automated attack framework for evaluating the robustness of real-world safety filters in state-of-the-art text-to-image generative models. Our key insight is to search for alternative tokens in a prompt intended to generate NSFW images, so that the resulting prompt (called an adversarial prompt) bypasses existing safety filters. Specifically, SneakyPrompt utilizes reinforcement learning (RL) to guide the search, rewarding the agent for both semantic similarity to the original prompt and success in bypassing the filter. Our evaluation shows that SneakyPrompt successfully generates NSFW content on DALL$\cdot$E 2, an online model with its default, closed-box safety filter enabled. In addition, we deploy several state-of-the-art open-source safety filters on a Stable Diffusion model and show that SneakyPrompt not only successfully generates NSFW content but also outperforms existing adversarial attacks in terms of both the number of queries and image quality.
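To make the one-sentence description of the RL-guided search more concrete, the following Python sketch illustrates the general idea under stated assumptions: it is not the authors' implementation. It samples candidate replacement tokens from a softmax policy, queries a stubbed text-to-image model, and rewards the agent for semantic similarity and for bypassing the safety filter. The candidate vocabulary, the stub functions `semantic_similarity` and `query_model`, the reward weight, and the simplified bandit-style update rule are all hypothetical placeholders for illustration.

```python
"""Minimal, hypothetical sketch of an RL-style search for adversarial prompts.
All names, candidates, and reward weights below are illustrative assumptions,
not the paper's actual implementation."""
import math
import random

# Hypothetical candidate replacement tokens for one sensitive token.
CANDIDATES = ["tokenA", "tokenB", "tokenC", "tokenD"]

# Softmax policy over candidate tokens, updated during the search.
logits = {t: 0.0 for t in CANDIDATES}


def sample_token():
    """Sample a candidate token from the softmax policy."""
    z = sum(math.exp(v) for v in logits.values())
    r, acc = random.random(), 0.0
    for t, v in logits.items():
        acc += math.exp(v) / z
        if r <= acc:
            return t
    return CANDIDATES[-1]


def semantic_similarity(original_prompt, adversarial_prompt):
    """Placeholder for a text-encoder similarity score in [0, 1]
    (e.g., cosine similarity of text embeddings); random stand-in here."""
    return random.random()


def query_model(adversarial_prompt):
    """Placeholder for querying the text-to-image model; returns True if the
    safety filter was bypassed (i.e., an image was returned)."""
    return random.random() > 0.8


def search(original_prompt, sensitive_token, steps=50, lr=0.5, bypass_bonus=1.0):
    """Search for a replacement token that preserves semantics and bypasses the filter."""
    for _ in range(steps):
        token = sample_token()
        adv_prompt = original_prompt.replace(sensitive_token, token)
        reward = semantic_similarity(original_prompt, adv_prompt)
        if query_model(adv_prompt):
            reward += bypass_bonus  # extra reward for bypassing the filter
        # Simplified bandit-style update (not the paper's exact RL algorithm):
        # raise the sampled token's logit in proportion to its reward.
        logits[token] += lr * reward
    return max(logits, key=logits.get)


if __name__ == "__main__":
    best = search("a photo of a <sensitive> scene", "<sensitive>")
    print("best replacement token:", best)
```

In this sketch the policy is a flat distribution over a tiny fixed vocabulary; the actual attack would operate over a much larger token space and use the target model's real responses rather than random stubs.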