Abstract: Text-to-image diffusion models have been shown to produce unsafe content, such as violent, sexual, and shocking images, due to unfiltered large-scale training data, necessitating the erasure of unsafe concepts. Most existing methods focus on modifying the generation probabilities conditioned on texts containing unsafe descriptions. However, they cannot guarantee safe generation for texts unseen during training, especially for prompts from adversarial attacks. In this paper, we re-analyze the erasure task and point out that existing methods cannot guarantee the minimization of the total probability of unsafe generation. To tackle this problem, we propose Dark Miner, a recurring three-stage process comprising mining, verifying, and circumventing: it greedily mines embeddings that maximize the generation probability of unsafe concepts, and thereby reduces unsafe generation more effectively. In the experiments, we evaluate its performance on two inappropriate concepts, two objects, and two styles. Compared with 6 previous state-of-the-art methods, our method achieves better erasure and defense results in most cases, especially under 4 state-of-the-art attacks, while preserving the model's native generation capability. Our code will be available on GitHub.
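The mining and circumventing stages can be pictured as two optimization loops: one searches for a text embedding that maximizes the model's likelihood of the unsafe images, the other fine-tunes the model so that this embedding no longer elicits the concept. Below is a minimal PyTorch sketch of this idea, not the paper's released code: `eps_model(x_t, t, cond)` is assumed to be a generic epsilon-prediction diffusion model, `unsafe_x0` a batch of images depicting the target concept, and anchoring the mined embedding to a safe reference (e.g., the empty prompt) is one common erasure objective we assume here for illustration.

```python
import torch
import torch.nn.functional as F

def mine_embedding(eps_model, unsafe_x0, alphas_cumprod,
                   emb_dim=768, steps=200, lr=1e-2):
    """Mining stage (sketch): greedily optimize a text embedding that
    maximizes the generation probability of the unsafe images, i.e.,
    minimizes the conditional denoising loss on them."""
    emb = torch.zeros(1, 77, emb_dim, requires_grad=True)
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(steps):
        t = torch.randint(0, len(alphas_cumprod), (unsafe_x0.size(0),))
        noise = torch.randn_like(unsafe_x0)
        a = alphas_cumprod[t].view(-1, 1, 1, 1)
        x_t = a.sqrt() * unsafe_x0 + (1 - a).sqrt() * noise  # forward diffusion
        pred = eps_model(x_t, t, emb.expand(unsafe_x0.size(0), -1, -1))
        loss = F.mse_loss(pred, noise)
        opt.zero_grad(); loss.backward(); opt.step()
    return emb.detach()

def circumvent(eps_model, mined_emb, anchor_emb, alphas_cumprod,
               shape, steps=100, lr=1e-5):
    """Circumventing stage (sketch): fine-tune the model so its prediction
    under the mined embedding matches its prediction under a safe anchor."""
    opt = torch.optim.Adam(eps_model.parameters(), lr=lr)
    for _ in range(steps):
        x_t = torch.randn(shape)  # stand-in for noisy latents; a real run
                                  # would sample x_t from the diffusion process
        t = torch.randint(0, len(alphas_cumprod), (shape[0],))
        with torch.no_grad():
            target = eps_model(x_t, t, anchor_emb.expand(shape[0], -1, -1))
        loss = F.mse_loss(
            eps_model(x_t, t, mined_emb.expand(shape[0], -1, -1)), target)
        opt.zero_grad(); loss.backward(); opt.step()
```

The outer loop would alternate these two stages, with the verifying stage in between checking whether the newly mined embedding still elicits the concept (e.g., via a safety classifier) and terminating the recursion once it does not.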
Abstract: Watermarking is a tool for actively identifying and attributing images generated by latent diffusion models. Existing methods face a dilemma between watermark robustness and image quality, because watermark detection is performed in pixel space, which intrinsically links the two. In this paper, we highlight that an effective solution is to both inject and detect watermarks in latent space, and we propose Latent Watermark (LW) with a progressive training strategy. Experiments show that compared to recently proposed methods such as StegaStamp, StableSignature, RoSteALS, and TreeRing, LW not only surpasses them in robustness but also offers superior image quality. When we inject 64-bit messages, LW achieves an identification performance close to 100% and an attribution performance above 97% under 9 single-attack scenarios and one all-attack scenario. Our code will be available on GitHub.
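To make the "inject and detect in latent space" idea concrete, here is a minimal PyTorch sketch under stated assumptions, not the paper's exact architecture: a VAE-based latent diffusion model exposes `vae_encode`/`vae_decode`, latents are 4x64x64, and `MsgEncoder`/`MsgDecoder` are hypothetical message networks trained jointly with a bit-recovery loss.

```python
import torch
import torch.nn as nn

class MsgEncoder(nn.Module):
    """Maps an n-bit message to an additive perturbation in latent space."""
    def __init__(self, n_bits=64, latent_ch=4, latent_hw=64):
        super().__init__()
        self.fc = nn.Linear(n_bits, latent_ch * latent_hw * latent_hw)
        self.shape = (latent_ch, latent_hw, latent_hw)
    def forward(self, msg):               # msg: (B, n_bits) in {0, 1}
        return self.fc(msg).view(-1, *self.shape)

class MsgDecoder(nn.Module):
    """Recovers message bits from the latent of a (possibly attacked) image."""
    def __init__(self, n_bits=64, latent_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_ch, 64, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_bits))
    def forward(self, z):
        return self.net(z)                # logits per bit

def inject(z, msg, msg_enc, strength=0.1):
    # Injection happens in latent space: perturb the generated latent
    # before VAE decoding, so pixel-space quality is only indirectly touched.
    return z + strength * msg_enc(msg)

def detect(image, vae_encode, msg_dec):
    # Detection also happens in latent space: re-encode the image
    # (after any attack) and threshold the decoded bit logits.
    return (msg_dec(vae_encode(image)) > 0).float()
```

A plausible training objective is a bit-wise BCE loss between `MsgDecoder`'s logits and the injected message, balanced against an image-quality term on the decoded output; the progressive strategy mentioned in the abstract would schedule how these terms are introduced during training.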
Abstract: In the era of AIGC, the rapid development of visual content generation technologies, such as diffusion models, brings potential security risks to society. Existing generated-image detection methods suffer from performance drops when faced with out-of-domain generators and image scenes. To alleviate this problem, we propose the Artifact Purification Network (APN), which facilitates artifact extraction from generated images through explicit and implicit purification processes. For the explicit process, a suspicious frequency-band proposal method and a spatial feature decomposition method are proposed to extract artifact-related features. For the implicit process, a training strategy based on mutual information estimation is proposed to further purify the artifact-related features. Experiments show that for cross-generator detection, the average accuracy of APN is 5.6% to 16.4% higher than that of 10 previous methods on the GenImage dataset and 1.7% to 50.1% higher on the DiffusionForensics dataset. For cross-scene detection, APN maintains its high performance. Through visualization analysis, we find that APN extracts flexible forgery patterns and condenses forgery information that is diluted in irrelevant features. We also find that the artifact features APN focuses on across generators and scenes are global and diverse. The code will be available on GitHub.
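As one way to picture the suspicious frequency-band proposal step, the sketch below partitions the image spectrum into concentric bands and learns a soft gate per band, so the network can emphasize frequency ranges where generator artifacts concentrate. This is a hypothetical reading of the abstract, assuming PyTorch and a simple radial band layout, not the paper's actual module.

```python
import torch
import torch.nn as nn

class BandProposal(nn.Module):
    """Sketch of a suspicious frequency-band proposal: learn a soft gate
    over concentric frequency bands of the 2D spectrum."""
    def __init__(self, n_bands=8, img_size=224):
        super().__init__()
        self.gates = nn.Parameter(torch.zeros(n_bands))  # learnable band scores
        # Partition spectral radii into concentric bands.
        yy, xx = torch.meshgrid(torch.arange(img_size),
                                torch.arange(img_size), indexing="ij")
        r = ((yy - img_size / 2) ** 2 + (xx - img_size / 2) ** 2).sqrt()
        edges = torch.linspace(0.0, float(r.max()) + 1e-6, n_bands + 1)
        self.register_buffer("band_masks", torch.stack(
            [((r >= edges[i]) & (r < edges[i + 1])).float()
             for i in range(n_bands)]))

    def forward(self, x):                 # x: (B, C, H, W)
        # Center the spectrum, gate each band, and invert back to image space.
        spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
        g = torch.sigmoid(self.gates)     # soft selection of suspicious bands
        mask = (g.view(-1, 1, 1) * self.band_masks).sum(0)
        out = torch.fft.ifft2(
            torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
        return out  # artifact-emphasized input for a downstream classifier
```

Trained end-to-end with the detector, such gates would be pushed toward bands that carry generator fingerprints; the implicit purification described in the abstract would then further separate artifact-related features from content features via a mutual-information objective.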