Title: Adversarially Robust CLIP Models Can Induce Better (Robust) Perceptual Metrics

Feb 17, 2025

Francesco Croce, Christian Schlarmann, Naman Deep Singh, Matthias Hein

Abstract: Measuring perceptual similarity is a key tool in computer vision. In recent years, perceptual metrics based on features extracted from neural networks trained on large and diverse datasets, e.g. CLIP, have become popular. At the same time, metrics derived from neural network features are not adversarially robust. In this paper we show that adversarially robust CLIP models, called R-CLIP$_\textrm{F}$, obtained by unsupervised adversarial fine-tuning, induce a better and adversarially robust perceptual metric that outperforms existing metrics in a zero-shot setting and, after fine-tuning, matches the performance of state-of-the-art metrics while remaining robust. Moreover, our perceptual metric achieves strong performance on related tasks such as robust image-to-image retrieval, which becomes especially relevant when applied to "Not Safe for Work" (NSFW) content detection and dataset filtering. While standard perceptual metrics can be easily attacked by a small perturbation that completely degrades NSFW detection, our robust perceptual metric maintains high accuracy under attack while achieving similar performance on unperturbed images. Finally, perceptual metrics induced by robust CLIP models are more interpretable: feature inversion can show which images are considered similar, while text inversion can find which images are associated with a given prompt. This also allows us to visualize the very rich visual concepts learned by a CLIP model, including memorized persons, paintings and complex queries.
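For readers unfamiliar with feature-based perceptual metrics, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes the metric is the cosine distance between image embeddings and that a zero-shot similarity judgment picks whichever of two candidates is closer to a reference; the abstract does not spell out either detail. The names `perceptual_distance`, `two_afc_choice` and `toy_encoder` are hypothetical, and `toy_encoder` merely stands in for a real (robust) CLIP image encoder.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Encoder = Callable[[Sequence[float]], Vector]


def cosine_distance(u: Vector, v: Vector) -> float:
    """Return 1 - cos(u, v); 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def perceptual_distance(encode: Encoder, x: Sequence[float], y: Sequence[float]) -> float:
    """Distance between two images, measured in the encoder's feature space."""
    return cosine_distance(encode(x), encode(y))


def two_afc_choice(encode: Encoder, ref: Sequence[float],
                   img_a: Sequence[float], img_b: Sequence[float]) -> int:
    """Return 0 if img_a is judged closer to ref than img_b, else 1."""
    d_a = perceptual_distance(encode, ref, img_a)
    d_b = perceptual_distance(encode, ref, img_b)
    return 0 if d_a <= d_b else 1


def toy_encoder(image: Sequence[float]) -> List[float]:
    """Hypothetical placeholder: treats the flat pixel list as its own embedding.

    A real pipeline would call a (robust) CLIP image encoder here instead.
    """
    return list(image)


if __name__ == "__main__":
    reference = [1.0, 0.0, 0.0]
    candidate_a = [0.9, 0.1, 0.0]  # nearly the same direction as the reference
    candidate_b = [0.0, 1.0, 0.0]  # orthogonal to the reference
    print(two_afc_choice(toy_encoder, reference, candidate_a, candidate_b))
```

Under these assumptions, swapping the encoder for an adversarially fine-tuned one changes only the feature space in which distances are measured; the decision rule stays the same.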

* This work has been accepted for publication at the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). The final version will be available on IEEE Xplore.
