Title: Enhancing Robustness of CLIP to Common Corruptions through Bimodal Test-Time Adaptation

Dec 03, 2024

Sarthak Kumar Maharana, Baoming Zhang, Leonid Karlinsky, Rogerio Feris, Yunhui Guo

Abstract: Although open-vocabulary classification models like Contrastive Language-Image Pretraining (CLIP) have demonstrated strong zero-shot learning capabilities, their robustness to common image corruptions remains poorly understood. Through extensive experiments, we show that zero-shot CLIP lacks robustness to common image corruptions at increasing severity levels at test time, necessitating the adaptation of CLIP to unlabeled corrupted images using test-time adaptation (TTA). However, we found that existing TTA methods have severe limitations in adapting CLIP due to their unimodal nature. To address these limitations, we propose a bimodal TTA method specially designed to improve CLIP's robustness to common image corruptions. The key insight of our approach is not only to adapt the visual encoders for better image feature extraction but also to strengthen the alignment between image and text features by promoting a stronger association between the image class prototype, computed using pseudo-labels, and the corresponding text feature. We evaluate our approach on benchmark image corruption datasets and achieve state-of-the-art results in TTA for CLIP, specifically for domains involving image corruption. In particular, with a ViT-B/16 vision backbone, we obtain mean accuracy improvements of 9.7%, 5.94%, and 5.12% for CIFAR-10C, CIFAR-100C, and ImageNet-C, respectively.

* Preprint. Under review
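
The prototype-to-text alignment idea described in the abstract can be illustrated with a rough sketch. The code below is not the authors' implementation or their exact loss; it only assumes that image and text features are already extracted as arrays, that pseudo-labels come from the highest cosine similarity to the text features, and that alignment is measured as the mean cosine similarity between each pseudo-labeled class prototype and its text feature. The function names `l2_normalize` and `prototype_text_alignment` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def prototype_text_alignment(image_feats, text_feats):
    """Illustrative alignment objective (an assumption, not the paper's loss).

    image_feats: (N, D) array of image features for one unlabeled test batch.
    text_feats:  (C, D) array of text features, one per class prompt.
    Returns the pseudo-labels and a scalar loss that decreases as each
    class prototype moves closer to its corresponding text feature.
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats)

    # Zero-shot style prediction: cosine similarity to every class prompt.
    logits = img @ txt.T                      # shape (N, C)
    pseudo_labels = logits.argmax(axis=1)     # shape (N,)

    # Class prototype = mean image feature over samples with that pseudo-label.
    present = np.unique(pseudo_labels)
    prototypes = np.stack(
        [img[pseudo_labels == c].mean(axis=0) for c in present]
    )
    prototypes = l2_normalize(prototypes)

    # Negative mean cosine similarity between prototype and matching text feature.
    sims = np.sum(prototypes * txt[present], axis=1)
    loss = float(-sims.mean())
    return pseudo_labels, loss
```

In an actual TTA loop this quantity would be computed with a differentiable framework and minimized with respect to the encoder parameters being adapted; the NumPy version only shows how the pseudo-labels, prototypes, and alignment score relate to each other.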
