Title: Embedding Shift Dissection on CLIP: Effects of Augmentations on VLM's Representation Learning

Mar 30, 2025

Ashim Dahal, Saydul Akbar Murad, Nick Rahimi

Abstract: Understanding how the representations of Vision Language Models (VLMs) such as CLIP shift under different augmentations provides valuable insight for mechanistic interpretability. In this study, we show the shift in CLIP's embeddings under 9 common augmentation techniques: noise, blur, color jitter, scale and rotate, flip, elastic and perspective transforms, random brightness and contrast, and coarse dropout of pixel blocks. We examine the embedding shifts in terms of attention-map, patch, edge, and detail-preservation similarity, cosine similarity, L2 distance, pairwise distance, and dendrogram clusters, and provide qualitative analysis on sample images. Our findings suggest that certain augmentations, such as noise, perspective transform, and shift scaling, have a more drastic impact on the embeddings. This study provides a concrete foundation for future work on VLM robustness for mechanistic interpretation and adversarial data defense.

* accepted at MIV at CVPR 2025
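
Two of the metrics named in the abstract, cosine similarity and L2 distance, can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, and the arrays are random stand-ins for the embeddings that a CLIP image encoder would produce for original and augmented images.

```python
import numpy as np


def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (n, d) embedding arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    dot = np.sum(a * b, axis=-1)
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return dot / norms


def l2_distance(a, b):
    """Row-wise Euclidean (L2) distance between two (n, d) embedding arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return np.linalg.norm(a - b, axis=-1)


def embedding_shift(original, augmented):
    """Mean cosine similarity and mean L2 distance over a batch of pairs."""
    return {
        "cosine_similarity": float(np.mean(cosine_similarity(original, augmented))),
        "l2_distance": float(np.mean(l2_distance(original, augmented))),
    }


# Stand-in data (assumption): random vectors in place of real CLIP embeddings.
rng = np.random.default_rng(0)
original = rng.normal(size=(4, 512))
# A "mild" augmentation barely perturbs the embedding; a "strong" one moves it far.
mild = original + 0.01 * rng.normal(size=original.shape)
strong = original + 1.0 * rng.normal(size=original.shape)

mild_shift = embedding_shift(original, mild)
strong_shift = embedding_shift(original, strong)
print("mild:  ", mild_shift)
print("strong:", strong_shift)
```

A larger shift shows up as a lower cosine similarity and a higher L2 distance. With real data, `original` and `augmented` would hold the encoder outputs for the same images before and after one augmentation.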
