Md Montasir Bin Shams

Unaligning Everything: Or Aligning Any Text to Any Image in Multimodal Models

Jul 01, 2024

Shaeke Salman, Md Montasir Bin Shams, Xiuwen Liu

Abstract: Utilizing a shared embedding space, emerging multimodal models exhibit unprecedented zero-shot capabilities. However, the shared embedding space could lead to new vulnerabilities if different modalities can be misaligned. In this paper, we extend and utilize a recently developed effective gradient-based procedure that allows us to match the embedding of a given text by minimally modifying an image. Using the procedure, we show that we can align the embeddings of distinguishable texts to any image through unnoticeable adversarial attacks in joint image-text models, revealing that semantically unrelated images can have embeddings of identical texts and at the same time visually indistinguishable images can be matched to the embeddings of very different texts. Our technique achieves a 100% success rate when it is applied to text datasets and images from multiple sources. Without overcoming the vulnerability, multimodal models cannot robustly align inputs from different modalities in a semantically meaningful way. Warning: the text data used in this paper are toxic in nature and may be offensive to some readers.

* 14 pages, 14 figures. arXiv admin note: substantial text overlap with arXiv:2401.15568, arXiv:2402.08473
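
The gradient-based matching idea in the abstract above can be illustrated with a toy sketch. It is not the authors' procedure: the linear encoder `W`, the vectors `image` and `target`, and the function names are all hypothetical stand-ins chosen so the example runs with the standard library alone. It performs gradient descent on a perturbation `delta` so that the embedding of `image + delta` approaches a target "text" embedding.

```python
# Toy sketch: nudge an "image" vector so its embedding matches a target
# "text" embedding. The linear encoder W is a hypothetical stand-in for a
# real image encoder; nothing here is taken from the paper's code.

def encode(W, x):
    """Linear toy encoder: returns the matrix-vector product W @ x as a list."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def match_embedding(W, image, target, lr=0.05, steps=500):
    """Gradient descent on 0.5 * ||encode(image + delta) - target||^2 over delta."""
    delta = [0.0] * len(image)
    for _ in range(steps):
        x = [a + d for a, d in zip(image, delta)]
        residual = [e - t for e, t in zip(encode(W, x), target)]
        # For a linear encoder, the gradient with respect to delta is W^T residual.
        grad = [sum(W[i][j] * residual[i] for i in range(len(W)))
                for j in range(len(image))]
        delta = [d - lr * g for d, g in zip(delta, grad)]
    return delta

W = [[1.0, 0.5, -0.5, 0.0],
     [0.0, 1.0, 0.5, -1.0]]      # hypothetical encoder weights
image = [0.2, 0.4, 0.6, 0.8]     # hypothetical flattened image
target = [0.5, 0.1]              # hypothetical text embedding

delta = match_embedding(W, image, target)
adversarial = [a + d for a, d in zip(image, delta)]
final_error = max(abs(e - t) for e, t in zip(encode(W, adversarial), target))
```

With a real model the encoder is nonlinear and the gradient would come from automatic differentiation; the loop structure (perturb, compare embeddings, step along the negative gradient) is the part this sketch is meant to show.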

Intriguing Differences Between Zero-Shot and Systematic Evaluations of Vision-Language Transformer Models

Feb 13, 2024

Shaeke Salman, Md Montasir Bin Shams, Xiuwen Liu, Lingjiong Zhu

Abstract: Transformer-based models have dominated natural language processing and other areas in the last few years due to their superior (zero-shot) performance on benchmark datasets. However, these models are poorly understood due to their complexity and size. While probing-based methods are widely used to understand specific properties, the structures of the representation space are not systematically characterized; consequently, it is unclear how such models generalize and overgeneralize to new inputs beyond datasets. In this paper, based on a new gradient descent optimization method, we are able to explore the embedding space of a commonly used vision-language model. Using the Imagenette dataset, we show that while the model achieves over 99% zero-shot classification performance, it fails systematic evaluations completely. Using a linear approximation, we provide a framework to explain the striking differences. We have also obtained similar results using a different model to support that our results are applicable to other transformer models with continuous inputs. We also propose a robust way to detect the modified images.

* 30 pages, 30 figures
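
As a rough illustration of the zero-shot classification mentioned in the abstract above, the sketch below assigns an image the label whose text embedding has the highest cosine similarity with the image embedding. All embedding values are made up for the example, and `zero_shot_label` is a hypothetical helper, not an API of any model discussed in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def zero_shot_label(image_embedding, text_embeddings):
    """Return the label whose text embedding is most similar to the image embedding."""
    return max(text_embeddings,
               key=lambda label: cosine(image_embedding, text_embeddings[label]))

# Hypothetical 3-dimensional text embeddings for three class prompts.
text_embeddings = {
    "church": [0.9, 0.1, 0.0],
    "golf ball": [0.1, 0.9, 0.1],
    "parachute": [0.0, 0.2, 0.9],
}

clean = [0.8, 0.2, 0.1]      # hypothetical embedding of an unmodified image
modified = [0.1, 0.1, 0.7]   # hypothetical embedding after the image was altered

clean_label = zero_shot_label(clean, text_embeddings)
modified_label = zero_shot_label(modified, text_embeddings)
```

Because the predicted label depends only on where the embedding lands, any input change that moves the embedding toward another class prompt changes the prediction, which is the kind of gap between benchmark accuracy and systematic evaluation the abstract describes.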

Intriguing Equivalence Structures of the Embedding Space of Vision Transformers

Jan 28, 2024

Shaeke Salman, Md Montasir Bin Shams, Xiuwen Liu

Abstract: Pre-trained large foundation models play a central role in the recent surge of artificial intelligence, resulting in fine-tuned models with remarkable abilities when measured on benchmark datasets, standard exams, and applications. Due to their inherent complexity, these models are not well understood. While small adversarial inputs to such models are well known, the structures of the representation space are not well characterized despite their fundamental importance. In this paper, using vision transformers as an example due to the continuous nature of their input space, we show via analyses and systematic experiments that the representation space consists of large piecewise linear subspaces where there exist very different inputs sharing the same representations, and at the same time, local normal spaces where there are visually indistinguishable inputs having very different representations. The empirical results are further verified using the local directional estimations of the Lipschitz constants of the underlying models. Consequently, the resulting representations change the results of downstream models, and such models are subject to overgeneralization and have limited semantically meaningful generalization capability.

* 8 pages, 9 figures
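
One simple way to picture a local directional estimate of a Lipschitz constant, as referenced in the abstract above, is a finite-difference ratio along a chosen direction. The sketch below is an assumption about the general idea, not the estimator used in the paper; `toy_model` is a hypothetical stand-in for an embedding model that is very sensitive along one input axis and nearly flat along another.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list."""
    return math.sqrt(sum(a * a for a in v))

def directional_lipschitz(f, x, direction, eps=1e-4):
    """Finite-difference estimate of ||f(x + eps*d) - f(x)|| / ||eps*d||."""
    step = [eps * d for d in direction]
    moved = f([a + s for a, s in zip(x, step)])
    base = f(x)
    return norm([m - b for m, b in zip(moved, base)]) / norm(step)

def toy_model(x):
    # Hypothetical model: steep along x[0], almost constant along x[1].
    return [math.tanh(10.0 * x[0]), 0.01 * x[1]]

x = [0.0, 0.0]
sensitive = directional_lipschitz(toy_model, x, [1.0, 0.0])
flat = directional_lipschitz(toy_model, x, [0.0, 1.0])
```

A large ratio along one direction corresponds to nearby inputs with very different representations, while a ratio near zero along another direction corresponds to distinct inputs with almost the same representation.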
