Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chashi Mahiul Islam

Are Vision Transformer Representations Semantically Meaningful? A Case Study in Medical Imaging

Jul 02, 2025

Montasir Shams, Chashi Mahiul Islam, Shaeke Salman, Phat Tran, Xiuwen Liu

Abstract:Vision transformers (ViTs) have rapidly gained prominence in medical imaging tasks such as disease classification, segmentation, and detection due to their superior accuracy compared to conventional deep learning models. However, due to their size and complex interactions via the self-attention mechanism, they are not well understood. In particular, it is unclear whether the representations produced by such models are semantically meaningful. In this paper, using a projected gradient-based algorithm, we show that their representations are not semantically meaningful and they are inherently vulnerable to small changes. Images with imperceptible differences can have very different representations; on the other hand, images that should belong to different semantic classes can have nearly identical representations. Such vulnerability can lead to unreliable classification results; for example, unnoticeable changes cause the classification accuracy to be reduced by over 60\%. %. To the best of our knowledge, this is the first work to systematically demonstrate this fundamental lack of semantic meaningfulness in ViT representations for medical image classification, revealing a critical challenge for their deployment in safety-critical systems.

* 9 pages

Via

Access Paper or Ask Questions

DeepSeek on a Trip: Inducing Targeted Visual Hallucinations via Representation Vulnerabilities

Feb 11, 2025

Chashi Mahiul Islam, Samuel Jacob Chacko, Preston Horne, Xiuwen Liu

Figure 1 for DeepSeek on a Trip: Inducing Targeted Visual Hallucinations via Representation Vulnerabilities

Figure 2 for DeepSeek on a Trip: Inducing Targeted Visual Hallucinations via Representation Vulnerabilities

Figure 3 for DeepSeek on a Trip: Inducing Targeted Visual Hallucinations via Representation Vulnerabilities

Figure 4 for DeepSeek on a Trip: Inducing Targeted Visual Hallucinations via Representation Vulnerabilities

Abstract:Multimodal Large Language Models (MLLMs) represent the cutting edge of AI technology, with DeepSeek models emerging as a leading open-source alternative offering competitive performance to closed-source systems. While these models demonstrate remarkable capabilities, their vision-language integration mechanisms introduce specific vulnerabilities. We implement an adapted embedding manipulation attack on DeepSeek Janus that induces targeted visual hallucinations through systematic optimization of image embeddings. Through extensive experimentation across COCO, DALL-E 3, and SVIT datasets, we achieve hallucination rates of up to 98.0% while maintaining high visual fidelity (SSIM > 0.88) of the manipulated images on open-ended questions. Our analysis demonstrates that both 1B and 7B variants of DeepSeek Janus are susceptible to these attacks, with closed-form evaluation showing consistently higher hallucination rates compared to open-ended questioning. We introduce a novel multi-prompt hallucination detection framework using LLaMA-3.1 8B Instruct for robust evaluation. The implications of these findings are particularly concerning given DeepSeek's open-source nature and widespread deployment potential. This research emphasizes the critical need for embedding-level security measures in MLLM deployment pipelines and contributes to the broader discussion of responsible AI implementation.

* 19 pages, 4 figures

Via

Access Paper or Ask Questions

Mechanistic Understandings of Representation Vulnerabilities and Engineering Robust Vision Transformers

Feb 07, 2025

Chashi Mahiul Islam, Samuel Jacob Chacko, Mao Nishino, Xiuwen Liu

Abstract:While transformer-based models dominate NLP and vision applications, their underlying mechanisms to map the input space to the label space semantically are not well understood. In this paper, we study the sources of known representation vulnerabilities of vision transformers (ViT), where perceptually identical images can have very different representations and semantically unrelated images can have the same representation. Our analysis indicates that imperceptible changes to the input can result in significant representation changes, particularly in later layers, suggesting potential instabilities in the performance of ViTs. Our comprehensive study reveals that adversarial effects, while subtle in early layers, propagate and amplify through the network, becoming most pronounced in middle to late layers. This insight motivates the development of NeuroShield-ViT, a novel defense mechanism that strategically neutralizes vulnerable neurons in earlier layers to prevent the cascade of adversarial effects. We demonstrate NeuroShield-ViT's effectiveness across various attacks, particularly excelling against strong iterative attacks, and showcase its remarkable zero-shot generalization capabilities. Without fine-tuning, our method achieves a competitive accuracy of 77.8% on adversarial examples, surpassing conventional robustness methods. Our results shed new light on how adversarial effects propagate through ViT layers, while providing a promising approach to enhance the robustness of vision transformers against adversarial attacks. Additionally, they provide a promising approach to enhance the robustness of vision transformers against adversarial attacks.

* 10 pages, 5 figures

Via

Access Paper or Ask Questions

Malicious Path Manipulations via Exploitation of Representation Vulnerabilities of Vision-Language Navigation Systems

Jul 10, 2024

Chashi Mahiul Islam, Shaeke Salman, Montasir Shams, Xiuwen Liu, Piyush Kumar

Figure 1 for Malicious Path Manipulations via Exploitation of Representation Vulnerabilities of Vision-Language Navigation Systems

Figure 2 for Malicious Path Manipulations via Exploitation of Representation Vulnerabilities of Vision-Language Navigation Systems

Figure 3 for Malicious Path Manipulations via Exploitation of Representation Vulnerabilities of Vision-Language Navigation Systems

Figure 4 for Malicious Path Manipulations via Exploitation of Representation Vulnerabilities of Vision-Language Navigation Systems

Abstract:Building on the unprecedented capabilities of large language models for command understanding and zero-shot recognition of multi-modal vision-language transformers, visual language navigation (VLN) has emerged as an effective way to address multiple fundamental challenges toward a natural language interface to robot navigation. However, such vision-language models are inherently vulnerable due to the lack of semantic meaning of the underlying embedding space. Using a recently developed gradient based optimization procedure, we demonstrate that images can be modified imperceptibly to match the representation of totally different images and unrelated texts for a vision-language model. Building on this, we develop algorithms that can adversarially modify a minimal number of images so that the robot will follow a route of choice for commands that require a number of landmarks. We demonstrate that experimentally using a recently proposed VLN system; for a given navigation command, a robot can be made to follow drastically different routes. We also develop an efficient algorithm to detect such malicious modifications reliably based on the fact that the adversarially modified images have much higher sensitivity to added Gaussian noise than the original images.

* 8 pages, 5 figures. This paper has been accepted for publication at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2024

Via

Access Paper or Ask Questions