Title: Adversarial Robustness for Visual Grounding of Multimodal Large Language Models

May 16, 2024

Kuofeng Gao, Yang Bai, Jiawang Bai, Yong Yang, Shu-Tao Xia

Abstract: Multi-modal Large Language Models (MLLMs) have recently achieved enhanced performance across various vision-language tasks, including visual grounding. However, the adversarial robustness of visual grounding in MLLMs remains unexplored. To fill this gap, we use referring expression comprehension (REC) as an example visual grounding task and propose three adversarial attack paradigms. First, untargeted adversarial attacks induce MLLMs to generate incorrect bounding boxes for each object. Second, exclusive targeted adversarial attacks cause all generated outputs to be the same target bounding box. Third, permuted targeted adversarial attacks aim to permute all bounding boxes among different objects within a single image. Extensive experiments demonstrate that the proposed methods can successfully attack the visual grounding capabilities of MLLMs. Our methods not only provide a new perspective for designing novel attacks but also serve as a strong baseline for improving the adversarial robustness of visual grounding in MLLMs.
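To make the three paradigms concrete, the following is a minimal sketch of how the attack goals differ at the level of bounding boxes. It is not the authors' implementation and covers neither the model nor the perturbation optimization. All function names, the `(x1, y1, x2, y2)` box format, the cyclic-shift permutation, and the 0.5 IoU threshold are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def untargeted_success(pred: Box, truth: Box, thresh: float = 0.5) -> bool:
    """Untargeted goal: the predicted box no longer matches the true box.

    The 0.5 threshold is an assumption, not a value taken from the paper.
    """
    return iou(pred, truth) < thresh


def exclusive_targets(boxes: List[Box], target: Box) -> List[Box]:
    """Exclusive targeted goal: every object should map to one target box."""
    return [target for _ in boxes]


def permuted_targets(boxes: List[Box], shift: int = 1) -> List[Box]:
    """Permuted targeted goal: each object should map to another object's box.

    A cyclic shift is used here as one simple permutation with no fixed
    points; other permutations are equally valid under the description.
    """
    n = len(boxes)
    if n < 2 or shift % n == 0:
        raise ValueError("need at least two boxes and a non-trivial shift")
    return [boxes[(i + shift) % n] for i in range(n)]


if __name__ == "__main__":
    truth = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
    print(exclusive_targets(truth, (0, 0, 5, 5)))
    print(permuted_targets(truth))
    print(untargeted_success((5, 5, 15, 15), truth[0]))
```

In an actual attack, these target boxes would serve as the objective for optimizing an image perturbation against the model; that step depends on the specific MLLM and is left out here.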

* ICLR 2024 Workshop on Reliable and Responsible Foundation Models
