Joint visual and language modeling on large-scale datasets has recently shown good progress in multimodal tasks compared to single-modality learning. However, the robustness of these approaches against real-world perturbations has not been studied. In this work, we perform the first extensive robustness study of such models against various real-world perturbations, focusing on video and language. We focus on text-to-video retrieval and propose two large-scale benchmark datasets, MSRVTT-P and YouCook2-P, which utilize 90 different visual and 35 different textual perturbations. The study reveals some interesting findings: 1) the studied models are more robust when text is perturbed than when video is perturbed; 2) the transformer text encoder is more robust to non-semantic-changing text perturbations and to visual perturbations than word-embedding approaches; and 3) using two-branch encoders in isolation is typically more robust than architectures that use cross-attention. We hope this study will serve as a benchmark and guide future research in robust multimodal learning.