Title: Masked Vision and Language Modeling for Multi-modal Representation Learning

Aug 03, 2022

Gukyeong Kwon, Zhaowei Cai, Avinash Ravichandran, Erhan Bas, Rahul Bhotika, Stefano Soatto

Abstract: In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality. This is motivated by the nature of image-text paired data: the image and the text convey almost the same information but in different formats. Reconstructing the masked signal of one modality conditioned on the other can also implicitly learn cross-modal alignment between language tokens and image patches. Our experiments on various V+L tasks show that the proposed method not only achieves state-of-the-art performance when trained on a large amount of data, but also outperforms other competitors by a significant margin in regimes of limited training data.
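The joint masking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows how the two training inputs might be built, one with the image masked and the text intact, the other with the text masked and the image intact. The names `mask_sequence` and `joint_masking_pairs`, the `[MASK]` placeholder, the string stand-ins for patches and tokens, and the masking ratios are all assumptions made for illustration; the abstract specifies none of them.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked patch or token


def mask_sequence(items, ratio, rng):
    """Mask about `ratio` of the items; return the masked list and the hidden targets."""
    n_mask = max(1, round(len(items) * ratio))
    hidden = set(rng.sample(range(len(items)), n_mask))
    masked = [MASK if i in hidden else item for i, item in enumerate(items)]
    targets = {i: items[i] for i in hidden}
    return masked, targets


def joint_masking_pairs(patches, tokens, image_ratio, text_ratio, seed=0):
    """Build two inputs: masked image + full text, and full image + masked text."""
    rng = random.Random(seed)
    masked_patches, patch_targets = mask_sequence(patches, image_ratio, rng)
    masked_tokens, token_targets = mask_sequence(tokens, text_ratio, rng)
    return {
        # Masked image modeling input: the intact text is the conditioning signal.
        "mim": {"image": masked_patches, "text": list(tokens), "targets": patch_targets},
        # Masked language modeling input: the intact image is the conditioning signal.
        "mlm": {"image": list(patches), "text": masked_tokens, "targets": token_targets},
    }


# Toy data: strings stand in for image patches and text tokens.
patches = [f"patch_{i}" for i in range(8)]
tokens = "a dog runs on the beach".split()
# The ratios below are illustrative assumptions, not values from the paper.
pairs = joint_masking_pairs(patches, tokens, image_ratio=0.5, text_ratio=0.25)
```

In each pair, a model would be trained to predict `targets` from the masked modality together with the unmasked one, which is where the cross-modal conditioning described in the abstract would enter.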

View paper on OpenReview
