Title: ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation

Aug 31, 2023

Weihan Wang, Zhen Yang, Bin Xu, Juanzi Li, Yankui Sun

Abstract: Vision-language pre-training (VLP) methods have flourished recently. Their central goal is to jointly learn visual and textual features via a transformer-based architecture, and they have demonstrated promising improvements on a variety of vision-language tasks. Prior work usually focuses on how to align visual and textual features, but strategies for improving model robustness and speeding up convergence remain insufficiently explored. In this paper, we propose a novel method, ViLTA, comprising two components that help the model learn fine-grained representations from image-text pairs. For Masked Language Modeling (MLM), we propose a cross-distillation method that generates soft labels to enhance the robustness of the model, alleviating the problem of treating synonyms of masked words as negative samples under one-hot labels. For Image-Text Matching (ITM), we leverage the current language encoder to synthesize hard negatives based on the context of the language input, encouraging the model to learn high-quality representations by increasing the difficulty of the ITM task. With these techniques, ViLTA achieves better performance on various vision-language tasks. Extensive experiments on benchmark datasets demonstrate the effectiveness of ViLTA and its promising potential for vision-language pre-training.

* 15 pages, 5 figures
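
The abstract's point about one-hot MLM labels treating synonyms as negatives can be illustrated with a small sketch. This is not the authors' implementation: the four-word vocabulary, the hand-written "teacher" logits, and the function names `one_hot_mlm_loss` and `soft_label_mlm_loss` are all assumptions made for illustration, and the abstract does not specify how the cross-distillation teacher is built. The sketch only shows generic soft-label cross-entropy, which is one plausible reading of "soft labels".

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def one_hot_mlm_loss(student_logits, target_ids):
    # Standard MLM cross-entropy: only the ground-truth token counts as correct.
    logp = log_softmax(student_logits)
    rows = np.arange(len(target_ids))
    return float(-logp[rows, target_ids].mean())

def soft_label_mlm_loss(student_logits, teacher_logits, temperature=1.0):
    # Cross-entropy against a teacher distribution (soft labels) instead of
    # a one-hot vector. The teacher here is hypothetical.
    soft_labels = np.exp(log_softmax(teacher_logits / temperature))
    logp = log_softmax(student_logits)
    return float(-(soft_labels * logp).sum(axis=-1).mean())

# Toy vocabulary (assumed): index 0 = "dog", 1 = "puppy", 2 = "car", 3 = "tree".
target_ids = np.array([0])  # the masked word was "dog"

# Hypothetical teacher that considers "dog" and "puppy" equally plausible.
teacher_logits = np.array([[3.0, 3.0, -3.0, -3.0]])

# Two student predictions: one picks the synonym, one picks an unrelated word.
student_puppy = np.array([[0.0, 4.0, 0.0, 0.0]])
student_car = np.array([[0.0, 0.0, 4.0, 0.0]])

print("one-hot, predicts puppy:", one_hot_mlm_loss(student_puppy, target_ids))
print("one-hot, predicts car:  ", one_hot_mlm_loss(student_car, target_ids))
print("soft,    predicts puppy:", soft_label_mlm_loss(student_puppy, teacher_logits))
print("soft,    predicts car:  ", soft_label_mlm_loss(student_car, teacher_logits))
```

Under the one-hot loss, predicting the synonym is penalized exactly as much as predicting an unrelated word, whereas the soft-label loss penalizes the synonym less. How well this works in practice depends on the quality of the teacher distribution, which this toy example simply assumes.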
