Title: E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning

Jun 04, 2021

Haiyang Xu, Ming Yan, Chenliang Li, Bin Bi, Songfang Huang, Wenming Xiao, Fei Huang

Abstract: Vision-language pre-training (VLP) on large-scale image-text pairs has achieved great success on cross-modal downstream tasks. Most existing pre-training methods adopt a two-step training procedure, which first employs a pre-trained object detector to extract region-based visual features, then concatenates the image representation and text embedding as the input to a Transformer for training. However, these methods face two problems: they use the task-specific visual representation of a particular object detector for generic cross-modal understanding, and the two-stage pipeline is computationally inefficient. In this paper, we propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation, namely E2E-VLP, where we build a unified Transformer framework to jointly learn visual representations and semantic alignments between image and text. We incorporate the tasks of object detection and image captioning into pre-training with a unified Transformer encoder-decoder architecture to enhance visual learning. An extensive set of experiments has been conducted on well-established vision-language downstream tasks to demonstrate the effectiveness of this novel VLP paradigm.

* ACL 2021 main conference
