Title: TxT: Crossmodal End-to-End Learning with Transformers

Sep 09, 2021

Jan-Martin O. Steitz, Jonas Pfeiffer, Iryna Gurevych, Stefan Roth

Abstract: Reasoning over multiple modalities, e.g. in Visual Question Answering (VQA), requires an alignment of semantic concepts across domains. Despite the widespread success of end-to-end learning, today's multimodal pipelines by and large leverage pre-extracted, fixed features from object detectors, typically Faster R-CNN, as representations of the visual world. The obvious downside is that the visual representation is not specifically tuned to the multimodal task at hand. At the same time, while transformer-based object detectors have gained popularity, they have not been employed in today's multimodal pipelines. We address both shortcomings with TxT, a transformer-based crossmodal pipeline that enables fine-tuning both language and visual components on the downstream task in a fully end-to-end manner. We overcome existing limitations of transformer-based detectors for multimodal reasoning regarding the integration of global context and their scalability. Our transformer-based multimodal model achieves considerable gains from end-to-end learning for multimodal question answering.

* To appear at the 43rd DAGM German Conference on Pattern Recognition (GCPR) 2021
