Title: Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA

Dec 05, 2019

Ronghang Hu, Amanpreet Singh, Trevor Darrell, Marcus Rohrbach

Abstract: Many visual scenes contain text that carries crucial information, and it is thus essential to understand text in images for downstream reasoning tasks. For example, a deep water label on a warning sign warns people about the danger in the scene. Recent work has explored the TextVQA task that requires reading and understanding text in images to answer a question. However, existing approaches for TextVQA are mostly based on custom pairwise fusion mechanisms between two modalities and are restricted to a single prediction step by casting TextVQA as a classification task. In this work, we propose a novel model for the TextVQA task based on a multimodal transformer architecture accompanied by a rich representation for text in images. Our model naturally fuses different modalities homogeneously by embedding them into a common semantic space where self-attention is applied to model inter- and intra-modality context. Furthermore, it enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification. Our model outperforms existing approaches on three benchmark datasets for the TextVQA task by a large margin.
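To make the idea of iterative answer decoding with a pointer over text in the image more concrete, here is a minimal toy sketch. It is not the authors' implementation: the function names (`decode_answer`, `toy_step`), the dot-product scoring, the greedy selection, and the hand-picked vectors are all assumptions made for illustration. In particular, `toy_step` is a hypothetical stand-in for the multimodal transformer. At each step, the sketch scores every fixed-vocabulary word and every OCR token against the current decoder state, picks the best candidate, and feeds its vector back in as the next input.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decode_answer(step_fn, start_vec, vocab, ocr, max_steps=5, end_token="<end>"):
    """Greedy multi-step decoding over a fixed vocabulary plus OCR tokens.

    vocab and ocr are lists of (word, vector) pairs. step_fn maps the
    previous input vector to a decoder state; it is a hypothetical
    placeholder for the real model, not part of any library.
    """
    candidates = vocab + ocr  # one joint pool: classify OR point to an OCR token
    answer = []
    prev_vec = start_vec
    for _ in range(max_steps):
        state = step_fn(prev_vec)
        # Assumed scoring rule: dot product between candidate and state.
        scores = [dot(vec, state) for _, vec in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        word, vec = candidates[best]
        if word == end_token:
            break
        answer.append(word)
        prev_vec = vec  # feed the chosen token back in for the next step
    return answer


def toy_step(prev_vec):
    """Toy 'decoder': rotates the vector by one position (illustration only)."""
    return [prev_vec[-1]] + prev_vec[:-1]


# Hand-picked toy vectors, chosen so the decoding path is easy to follow.
vocab = [("<end>", [0.0, 0.0, 2.0]), ("yes", [1.0, 0.0, 0.0]), ("no", [0.0, 1.0, 0.0])]
ocr = [("deep", [2.0, 0.0, 0.0]), ("water", [0.0, 2.0, 0.0])]

answer = decode_answer(toy_step, [0.0, 0.0, 1.0], vocab, ocr)
print(answer)  # ['deep', 'water']
```

Because the answer is built token by token, it can combine several OCR tokens (here "deep" and "water") that a one-step classifier over a fixed vocabulary could not produce.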
