Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Felipe Nunez

SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering

Dec 16, 2022

Siwen Luo, Feiqi Cao, Felipe Nunez, Zean Wen, Josiah Poon, Caren Han

Figure 1 for SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering

Figure 2 for SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering

Figure 3 for SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering

Figure 4 for SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering

Abstract:Most TextVQA approaches focus on the integration of objects, scene texts and question words by a simple transformer encoder. But this fails to capture the semantic relations between different modalities. The paper proposes a Scene Graph based co-Attention Network (SceneGATE) for TextVQA, which reveals the semantic relations among the objects, Optical Character Recognition (OCR) tokens and the question words. It is achieved by a TextVQA-based scene graph that discovers the underlying semantics of an image. We created a guided-attention module to capture the intra-modal interplay between the language and the vision as a guidance for inter-modal interactions. To make explicit teaching of the relations between the two modalities, we proposed and integrated two attention modules, namely a scene graph-based semantic relation-aware attention and a positional relation-aware attention. We conducted extensive experiments on two benchmark datasets, Text-VQA and ST-VQA. It is shown that our SceneGATE method outperformed existing ones because of the scene graph and its attention modules.

Via

Access Paper or Ask Questions