
Title: Separate and Locate: Rethink the Text in Text-based Visual Question Answering

Aug 31, 2023

Chengyang Fang, Jiangnan Li, Liang Li, Can Ma, Dayong Hu



Abstract: Text-based Visual Question Answering (TextVQA) aims to answer questions about the text in images. Most works in this field focus on designing network structures or pre-training tasks. All these methods list the OCR texts in reading order (from left to right and top to bottom) to form a sequence, which is treated as a natural language "sentence". However, they ignore the fact that most OCR words in the TextVQA task do not have a semantic contextual relationship. In addition, these approaches use 1-D position embedding to construct the spatial relation between OCR tokens sequentially, which is not reasonable: the 1-D position embedding can only represent the left-right sequence relationship between words in a sentence, not the complex spatial position relationship. To tackle these problems, we propose a novel method named Separate and Locate (SaL) that explores text contextual cues and designs spatial position embedding to construct spatial relations between OCR texts. Specifically, we propose a Text Semantic Separate (TSS) module that helps the model recognize whether words have semantic contextual relations. Then, we introduce a Spatial Circle Position (SCP) module that helps the model better construct and reason about the spatial position relationships between OCR texts. Our SaL model outperforms the baseline model by 4.44% and 3.96% accuracy on the TextVQA and ST-VQA datasets, respectively. Compared with the state-of-the-art method pre-trained on 64 million samples, our method, without any pre-training tasks, still achieves accuracy improvements of 2.68% and 2.52% on TextVQA and ST-VQA, respectively. Our code and models will be released at https://github.com/fangbufang/SaL.
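
The limitation of reading-order sequencing described in the abstract can be illustrated with a small sketch. This is not the authors' SCP module or any code from the SaL repository; the token list, the pixel coordinates, the `row_tol` parameter, and all function names are hypothetical and chosen only for illustration. It sorts OCR tokens top to bottom and left to right, then compares the 1-D index distance with the 2-D distance between token centers.

```python
from math import hypot

# Hypothetical OCR tokens: (word, x_center, y_center) in pixels.
tokens = [
    ("Open", 900, 200),
    ("EXIT", 100, 100),
    ("12", 100, 200),
    ("Coffee", 900, 100),
]

def reading_order(tokens, row_tol=50):
    """Sort top to bottom, then left to right, bucketing rows by row_tol pixels."""
    return sorted(tokens, key=lambda t: (t[2] // row_tol, t[1]))

def index_distance(seq, a, b):
    """Distance between two words in the flattened 1-D sequence."""
    words = [t[0] for t in seq]
    return abs(words.index(a) - words.index(b))

def spatial_distance(seq, a, b):
    """Euclidean distance between the two words' centers in the image."""
    pos = {t[0]: (t[1], t[2]) for t in seq}
    (xa, ya), (xb, yb) = pos[a], pos[b]
    return hypot(xa - xb, ya - yb)

seq = reading_order(tokens)
print([t[0] for t in seq])

# In the 1-D sequence "Coffee" is the closer neighbour of "EXIT",
# but in the image "12" sits directly below "EXIT" and is much closer.
print(index_distance(seq, "EXIT", "Coffee"), index_distance(seq, "EXIT", "12"))
print(spatial_distance(seq, "EXIT", "Coffee"), spatial_distance(seq, "EXIT", "12"))
```

In this toy layout the sequence index ranks "Coffee" as nearer to "EXIT" than "12", while the image geometry says the opposite, which is the kind of mismatch a purely 1-D position embedding cannot express.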

* Accepted by ACM MM 2023
