Changcun Bao

Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration

Sep 03, 2023

Haoyu Cao, Changcun Bao, Chaohu Liu, Huang Chen, Kun Yin, Hao Liu, Yinsong Liu, Deqiang Jiang, Xing Sun

Abstract: We propose SeRum (SElective Region Understanding Model), a novel end-to-end model for extracting meaningful information from document images, with applications in document analysis, retrieval, and office automation. Unlike state-of-the-art approaches that rely on computationally expensive multi-stage pipelines, SeRum converts document image understanding and recognition tasks into a local decoding process over the visual tokens of interest, using a content-aware token merge module. This mechanism lets the model concentrate on the regions of interest generated by the query decoder, which improves effectiveness and speeds up decoding in the generative scheme. We also design several pre-training tasks to enhance the model's understanding and local awareness. Experimental results demonstrate that SeRum achieves state-of-the-art performance on document understanding tasks and competitive results on text spotting tasks. SeRum represents a substantial advancement towards efficient and effective end-to-end document understanding.

* Accepted to ICCV 2023 main conference
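
The selective-region idea described in the abstract above can be illustrated with a toy sketch. This is not the SeRum implementation: the function name `merge_tokens`, the `keep` parameter, and the choice to average all low-scoring tokens into a single background token are assumptions made for illustration, and the relevance scores stand in for whatever the query decoder produces.

```python
from typing import List, Sequence, Tuple


def merge_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep: int,
) -> Tuple[List[List[float]], List[int]]:
    """Toy token merge (hypothetical, not the paper's module).

    Keeps the `keep` highest-scoring tokens unchanged and averages all
    remaining tokens into one background token appended at the end.
    Returns the shortened token list and the indices of the kept tokens.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")

    # Rank token indices by relevance score, highest first.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])  # restore original (spatial) order
    rest_idx = order[keep:]

    merged = [list(tokens[i]) for i in kept_idx]
    if rest_idx:
        dim = len(tokens[0])
        # Assumption: low-relevance tokens collapse into a single mean vector.
        background = [
            sum(tokens[i][d] for i in rest_idx) / len(rest_idx)
            for d in range(dim)
        ]
        merged.append(background)
    return merged, kept_idx


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
scores = [0.1, 0.9, 0.8, 0.3]
merged, kept = merge_tokens(tokens, scores, keep=2)
print(kept)    # indices of the tokens the decoder would attend to
print(merged)  # kept tokens followed by one merged background token
```

With fewer tokens passed on, a downstream decoder has a shorter sequence to attend over, which is the intuition behind the decoding speed-up claimed in the abstract.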

Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA

Apr 04, 2023

Yongxin Zhu, Zhen Liu, Yukang Liang, Xin Li, Hao Liu, Changcun Bao, Linli Xu

Abstract: In this paper, we propose a novel multi-modal framework for Scene Text Visual Question Answering (STVQA), which requires models to read scene text in images in order to answer questions. Unlike plain text or visual objects, which can exist independently, scene text naturally links the text and visual modalities: it conveys linguistic semantics while simultaneously being a visual object in an image. In contrast to conventional STVQA models, which treat the linguistic and visual semantics of scene text as two separate features, we propose a "Locate Then Generate" (LTG) paradigm that explicitly unifies these two semantics, using the spatial bounding box as a bridge between them. Specifically, LTG first locates the region of an image that may contain the answer words with an answer location module (ALM) consisting of a region proposal network and a language refinement network, both of which can be transformed into each other through a one-to-one mapping via the scene text bounding box. Next, given the answer words selected by the ALM, LTG generates a readable answer sequence with an answer generation module (AGM) based on a pre-trained language model. Thanks to this explicit alignment of visual and linguistic semantics, even without any scene-text-based pre-training tasks, LTG improves absolute accuracy by +6.06% and +6.92% on the TextVQA and ST-VQA datasets, respectively, compared with a non-pre-training baseline. We further demonstrate that LTG effectively unifies the visual and text modalities through the spatial bounding box connection, which is underappreciated in previous methods.

* Accepted to AAAI 2023
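
The two-step "locate, then generate" flow described in the abstract above can be sketched as a toy pipeline. This is only an illustration under stated assumptions, not the authors' method: `OcrToken`, `overlap_ratio`, `locate`, and `generate` are hypothetical names, the answer region is supplied by hand rather than predicted by a region proposal network, and a reading-order join stands in for the pre-trained language model used by the AGM.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass(frozen=True)
class OcrToken:
    """A scene-text word with its bounding box (hypothetical structure)."""
    text: str
    box: Box


def overlap_ratio(box: Box, region: Box) -> float:
    """Fraction of `box` area that lies inside `region`."""
    x0, y0 = max(box[0], region[0]), max(box[1], region[1])
    x1, y1 = min(box[2], region[2]), min(box[3], region[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / area if area > 0 else 0.0


def locate(tokens: List[OcrToken], region: Box, threshold: float = 0.5) -> List[OcrToken]:
    """Step 1 (locate): keep OCR tokens that mostly fall inside the answer region."""
    return [t for t in tokens if overlap_ratio(t.box, region) >= threshold]


def generate(selected: List[OcrToken]) -> str:
    """Step 2 (generate): stand-in for a language model.

    Assumption: sort tokens top-to-bottom, then left-to-right, and join them.
    """
    ordered = sorted(selected, key=lambda t: (t.box[1], t.box[0]))
    return " ".join(t.text for t in ordered)


tokens = [
    OcrToken("AHEAD", (55, 10, 110, 30)),
    OcrToken("STOP", (10, 10, 50, 30)),
    OcrToken("Main", (200, 200, 240, 220)),
]
answer_region: Box = (0, 0, 120, 40)  # hand-picked region for the example
answer = generate(locate(tokens, answer_region))
print(answer)
```

The bounding box is what ties the two steps together here: the same box that marks a word as a visual object also decides which words reach the text-generation step.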
