End-to-end visual information extraction (VIE) aims to integrate the hierarchical subtasks of VIE, namely text spotting, word grouping, and entity labeling, into a unified framework. Bridging the gaps among these three subtasks plays a pivotal role in designing an effective VIE model. OCR-dependent methods rely heavily on offline OCR engines and inevitably suffer from OCR errors, while OCR-free methods, particularly those employing a black-box model, may produce outputs that lack interpretability or contain hallucinated content. Inspired by CenterNet, DeepSolo, and ESP, we propose HIP, which models entities as HIerarchical Points to better conform to the hierarchical nature of the end-to-end VIE task. Specifically, these hierarchical points can be flexibly encoded and subsequently decoded into the desired text transcripts, centers of various regions, and categories of entities. Furthermore, we devise corresponding hierarchical pre-training strategies, categorized as image reconstruction, layout learning, and language enhancement, to reinforce the cross-modality representations of the hierarchical encoders. Quantitative experiments on public benchmarks demonstrate that HIP outperforms previous state-of-the-art methods, while qualitative results show its excellent interpretability.
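To make the decoding of hierarchical points concrete, the following is a minimal sketch of how encoded point queries might be mapped to the three outputs named above (transcripts, region centers, and entity categories). It is not the authors' implementation; the module name HierarchicalPointDecoder and all dimensions are illustrative assumptions.

```python
# A hedged sketch (not the HIP source code) of decoding hierarchical
# point queries into text transcripts, region centers, and entity labels.
import torch
import torch.nn as nn

class HierarchicalPointDecoder(nn.Module):  # name is a placeholder
    def __init__(self, d_model=256, vocab_size=97, num_classes=4):
        super().__init__()
        # Character head: predicts one transcript token per point query.
        self.char_head = nn.Linear(d_model, vocab_size)
        # Center head: regresses the (x, y) center of a word/entity region.
        self.center_head = nn.Linear(d_model, 2)
        # Entity head: assigns each point a semantic entity category.
        self.entity_head = nn.Linear(d_model, num_classes)

    def forward(self, point_feats):
        # point_feats: (batch, num_points, d_model) features produced by
        # the hierarchical encoders.
        chars = self.char_head(point_feats)                 # text transcripts
        centers = self.center_head(point_feats).sigmoid()   # normalized centers
        entities = self.entity_head(point_feats)            # entity categories
        return chars, centers, entities

# Usage: decode 100 point queries from a (hypothetical) encoder output.
decoder = HierarchicalPointDecoder()
feats = torch.randn(1, 100, 256)
chars, centers, entities = decoder(feats)
```

The sketch only illustrates the idea that a single set of point representations can serve all three subtasks through lightweight task-specific heads; the actual grouping and pre-training machinery described in the paper is omitted.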