Title: Efficient End-to-End Visual Document Understanding with Rationale Distillation

Nov 16, 2023

Wang Zhu, Alekh Agarwal, Mandar Joshi, Robin Jia, Jesse Thomason, Kristina Toutanova

Abstract: Understanding visually situated language requires recognizing text and visual elements, and interpreting complex layouts. State-of-the-art methods commonly use specialized pre-processing tools, such as optical character recognition (OCR) systems, that map document image inputs to extracted information in the space of textual tokens, and sometimes also employ large language models (LLMs) to reason in text token space. However, the gains from external tools and LLMs come at the cost of increased computational and engineering complexity. In this paper, we ask whether small pretrained image-to-text models can learn selective text or layout recognition and reasoning as an intermediate inference step in an end-to-end model for pixel-level visual language understanding. We incorporate the outputs of such OCR tools, LLMs, and larger multimodal models as intermediate "rationales" on training data, and train a small student model to predict both rationales and answers for input questions based on those training examples. A student model based on Pix2Struct (282M parameters) achieves consistent improvements on three visual document understanding benchmarks representing infographics, scanned documents, and figures, with improvements of more than 4% absolute over a comparable Pix2Struct model that predicts answers directly.
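As a rough illustration of the training setup described in the abstract, the sketch below shows one way a student's decoder target could combine a tool-generated rationale with the answer, and how the answer might be recovered from the student's output at inference time. The `Example` class, the `ANSWER_MARKER` separator, and both function names are hypothetical and chosen only for illustration; the abstract does not specify the target format the authors use.

```python
from dataclasses import dataclass

# Hypothetical separator between rationale and answer.
# The abstract does not say how the two are delimited.
ANSWER_MARKER = " Answer: "


@dataclass
class Example:
    question: str
    rationale: str  # e.g. text produced by an OCR tool or an LLM on training data
    answer: str


def build_target(ex: Example) -> str:
    """Build a decoder target that states the rationale first, then the answer."""
    return f"{ex.rationale.strip()}{ANSWER_MARKER}{ex.answer.strip()}"


def extract_answer(decoded: str) -> str:
    """Return the text after the last marker, or the whole string if no marker is present."""
    _, sep, tail = decoded.rpartition(ANSWER_MARKER)
    return tail.strip() if sep else decoded.strip()


if __name__ == "__main__":
    ex = Example(
        question="What is the total?",
        rationale="Row 3 reads: Total 42",
        answer="42",
    )
    target = build_target(ex)
    print(target)
    print(extract_answer(target))
```

Placing the rationale before the answer means an autoregressive student generates its intermediate step before committing to an answer; a baseline that predicts answers directly would simply use `ex.answer` as the target.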

* 17 pages, 7 figures
