Title: LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models

Jul 27, 2024

Ruiyi Zhang, Yufan Zhou, Jian Chen, Jiuxiang Gu, Changyou Chen, Tong Sun

Abstract: Large multimodal language models have demonstrated impressive capabilities in understanding and manipulating images. However, many of these models struggle to comprehend text-dense content embedded in images, primarily because of their limited text recognition and layout understanding abilities. To understand the sources of these limitations, we perform an exploratory analysis that shows the drawbacks of classical visual encoders on visual text understanding. We therefore present LLaVA-Read, a multimodal large language model that uses dual visual encoders along with a visual text encoder. Our model surpasses existing state-of-the-art models on various text-rich image understanding tasks, showing improved comprehension of textual content within images. Together, our results suggest that visual text understanding remains an open challenge and that an efficient visual text encoder is crucial for successful future multimodal systems.

* NeurIPS 2024 Under Review
