Topic: Text Extraction From Documents
What is Text Extraction From Documents? Text extraction from documents is the process of extracting text data from scanned documents or images.
Papers and Code
Feb 18, 2025
Abstract: After pre-training on extensive image-text pairs, Contrastive Language-Image Pre-training (CLIP) demonstrates promising performance on a wide variety of benchmarks. However, a substantial volume of non-paired data, such as multimodal interleaved documents, remains underutilized for vision-language representation learning. To fully leverage these unpaired documents, we first establish a Real-World Data Extraction pipeline to extract high-quality images and texts. Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts. To further enhance fine-grained visual information, we propose an image semantic augmented generation module for synthetic text production. Furthermore, we employ a semantic balance sampling strategy to improve dataset diversity, enabling better learning of long-tail concepts. Based on these innovations, we construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales: 15M, 30M, and 100M. Extensive experiments demonstrate that RealSyn effectively advances vision-language representation learning and exhibits strong scalability. Models pre-trained on RealSyn achieve state-of-the-art performance on multiple downstream tasks. To facilitate future research, the RealSyn dataset and pre-trained model weights are released at https://github.com/deepglint/RealSyn.
Via arXiv
Feb 17, 2025
Abstract: Inspirational quotes from famous individuals are often used to convey thoughts in news articles, essays, and everyday conversations. In this paper, we propose a novel context-based quote extraction system that aims to extract the most relevant quote from a long text. We formulate quote extraction as an open-domain question answering problem, first employing a vector-store-based retriever and then applying a multi-task reader. We curate three context-based quote extraction datasets and introduce a novel multi-task framework, RA-MTR, that improves on the state-of-the-art performance, achieving a maximum improvement of 5.08% in BoW F1-score.
Via arXiv
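The BoW F1-score mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the metric is the usual bag-of-words token-overlap F1 used in question answering evaluation; the paper's exact tokenization and normalization are not given here, so the lowercase whitespace split and the function name `bow_f1` are assumptions made for illustration.

```python
from collections import Counter


def bow_f1(predicted: str, reference: str) -> float:
    """Bag-of-words F1 between a predicted quote and a reference quote.

    Assumption: tokens are lowercase whitespace-separated words.
    """
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(bow_f1("the cat sat", "the cat"))
```

In this example the prediction shares two of its three tokens with the two-token reference, so precision is 2/3 and recall is 1.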
Feb 13, 2025
Abstract: Graph-RAG constructs a knowledge graph from text chunks to improve retrieval in Large Language Model (LLM)-based question answering. It is particularly useful in domains such as biomedicine, law, and political science, where retrieval often requires multi-hop reasoning over proprietary documents. Some existing Graph-RAG systems construct KNN graphs based on text chunk relevance, but this coarse-grained approach fails to capture entity relationships within texts, leading to sub-par retrieval and generation quality. To address this, recent solutions leverage LLMs to extract entities and relationships from text chunks, constructing triplet-based knowledge graphs. However, this approach incurs significant indexing costs, especially for large document collections. To maintain result accuracy while reducing the indexing cost, we propose KET-RAG, a multi-granular indexing framework. KET-RAG first identifies a small set of key text chunks and leverages an LLM to construct a knowledge graph skeleton. It then builds a text-keyword bipartite graph from all text chunks, serving as a lightweight alternative to a full knowledge graph. During retrieval, KET-RAG searches both structures: it follows the local search strategy of existing Graph-RAG systems on the skeleton while mimicking this search on the bipartite graph to improve retrieval quality. We evaluate eight solutions on two real-world datasets, demonstrating that KET-RAG outperforms all competitors in indexing cost, retrieval effectiveness, and generation quality. Notably, it achieves comparable or superior retrieval quality to Microsoft's Graph-RAG while reducing indexing costs by over an order of magnitude. Additionally, it improves the generation quality by up to 32.4% while lowering indexing costs by around 20%.
Via arXiv
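The text-keyword bipartite graph described in the abstract above can be pictured with a small sketch. This is not KET-RAG's implementation: the abstract does not say how keywords are selected or how the graph is searched, so the stopword list, the regex tokenizer, the shared-keyword scoring, and the names `build_bipartite_graph` and `retrieve` are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumption: a tiny hand-written stopword list stands in for real keyword selection.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on", "after"}


def keywords(text: str) -> set:
    """Lowercase alphanumeric tokens that are not stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def build_bipartite_graph(chunks: list) -> dict:
    """Map each keyword node to the set of chunk indices it is linked to."""
    keyword_to_chunks = defaultdict(set)
    for idx, chunk in enumerate(chunks):
        for kw in keywords(chunk):
            keyword_to_chunks[kw].add(idx)
    return keyword_to_chunks


def retrieve(query: str, chunks: list, graph: dict, top_k: int = 2) -> list:
    """Rank chunks by how many query keywords link to them (hypothetical scoring)."""
    hits = defaultdict(int)
    for kw in keywords(query):
        for idx in graph.get(kw, ()):
            hits[idx] += 1
    ranked = sorted(hits, key=lambda i: (-hits[i], i))[:top_k]
    return [chunks[i] for i in ranked]


if __name__ == "__main__":
    docs = [
        "Aspirin inhibits platelet aggregation.",
        "The court ruled on the patent dispute.",
        "Platelet counts drop after aspirin overdose.",
    ]
    graph = build_bipartite_graph(docs)
    print(retrieve("aspirin platelet", docs, graph))
```

Building this index needs no LLM calls, which is the sense in which such a structure is lightweight compared with a triplet-based knowledge graph.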
Feb 10, 2025
Abstract: Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical for modern AI systems, but manual curation struggles to scale with the rapid growth of scientific literature. This paper presents KARMA, a novel framework employing multi-agent large language models (LLMs) to automate KG enrichment through structured analysis of unstructured text. Our approach employs nine collaborative agents, spanning entity discovery, relation extraction, schema alignment, and conflict resolution, that iteratively parse documents, verify extracted knowledge, and integrate it into existing graph structures while adhering to domain-specific schema. Experiments on 1,200 PubMed articles from three different domains demonstrate the effectiveness of KARMA in knowledge graph enrichment, with the identification of up to 38,230 new entities while achieving 83.1% LLM-verified correctness and reducing conflict edges by 18.6% through multi-layer assessments.
* 24 pages, 3 figures, 2 tables
Via arXiv
Feb 06, 2025
Abstract: Optical Character Recognition (OCR) technology is widely used to extract text from images of documents, facilitating efficient digitization and data retrieval. However, merely extracting text is insufficient when dealing with complex documents. Fully comprehending such documents requires an understanding of their structure (including formatting, formulas, tables, and the reading order of multiple blocks and columns across multiple pages) as well as semantic information for detecting elements like footnotes and image captions. This comprehensive understanding is crucial for downstream tasks such as retrieval, document question answering, and data curation for training Large Language Models (LLMs) and Vision Language Models (VLMs). To address this, we introduce Éclair, a general-purpose text-extraction tool specifically designed to process a wide range of document types. Given an image, Éclair is able to extract formatted text in reading order, along with bounding boxes and their corresponding semantic classes. To thoroughly evaluate these novel capabilities, we introduce our diverse human-annotated benchmark for document-level OCR and semantic classification. Éclair achieves state-of-the-art accuracy on this benchmark, outperforming other methods across key metrics. Additionally, we evaluate Éclair on established benchmarks, demonstrating its versatility and strength across several evaluation standards.
Via arXiv
Feb 07, 2025
Abstract: Converting images of Arabic text into plain text is a widely researched topic in academia and industry. However, recognition of Arabic handwritten and printed text presents difficult challenges due to the complex nature of variations of the Arabic script. This work proposes an end-to-end solution that recognizes Arabic handwritten text, printed text, and Arabic numbers, and presents the data in a structured manner. We reached 81.66% precision, 78.82% recall, and 79.07% F-measure on the text detection task that powers the proposed solution. The proposed recognition model incorporates state-of-the-art CNN-based feature extraction and Transformer-based sequence modeling to accommodate variations in handwriting styles, stroke thicknesses, alignments, and noise conditions. The evaluation of the model suggests strong performance on both printed and handwritten texts, yielding 0.59% CER and 1.72% WER on printed text, and 7.91% CER and 31.41% WER on handwritten text. The overall solution has proven reliable in real-life OCR tasks. It is equipped with both detection and recognition models as well as supporting feature extraction and matching algorithms. Its general-purpose implementation makes the solution applicable to any Arabic document or receipt, handwritten or printed, so it is practical and useful in any given context.
Via arXiv
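The CER and WER figures in the abstract above are character and word error rates. A minimal sketch of how such rates are commonly computed follows, assuming the standard definition: Levenshtein edit distance divided by the reference length. The paper's exact normalization (for example, handling of diacritics or whitespace) is not stated, and the function names are chosen for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    print(cer("abcd", "abxd"), wer("the cat sat", "the cat sit"))
```

Because one wrong character makes the whole word wrong, WER is usually much higher than CER for the same output, which matches the gap between the two metrics reported for handwritten text.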
Feb 05, 2025
Abstract: Diagrams play a crucial role in visually conveying complex relationships and processes within business documentation. Despite recent advances in Vision-Language Models (VLMs) for various image understanding tasks, accurately identifying and extracting the structures and relationships depicted in diagrams continues to pose significant challenges. This study addresses these challenges by proposing a text-driven approach that bypasses reliance on VLMs' visual recognition capabilities. Instead, it utilizes the editable source files (such as xlsx, pptx, or docx), where diagram elements (e.g., shapes, lines, annotations) are preserved as textual metadata. In our proof-of-concept, we extracted diagram information from xlsx-based system design documents and transformed the extracted shape data into textual input for Large Language Models (LLMs). This approach allowed the LLM to analyze relationships and generate responses to business-oriented questions without the bottleneck of image-based processing. Experimental comparisons with a VLM-based method demonstrated that the proposed text-driven framework yielded more accurate answers for questions requiring detailed comprehension of diagram structures. The results obtained in this study are not limited to the tested .xlsx files but can also be extended to diagrams in other documents with source files, such as Office pptx and docx formats. These findings highlight the feasibility of circumventing VLM constraints through direct textual extraction from original source files. By enabling robust diagram understanding through LLMs, our method offers a promising path toward enhanced workflow efficiency and information analysis in real-world business scenarios.
* The related code is available at https://github.com/galirage/spreadsheet-intelligence, which provides the core library developed for this research. The experimental code using this library can be found at https://github.com/galirage/XMLDriven-Diagram-Understanding
Via arXiv
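The idea in the abstract above, reading diagram shapes from the source file instead of from an image, can be sketched with the Python standard library alone. This is not the authors' spreadsheet-intelligence library. It assumes the usual OOXML layout, in which an .xlsx file is a ZIP archive whose drawings live under `xl/drawings/` and whose shapes are `xdr:sp` elements; the function name `extract_shapes` and the output format are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespaces used by spreadsheet drawings.
NS = {
    "xdr": "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
}


def extract_shapes(xlsx_bytes: bytes) -> list:
    """Return the name and text of every shape found in the workbook's drawings."""
    shapes = []
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as archive:
        for name in archive.namelist():
            if not (name.startswith("xl/drawings/") and name.endswith(".xml")):
                continue
            root = ET.fromstring(archive.read(name))
            for sp in root.iter(f"{{{NS['xdr']}}}sp"):
                props = sp.find("xdr:nvSpPr/xdr:cNvPr", NS)
                # Concatenate all text runs inside the shape.
                text = "".join(t.text or "" for t in sp.iter(f"{{{NS['a']}}}t"))
                shapes.append({
                    "name": props.get("name") if props is not None else None,
                    "text": text,
                })
    return shapes


def shapes_to_prompt(shapes: list) -> str:
    """Flatten the shape list into plain text that could be passed to an LLM."""
    return "\n".join(f"- {s['name']}: {s['text']}" for s in shapes)
```

A fuller version would also read connector elements (`xdr:cxnSp`) and anchor positions so that relationships between shapes can be described, which this sketch leaves out.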
Feb 05, 2025
Abstract: In today's age of freely available information, policymakers have to take into account a huge amount of information when making decisions affecting relevant stakeholders. While an increase in the number of information sources and documents increases the credibility of decisions based on the corpus of available text, it is challenging for policymakers to make sense of this information. This paper demonstrates how policymakers can implement some of the most popular topic recognition methods (Latent Dirichlet Allocation and the Deep Distributed Representation method) and text summarization approaches (the Word Based Sentence Ranking method and TextRank for sentence extraction) to sum up the content of a large volume of documents and understand the gist of the information overload. We applied popular NLP methods to corporate press releases from the early and advanced periods of the Covid-19 pandemic, which resulted in an unprecedented global health and socio-economic crisis, a time when policymaking and regulations became especially important to standardize corporate practices for employee and social welfare in the face of similar unforeseen future crises. The steps undertaken in this study can be replicated to yield insights from relevant documents in any other social decision-making context.
* 7 Tables
Via arXiv
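TextRank for sentence extraction, one of the methods named in the abstract above, can be sketched in plain Python. This is a simplified illustration and not the study's code: the regex sentence splitter, the set-based word overlap, the guard for very short sentences, and the parameter defaults are assumptions. The similarity follows the usual TextRank form of shared words divided by the sum of the log sentence lengths.

```python
import math
import re


def split_sentences(text: str) -> list:
    """Naive splitter on sentence-final punctuation (assumption for this sketch)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(words_a: set, words_b: set) -> float:
    """Word overlap normalized by log sentence lengths; short sentences score 0."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank(sentences: list, top_k: int = 2, damping: float = 0.85,
             iterations: int = 50) -> list:
    """Return the top_k highest-ranked sentences in their original order."""
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    weights = [[similarity(words[i], words[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Weighted PageRank update over the sentence similarity graph.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]
```

Sentences that share vocabulary with many others accumulate score, so an off-topic sentence with no shared words stays at the baseline value and is not selected.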
Feb 05, 2025
Abstract: Objective: To evaluate the accuracy, computational cost and portability of a new Natural Language Processing (NLP) method for extracting medication information from clinical narratives. Materials and Methods: We propose an original transformer-based architecture for the extraction of entities and their relations pertaining to patients' medication regimen. First, we used this approach to train and evaluate a model on French clinical notes, using a newly annotated corpus from Hôpitaux Universitaires de Strasbourg. Second, the portability of the approach was assessed by conducting an evaluation on clinical documents in English from the 2018 n2c2 shared task. Information extraction accuracy and computational cost were assessed by comparison with an available method using transformers. Results: On the task of relation extraction itself, the proposed architecture achieves performance that is competitive with the state of the art on both French and English (F-measures 0.82 and 0.96 vs 0.81 and 0.95), but reduces the computational cost by 10. End-to-end (named entity recognition and relation extraction) F1 performance is 0.69 and 0.82 for the French and English corpora. Discussion: While an existing system developed for English notes was deployed in a French hospital setting with reasonable effort, we found that an alternative architecture offered end-to-end drug information extraction with comparable extraction performance and lower computational impact for both French and English clinical text processing. Conclusion: The proposed architecture can be used to extract medication information from clinical text with high performance and low computational cost, and consequently suits the usually limited hospital IT resources.
* Submitted to JAMIA, 17 pages, 3 figures, 2 tables and 5 supplementary tables
Via arXiv
Feb 03, 2025
Abstract: Hierarchical Merging is a technique commonly used to summarize very long texts (>100K tokens) by breaking down the input into smaller sections, summarizing those sections individually, and then merging or combining those summaries into a final coherent summary. Although it helps address the limitations of large language models (LLMs) with fixed input length constraints, the recursive merging process can amplify LLM hallucinations, increasing the risk of factual inaccuracies. In this paper, we seek to mitigate hallucinations by enriching hierarchical merging with context from the source document. Specifically, we propose different approaches to contextual augmentation ranging from replacing intermediate summaries with relevant input context, to refining them while using the context as supporting evidence, and aligning them implicitly (via citations) to the input. Experimental results on datasets representing legal and narrative domains show that contextual augmentation consistently outperforms zero-shot and hierarchical merging baselines for the Llama 3.1 model family. Our analysis further reveals that refinement methods tend to perform best when paired with extractive summarization for identifying relevant input.
* 30 pages
Via arXiv
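The baseline hierarchical merging procedure described in the abstract above (split, summarize each section, then merge summaries repeatedly) can be sketched as follows. This covers only the baseline, not the paper's contextual augmentation. The `summarize` callable stands in for an LLM call, and the word-based chunking, the `fan_in` parameter, and the function names are assumptions made for illustration.

```python
from typing import Callable, List


def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def hierarchical_merge(text: str, summarize: Callable[[str], str],
                       chunk_size: int = 1000, fan_in: int = 2) -> str:
    """Summarize each chunk, then merge groups of fan_in summaries until one remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 so that merging terminates")
    summaries = [summarize(chunk) for chunk in chunk_words(text, chunk_size)]
    while len(summaries) > 1:
        # Each pass concatenates neighbouring summaries and summarizes the result.
        summaries = [summarize(" ".join(summaries[i:i + fan_in]))
                     for i in range(0, len(summaries), fan_in)]
    return summaries[0] if summaries else ""


if __name__ == "__main__":
    # Stand-in summarizer for demonstration: keep the first five words.
    def first_five(t: str) -> str:
        return " ".join(t.split()[:5])

    document = " ".join(f"w{i}" for i in range(40))
    print(hierarchical_merge(document, first_five, chunk_size=10))
```

Every merge pass sees only earlier summaries and never the source text, which is why errors introduced at one level can propagate upward. The contextual augmentation approaches in the abstract address this by bringing source context back into the merge step.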