Cesar Berrospi

IBM Research Zurich

Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems

Nov 29, 2024

Rafael Teixeira de Lima, Shubham Gupta, Cesar Berrospi, Lokesh Mishra, Michele Dolfi, Peter Staar, Panagiotis Vagenas

Abstract: Retrieval Augmented Generation (RAG) systems are a widespread application of Large Language Models (LLMs) in the industry. While many tools exist empowering developers to build their own systems, measuring their performance locally, with datasets reflective of the system's use cases, is a technological challenge. Solutions to this problem range from non-specific and cheap (most public datasets) to specific and costly (generating data from local documents). In this paper, we show that using public question and answer (Q&A) datasets to assess retrieval performance can lead to non-optimal systems design, and that common tools for RAG dataset generation can lead to unbalanced data. We propose solutions to these issues based on the characterization of RAG datasets through labels and through label-targeted data generation. Finally, we show that fine-tuned small LLMs can efficiently generate Q&A datasets. We believe that these observations are invaluable to the know-your-data step of RAG systems development.

* to be published in the 31st International Conference on Computational Linguistics (COLING 2025)

ESG Accountability Made Easy: DocQA at Your Service

Nov 30, 2023

Lokesh Mishra, Cesar Berrospi, Kasper Dinkla, Diego Antognini, Francesco Fusco, Benedikt Bothur, Maksym Lysak, Nikolaos Livathinos, Ahmed Nassar, Panagiotis Vagenas (+4 more)

Abstract: We present Deep Search DocQA. This application enables information extraction from documents via a question-answering conversational assistant. The system integrates several technologies from different AI disciplines consisting of document conversion to machine-readable format (via computer vision), finding relevant data (via natural language processing), and formulating an eloquent response (via large language models). Users can explore over 10,000 Environmental, Social, and Governance (ESG) disclosure reports from over 2000 corporations. The Deep Search platform can be accessed at: https://ds4sd.github.io.

* Accepted at the Demonstration Track of the 38th Annual AAAI Conference on Artificial Intelligence (AAAI 24)

Robust PDF Document Conversion Using Recurrent Neural Networks

Feb 18, 2021

Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, Peter Staar

Abstract: The number of published PDF documents has increased exponentially in recent decades. There is a growing need to make their rich content discoverable to information retrieval tools. In this paper, we present a novel approach to document structure recovery in PDF using recurrent neural networks to process the low-level PDF data representation directly, instead of relying on a visual re-interpretation of the rendered PDF page, as has been proposed in previous literature. We demonstrate how a sequence of PDF printing commands can be used as input into a neural network and how the network can learn to classify each printing command according to its structural function in the page. This approach has three advantages: First, it can distinguish among more fine-grained labels (typically 10-20 labels as opposed to 1-5 with visual methods), which results in a more accurate and detailed document structure resolution. Second, it can take into account the text flow across pages more naturally compared to visual methods because it can concatenate the printing commands of sequential pages. Last, our proposed method needs less memory and it is computationally less expensive than visual methods. This allows us to deploy such models in production environments at a much lower cost. Through extensive architectural search in combination with advanced feature engineering, we were able to implement a model that yields a weighted average F1 score of 97% across 17 distinct structural labels. The best model we achieved is currently served in production environments on our Corpus Conversion Service (CCS), which was presented at KDD18 (arXiv:1806.02284). This model enhances the capabilities of CCS significantly, as it eliminates the need for human annotated label ground-truth for every unseen document layout. This proved particularly useful when applied to a huge corpus of PDF articles related to COVID-19.

* 9 pages, 2 tables, 4 figures, uses aaai21.sty. Accepted at the "Thirty-Third Annual Conference on Innovative Applications of Artificial Intelligence (IAAI-21)". Received the "IAAI-21 Innovative Application Award"
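
For readers unfamiliar with the metric quoted in the abstract above, here is a minimal sketch of how a weighted average F1 score is commonly computed: the F1 of each label is weighted by that label's share of the ground-truth items. The abstract does not state which exact definition the authors used, so this is an assumption, and the function name and the three toy labels below are hypothetical (the paper reports 17 labels).

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-label F1 scores (assumed definition)."""
    support = Counter(y_true)  # number of ground-truth items per label
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Each label contributes in proportion to its support.
        score += (n / total) * f1
    return score


# Hypothetical toy labels, not taken from the paper.
y_true = ["text", "text", "text", "title", "table", "table"]
y_pred = ["text", "text", "title", "title", "table", "text"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because frequent labels carry more weight, a high weighted score can coexist with weaker performance on rare labels, which is worth keeping in mind when reading a single aggregate figure.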
