Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Sarajane M. Peres

A RAG-Based Institutional Assistant

Jan 23, 2025

Gustavo Kuratomi, Paulo Pirozelli, Fabio G. Cozman, Sarajane M. Peres

Abstract:Although large language models (LLMs) demonstrate strong text generation capabilities, they struggle in scenarios requiring access to structured knowledge bases or specific documents, limiting their effectiveness in knowledge-intensive tasks. To address this limitation, retrieval-augmented generation (RAG) models have been developed, enabling generative models to incorporate relevant document fragments into their inputs. In this paper, we design and evaluate a RAG-based virtual assistant specifically tailored for the University of S\~ao Paulo. Our system architecture comprises two key modules: a retriever and a generative model. We experiment with different types of models for both components, adjusting hyperparameters such as chunk size and the number of retrieved documents. Our optimal retriever model achieves a Top-5 accuracy of 30%, while our most effective generative model scores 22.04\% against ground truth answers. Notably, when the correct document chunks are supplied to the LLMs, accuracy significantly improves to 54.02%, an increase of over 30 percentage points. Conversely, without contextual input, performance declines to 13.68%. These findings highlight the critical role of database access in enhancing LLM performance. They also reveal the limitations of current semantic search methods in accurately identifying relevant documents and underscore the ongoing challenges LLMs face in generating precise responses.

Via

Access Paper or Ask Questions

Benchmarks for Pirá 2.0, a Reading Comprehension Dataset about the Ocean, the Brazilian Coast, and Climate Change

Sep 19, 2023

Paulo Pirozelli, Marcos M. José, Igor Silveira, Flávio Nakasato, Sarajane M. Peres, Anarosa A. F. Brandão, Anna H. R. Costa, Fabio G. Cozman

Abstract:Pir\'a is a reading comprehension dataset focused on the ocean, the Brazilian coast, and climate change, built from a collection of scientific abstracts and reports on these topics. This dataset represents a versatile language resource, particularly useful for testing the ability of current machine learning models to acquire expert scientific knowledge. Despite its potential, a detailed set of baselines has not yet been developed for Pir\'a. By creating these baselines, researchers can more easily utilize Pir\'a as a resource for testing machine learning models across a wide range of question answering tasks. In this paper, we define six benchmarks over the Pir\'a dataset, covering closed generative question answering, machine reading comprehension, information retrieval, open question answering, answer triggering, and multiple choice question answering. As part of this effort, we have also produced a curated version of the original dataset, where we fixed a number of grammar issues, repetitions, and other shortcomings. Furthermore, the dataset has been extended in several new directions, so as to face the aforementioned benchmarks: translation of supporting texts from English into Portuguese, classification labels for answerability, automatic paraphrases of questions and answers, and multiple choice candidates. The results described in this paper provide several points of reference for researchers interested in exploring the challenges provided by the Pir\'a dataset.

* Accepted at Data Intelligence. Online ISSN 2641-435X

Via

Access Paper or Ask Questions

The BLue Amazon Brain : A Modular Architecture of Services about the Brazilian Maritime Territory

Sep 06, 2022

Paulo Pirozelli, Ais B. R. Castro, Ana Luiza C. de Oliveira, André S. Oliveira, Flávio N. Cação, Igor C. Silveira, João G. M. Campos, Laura C. Motheo, Leticia F. Figueiredo, Lucas F. A. O. Pellicer(+12 more)

Figure 1 for The BLue Amazon Brain : A Modular Architecture of Services about the Brazilian Maritime Territory

Figure 2 for The BLue Amazon Brain : A Modular Architecture of Services about the Brazilian Maritime Territory

Figure 3 for The BLue Amazon Brain : A Modular Architecture of Services about the Brazilian Maritime Territory

Abstract:We describe the first steps in the development of an artificial agent focused on the Brazilian maritime territory, a large region within the South Atlantic also known as the Blue Amazon. The "BLue Amazon Brain" (BLAB) integrates a number of services aimed at disseminating information about this region and its importance, functioning as a tool for environmental awareness. The main service provided by BLAB is a conversational facility that deals with complex questions about the Blue Amazon, called BLAB-Chat; its central component is a controller that manages several task-oriented natural language processing modules (e.g., question answering and summarizer systems). These modules have access to an internal data lake as well as to third-party databases. A news reporter (BLAB-Reporter) and a purposely-developed wiki (BLAB-Wiki) are also part of the BLAB service architecture. In this paper, we describe our current version of BLAB's architecture (interface, backend, web services, NLP modules, and resources) and comment on the challenges we have faced so far, such as the lack of training data and the scattered state of domain information. Solving these issues presents a considerable challenge in the development of artificial intelligence for technical domains.

* AI: Modeling Oceans and Climate Change (IJCAI-ECAI), 2022

Via

Access Paper or Ask Questions

Pirá: A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean

Feb 04, 2022

André F. A. Paschoal, Paulo Pirozelli, Valdinei Freire, Karina V. Delgado, Sarajane M. Peres, Marcos M. José, Flávio Nakasato, André S. Oliveira, Anarosa A. F. Brandão, Anna H. R. Costa(+1 more)

Figure 1 for Pirá: A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean

Figure 2 for Pirá: A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean

Figure 3 for Pirá: A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean

Figure 4 for Pirá: A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean

Abstract:Current research in natural language processing is highly dependent on carefully produced corpora. Most existing resources focus on English; some resources focus on languages such as Chinese and French; few resources deal with more than one language. This paper presents the Pir\'a dataset, a large set of questions and answers about the ocean and the Brazilian coast both in Portuguese and English. Pir\'a is, to the best of our knowledge, the first QA dataset with supporting texts in Portuguese, and, perhaps more importantly, the first bilingual QA dataset that includes this language. The Pir\'a dataset consists of 2261 properly curated question/answer (QA) sets in both languages. The QA sets were manually created based on two corpora: abstracts related to the Brazilian coast and excerpts of United Nation reports about the ocean. The QA sets were validated in a peer-review process with the dataset contributors. We discuss some of the advantages as well as limitations of Pir\'a, as this new resource can support a set of tasks in NLP such as question-answering, information retrieval, and machine translation.

* CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021
* https://github.com/C4AI/Pira

Via

Access Paper or Ask Questions