Title: Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers

Apr 10, 2024

Mirelle Bueno, Eduardo Seiti de Oliveira, Rodrigo Nogueira, Roberto A. Lotufo, Jayr Alencar Pereira

Abstract: Despite Portuguese being one of the most spoken languages in the world, there is a lack of high-quality information retrieval datasets in that language. We present Quati, a dataset specifically designed for the Brazilian Portuguese language. It comprises a collection of queries formulated by native speakers and a curated set of documents sourced from a selection of high-quality Brazilian Portuguese websites. These websites are more likely to be visited by real users than randomly scraped ones, which makes the corpus more representative and relevant. To label the query-document pairs, we use a state-of-the-art LLM, which shows inter-annotator agreement levels comparable to those of human annotators in our assessments. We provide a detailed description of our annotation methodology to enable others to create similar datasets for other languages, offering a cost-effective way of creating high-quality IR datasets with an arbitrary number of labeled documents per query. Finally, we evaluate a diverse range of open-source and commercial retrievers to serve as baseline systems. The dataset is publicly available at https://huggingface.co/datasets/unicamp-dl/quati and all scripts are available at https://github.com/unicamp-dl/quati.
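
The agreement comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name the agreement statistic, so Cohen's kappa is assumed here as one common choice. The function name `cohen_kappa`, the 0 to 3 relevance scale, and the toy label lists are all hypothetical and are not taken from the paper or its scripts.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same items.

    Assumption: this statistic is chosen for illustration only; the
    abstract does not say which agreement measure the authors used.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical relevance labels (0 = irrelevant ... 3 = highly relevant)
# for eight query-document pairs; not real Quati annotations.
human_labels = [0, 1, 2, 3, 1, 0, 2, 3]
llm_labels = [0, 1, 2, 2, 1, 0, 3, 3]

kappa = cohen_kappa(human_labels, llm_labels)
print(round(kappa, 3))  # 0.667 for this toy data
```

Computing the same statistic between pairs of human annotators would give the reference level against which an LLM annotator's score could be compared.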

* 22 pages
