Title: Does Corpus Quality Really Matter for Low-Resource Languages?

Mar 15, 2022

Mikel Artetxe, Itziar Aldabe, Rodrigo Agerri, Olatz Perez-de-Viñaspre, Aitor Soroa

Abstract: The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. While prior work has identified major issues with the quality of these datasets (Kreutzer et al., 2021), it is not clear how this affects downstream performance. Taking Basque as a case study, we explore tailored crawling (manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl. Our new corpus, called EusCrawl, is similar in size to the Basque portion of popular multilingual corpora like CC100 and mC4, yet it is of much higher quality according to native annotators. For instance, 66% of documents are rated as high-quality for EusCrawl, in contrast with <33% for both mC4 and CC100. Nevertheless, we obtain similar results on downstream tasks regardless of the corpus used for pre-training. Our work suggests that NLU performance in low-resource languages is primarily constrained by the quantity rather than the quality of the data, which calls for methods to exploit more diverse data sources.
