Kola Tubosun

ÌròyìnSpeech: A multi-purpose Yorùbá Speech Corpus

Jul 29, 2023

Tolulope Ogunremi, Kola Tubosun, Anuoluwapo Aremu, Iroro Orife, David Ifeoluwa Adelani

Abstract: We introduce the ÌròyìnSpeech corpus, a new dataset influenced by a desire to increase the amount of high quality, freely available, contemporary Yorùbá speech. We release a multi-purpose dataset that can be used for both TTS and ASR tasks. We curated text sentences from the news and creative writing domains under an open license, i.e., CC-BY-4.0, and had multiple speakers record each sentence. We provide 5000 of our utterances to the Common Voice platform to crowdsource transcriptions online. The dataset has 38.5 hours of data in total, recorded by 80 volunteers.

* working paper

Improving Yorùbá Diacritic Restoration

Mar 23, 2020

Iroro Orife, David I. Adelani, Timi Fasubaa, Victor Williamson, Wuraola Fisayo Oyewusi, Olamilekan Wahab, Kola Tubosun

Abstract: Yorùbá is a widely spoken West African language with a writing system rich in orthographic and tonal diacritics. These diacritics provide morphological information, are crucial for lexical disambiguation and pronunciation, and are vital for any computational Speech or Natural Language Processing task. However, diacritic marks are commonly excluded from electronic texts due to limited device and application support, as well as limited general education on proper usage. We report on recent efforts at dataset cultivation. By aggregating and improving disparate texts from the web and various personal libraries, we were able to significantly grow our clean Yorùbá dataset from a majority-Biblical text corpus with three sources to millions of tokens from over a dozen sources. We evaluate updated diacritic restoration models on a new, general-purpose, public-domain Yorùbá evaluation dataset of modern journalistic news text, selected to be multi-purpose and to reflect contemporary usage. All pre-trained models, datasets and source code have been released as an open-source project to advance efforts on Yorùbá language technology.

* Accepted to ICLR 2020 AfricaNLP workshop
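
To illustrate why dropped diacritics cause lexical ambiguity, here is a minimal Python sketch that strips diacritics using Unicode normalization. It is not the authors' restoration model or data pipeline; the function name `strip_diacritics` and the sample words are assumptions chosen for illustration. It only shows how distinct diacritized forms can collapse to the same plain string, which is the ambiguity a restoration model has to resolve.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (tone marks, under-dots) from text.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark and keep the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose so the result is in a canonical form.
    return unicodedata.normalize("NFC", stripped)


# Distinct diacritized forms (sample words are assumptions) become identical.
forms = ["ọkọ", "ọkọ́", "ọkọ̀"]
collapsed = {strip_diacritics(word) for word in forms}
print(collapsed)
print(strip_diacritics("Yorùbá"))
```

Pairs of (stripped, original) strings produced this way are one plausible way to build training examples for a restoration model, although the abstract does not say how the released datasets were prepared.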
