Olamilekan Wahab

Improving Yorùbá Diacritic Restoration

Mar 23, 2020

Iroro Orife, David I. Adelani, Timi Fasubaa, Victor Williamson, Wuraola Fisayo Oyewusi, Olamilekan Wahab, Kola Tubosun

Abstract: Yorùbá is a widely spoken West African language with a writing system rich in orthographic and tonal diacritics. These diacritics provide morphological information, are crucial for lexical disambiguation and pronunciation, and are vital for any computational speech or natural language processing task. However, diacritic marks are commonly excluded from electronic texts due to limited device and application support, as well as limited general education on proper usage. We report on recent efforts at dataset cultivation. By aggregating and improving disparate texts from the web and various personal libraries, we were able to significantly grow our clean Yorùbá dataset from a majority-Biblical text corpus with three sources to millions of tokens from over a dozen sources. We evaluate updated diacritic restoration models on a new, general-purpose, public-domain Yorùbá evaluation dataset of modern journalistic news text, selected to be multi-purpose and to reflect contemporary usage. All pre-trained models, datasets, and source code have been released as an open-source project to advance efforts on Yorùbá language technology.

* Accepted to ICLR 2020 AfricaNLP workshop

NASS-AI: Towards Digitization of Parliamentary Bills using Document Level Embedding and Bidirectional Long Short-Term Memory

Oct 02, 2019

Adewale Akinfaderin, Olamilekan Wahab

Abstract: There have been several reports in the Nigerian and international media about the Senators and House of Representatives members of the Nigerian National Assembly (NASS) being the highest paid in the world. Despite this high level of parliamentary compensation and a lack of oversight, most legislative duties, such as bills introduced and vote proceedings, are shrouded in mystery, with no open and annotated corpus. In this paper, we present results from ongoing research on the categorization of bills introduced in the Nigerian parliament since the start of the Fourth Republic (1999–2018). For this task, we employed a multi-step approach that involves extracting text from low- to medium-quality scanned and embedded PDFs using Optical Character Recognition (OCR) tools and labeling the documents into eight categories. We investigate the performance of document-level embedding for feature representation of the extracted texts, with a Bidirectional Long Short-Term Memory (Bi-LSTM) network as our classifier. The performance was further compared with other feature representation and machine learning techniques. We believe that these results are well positioned to have a substantial impact on the quest to meet the basic open data charter principles.

* Presented at NeurIPS 2019 Workshop on Machine Learning for the Developing World
