Title: From N-grams to Pre-trained Multilingual Models For Language Identification

Oct 11, 2024

Thapelo Sindane, Vukosi Marivate

Abstract: In this paper, we investigate the use of N-gram models and large pre-trained multilingual models for Language Identification (LID) across 11 South African languages. For N-gram models, this study shows that selecting an appropriate training data size remains crucial for establishing frequency distributions that model each target language effectively, thereby improving language ranking. For pre-trained multilingual models, we conduct extensive experiments covering a diverse set of massively pre-trained multilingual models (PLMs): mBERT, RemBERT, and XLM-r, as well as the Afri-centric multilingual models AfriBERTa, Afro-XLMr, AfroLM, and Serengeti. We further compare these models with available large-scale LID tools: Compact Language Detector v3 (CLD V3), AfroLID, GlotLID, and OpenLID, to highlight the importance of focus-based LID. From these experiments, we show that Serengeti is, on average, the best-performing model across all model families, from N-grams to Transformers. Moreover, we propose a lightweight BERT-based LID model (za_BERT_lid), trained on the NCHLT + Vuk'uzenzele corpus, which performs on par with our best-performing Afri-centric models.

* The paper has been accepted at The 4th International Conference on Natural Language Processing for Digital Humanities (NLP4DH 2024)
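
To make the N-gram side of the abstract concrete, the following is a minimal sketch of rank-based character N-gram language identification. It is a generic illustration and not the authors' implementation: the function names (`char_ngrams`, `build_profile`, `out_of_place`, `rank_languages`), the `top_k` profile size, the out-of-place distance, and the two toy training sentences are all assumptions introduced here. Each language is modelled by a frequency-ranked profile of its N-grams, and candidate languages are ranked by how closely a document's profile matches.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Yield character n-grams (lengths n_min..n_max) from padded, lower-cased text."""
    padded = f" {' '.join(text.lower().split())} "
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]


def build_profile(text, top_k=300):
    """Map the top_k most frequent n-grams to their rank (0 = most frequent)."""
    counts = Counter(char_ngrams(text))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, max_penalty):
    """Sum of rank differences; n-grams missing from the language get max_penalty."""
    total = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            total += abs(rank - lang_profile[gram])
        else:
            total += max_penalty
    return total


def rank_languages(text, lang_profiles, top_k=300):
    """Return language labels ordered from best match to worst match."""
    doc = build_profile(text, top_k)
    scores = {
        lang: out_of_place(doc, profile, top_k)
        for lang, profile in lang_profiles.items()
    }
    return sorted(scores, key=scores.get)


if __name__ == "__main__":
    # Toy, hypothetical training text; real profiles need far more data per language.
    training = {
        "eng": "the quick brown fox jumps over the lazy dog",
        "afr": "die vinnige bruin jakkals spring oor die lui hond",
    }
    profiles = {lang: build_profile(text) for lang, text in training.items()}
    print(rank_languages("the dog jumps", profiles))
```

In a sketch like this, the amount of training text per language and the profile size `top_k` are the knobs that shape the frequency distributions, which is the kind of data size sensitivity the abstract refers to.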
