Go Inoue

BALSAM: A Platform for Benchmarking Arabic Large Language Models

Jul 30, 2025

Rawan Al-Matham, Kareem Darwish, Raghad Al-Rasheed, Waad Alshammari, Muneera Alhoshan, Amal Almazrua, Asma Al Wazrah, Mais Alheraki, Firoj Alam, Preslav Nakov(+33 more)

Abstract: The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, owing to factors such as data scarcity, the linguistic diversity of Arabic and its dialects, and morphological complexity. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development examples, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.

A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

Jun 08, 2025

Bhuiyan Sanjid Shafique, Ashmal Vayani, Muhammad Maaz, Hanoona Abdul Rasheed, Dinura Dissanayake, Mohammed Irfan Kurpath, Yahya Hmaiti, Go Inoue, Jean Lahoud, Md. Safirur Rashid(+19 more)

Abstract: Large multimodal models (LMMs) have recently gained attention due to their effectiveness in understanding and generating descriptions of visual content. Most existing LMMs target the English language. While a few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity is yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual video LMM benchmark, named ViMUL-Bench, to evaluate video LMMs across 14 languages, including both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. ViMUL-Bench is designed to rigorously test video LMMs across 15 categories, including eight culturally diverse categories, ranging from lifestyles and festivals to foods and rituals and from local landmarks to prominent cultural personalities. ViMUL-Bench comprises both open-ended (short and long-form) and multiple-choice questions spanning various video durations (short, medium, and long) with 8k samples that are manually verified by native language speakers. In addition, we introduce a machine-translated multilingual video training set comprising 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, that is shown to provide a better tradeoff between high- and low-resource languages for video understanding. We hope that ViMUL-Bench and our multilingual video LMM, along with the large-scale multilingual video training set, will help ease future research in developing culturally and linguistically inclusive multilingual video LMMs. Our proposed benchmark, video LMM, and training data will be publicly released at https://mbzuai-oryx.github.io/ViMUL/.

Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization

Jun 09, 2024

Salman Elgamal, Ossama Obeid, Tameem Kabbani, Go Inoue, Nizar Habash

Abstract: The widespread absence of diacritical marks in Arabic text poses a significant challenge for Arabic natural language processing (NLP). This paper explores instances of naturally occurring diacritics, referred to as "diacritics in the wild," to unveil patterns and latent information across six diverse genres: news articles, novels, children's books, poetry, political documents, and ChatGPT outputs. We present a new annotated dataset that maps real-world partially diacritized words to their maximal full diacritization in context. Additionally, we propose extensions to the analyze-and-disambiguate approach in Arabic NLP to leverage these diacritics, resulting in notable improvements. Our contributions encompass a thorough analysis, valuable datasets, and an extended diacritization algorithm. We release our code and datasets as open source.

* Accepted to ACL 2024
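
As a rough illustration of what "diacritics in the wild" look like at the character level, the sketch below finds whitespace-separated tokens that carry at least one Arabic diacritic and shows how to strip the marks. It is not the authors' algorithm or code. The set of code points treated as diacritics (U+064B to U+0652 plus the superscript alef U+0670) and the function names are assumptions made for this example; a real system may use a different inventory and a proper tokenizer.

```python
from typing import List

# Assumption: the common tashkeel marks U+064B..U+0652 plus the
# superscript alef U+0670 count as diacritics for this sketch.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def strip_diacritics(text: str) -> str:
    """Return the text with all marks in ARABIC_DIACRITICS removed."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def has_diacritics(word: str) -> bool:
    """True if the word contains at least one diacritic mark."""
    return any(ch in ARABIC_DIACRITICS for ch in word)


def find_diacritized_words(text: str) -> List[str]:
    """Return whitespace-separated tokens that carry any diacritic.

    Hypothetical helper: splitting on whitespace is a simplification.
    """
    return [word for word in text.split() if has_diacritics(word)]
```

For instance, given a two-word string in which only the first word carries a fatha, `find_diacritized_words` returns just that word, and `strip_diacritics` reduces it to its bare letters. Deciding whether a word is partially or fully diacritized needs linguistic knowledge that this sketch deliberately leaves out.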

Advancements in Arabic Grammatical Error Detection and Correction: An Empirical Investigation

May 24, 2023

Bashar Alhafni, Go Inoue, Christian Khairallah, Nizar Habash

Abstract: Grammatical error correction (GEC) is a well-explored problem in English with many existing models and datasets. However, research on GEC in morphologically rich languages has been limited due to challenges such as data scarcity and language complexity. In this paper, we present the first results on Arabic GEC using two newly developed Transformer-based pretrained sequence-to-sequence models. We also address the task of multi-class Arabic grammatical error detection (GED) and present the first results on it. We show that using GED information as auxiliary input in GEC models improves GEC performance across three datasets spanning different genres. We further investigate the use of contextual morphological preprocessing in aiding GEC systems. Our models achieve state-of-the-art results on two Arabic GEC shared-task datasets and establish a strong benchmark on a newly created dataset.

Camelira: An Arabic Multi-Dialect Morphological Disambiguator

Nov 30, 2022

Ossama Obeid, Go Inoue, Nizar Habash

Abstract: We present Camelira, a web-based Arabic multi-dialect morphological disambiguation tool that covers four major variants of Arabic: Modern Standard Arabic, Egyptian, Gulf, and Levantine. Camelira offers a user-friendly web interface that allows researchers and language learners to explore various linguistic information, such as part-of-speech, morphological features, and lemmas. Our system also provides an option to automatically choose an appropriate dialect-specific disambiguator based on the prediction of a dialect identification component. Camelira is publicly accessible at http://camelira.camel-lab.com.
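
The automatic selection step described in the abstract can be pictured as a simple dispatch: a dialect identifier predicts a label, and the label picks a dialect-specific disambiguator. The sketch below only illustrates that routing pattern and is not Camelira's implementation or API. The labels `msa`, `egy`, `glf`, and `lev`, the function `route_to_disambiguator`, and the fallback to `msa` are all hypothetical choices made for this example.

```python
from typing import Callable, Dict, Tuple

# Hypothetical type: a disambiguator maps a sentence to some analysis string.
Disambiguator = Callable[[str], str]


def route_to_disambiguator(
    sentence: str,
    identify_dialect: Callable[[str], str],
    disambiguators: Dict[str, Disambiguator],
    fallback: str = "msa",
) -> Tuple[str, str]:
    """Pick a disambiguator from the predicted dialect label.

    Returns the label actually used and the disambiguator's output.
    Falls back to `fallback` when the predicted label has no entry
    (an assumption of this sketch, not documented Camelira behavior).
    """
    predicted = identify_dialect(sentence)
    label = predicted if predicted in disambiguators else fallback
    return label, disambiguators[label](sentence)
```

In use, `disambiguators` would hold one callable per supported variety, and `identify_dialect` would wrap whatever dialect identification model is available; both are stand-ins here.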

Morphosyntactic Tagging with Pre-trained Language Models for Arabic and its Dialects

Oct 13, 2021

Go Inoue, Salam Khalifa, Nizar Habash

Abstract: We present state-of-the-art results on morphosyntactic tagging across different varieties of Arabic using fine-tuned pre-trained transformer language models. Our models consistently outperform existing systems in Modern Standard Arabic and all the Arabic dialects we study, achieving 2.6% absolute improvement over the previous state-of-the-art in Modern Standard Arabic, 2.8% in Gulf, 1.6% in Egyptian, and 7.0% in Levantine. We explore different training setups for fine-tuning pre-trained transformer language models, including training data size, the use of external linguistic resources, and the use of annotated data from other dialects in a low-resource scenario. Our results show that strategic fine-tuning using datasets from other high-resource dialects is beneficial for a low-resource dialect. Additionally, we show that high-quality morphological analyzers as external linguistic resources are beneficial, especially in low-resource settings.

The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models

Mar 11, 2021

Go Inoue, Bashar Alhafni, Nurpeiis Baimukan, Houda Bouamor, Nizar Habash

Abstract: In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models, by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.

* Accepted to WANLP 2021
