Ossama Obeid

BALSAM: A Platform for Benchmarking Arabic Large Language Models

Jul 30, 2025

Rawan Al-Matham, Kareem Darwish, Raghad Al-Rasheed, Waad Alshammari, Muneera Alhoshan, Amal Almazrua, Asma Al Wazrah, Mais Alheraki, Firoj Alam, Preslav Nakov (+33 more)

Abstract: The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, due to data scarcity, linguistic diversity of Arabic and its dialects, morphological complexity, etc. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.

Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization

Jun 09, 2024

Salman Elgamal, Ossama Obeid, Tameem Kabbani, Go Inoue, Nizar Habash

Abstract: The widespread absence of diacritical marks in Arabic text poses a significant challenge for Arabic natural language processing (NLP). This paper explores instances of naturally occurring diacritics, referred to as "diacritics in the wild," to unveil patterns and latent information across six diverse genres: news articles, novels, children's books, poetry, political documents, and ChatGPT outputs. We present a new annotated dataset that maps real-world partially diacritized words to their maximal full diacritization in context. Additionally, we propose extensions to the analyze-and-disambiguate approach in Arabic NLP to leverage these diacritics, resulting in notable improvements. Our contributions encompass a thorough analysis, valuable datasets, and an extended diacritization algorithm. We release our code and datasets as open source.

* Accepted to ACL 2024

Camelira: An Arabic Multi-Dialect Morphological Disambiguator

Nov 30, 2022

Ossama Obeid, Go Inoue, Nizar Habash

Abstract: We present Camelira, a web-based Arabic multi-dialect morphological disambiguation tool that covers four major variants of Arabic: Modern Standard Arabic, Egyptian, Gulf, and Levantine. Camelira offers a user-friendly web interface that allows researchers and language learners to explore various linguistic information, such as part-of-speech, morphological features, and lemmas. Our system also provides an option to automatically choose an appropriate dialect-specific disambiguator based on the prediction of a dialect identification component. Camelira is publicly accessible at http://camelira.camel-lab.com.

The User-Aware Arabic Gender Rewriter

Oct 14, 2022

Bashar Alhafni, Ossama Obeid, Nizar Habash

Abstract: We introduce the User-Aware Arabic Gender Rewriter, a user-centric web-based system for Arabic gender rewriting in contexts involving two users. The system takes either Arabic or English sentences as input, and provides users with the ability to specify their desired first and/or second person target genders. The system outputs gender-rewritten alternatives of the Arabic input sentences (or their Arabic translations in case of English input) to match the target users' gender preferences.

MADARi: A Web Interface for Joint Arabic Morphological Annotation and Spelling Correction

Aug 25, 2018

Ossama Obeid, Salam Khalifa, Nizar Habash, Houda Bouamor, Wajdi Zaghouani, Kemal Oflazer

Abstract: In this paper, we introduce MADARi, a joint morphological annotation and spelling correction system for texts in Standard and Dialectal Arabic. The MADARi framework provides intuitive interfaces for annotating text and managing the annotation process of a large number of sizable documents. Morphological annotation includes indicating, for a word, in context, its baseword, clitics, part-of-speech, lemma, gloss, and dialect identification. MADARi has a suite of utilities to help with annotator productivity. For example, annotators are provided with pre-computed analyses to assist them in their task and reduce the amount of work needed to complete it. MADARi also allows annotators to query a morphological analyzer for a list of possible analyses in multiple dialects or look up previously submitted analyses. The MADARi management interface enables a lead annotator to easily manage and organize the whole annotation process remotely and concurrently. We describe the motivation, design, and implementation of this interface, and we present details from a user study working with this system.

* Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
