Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kareem Darwish

BALSAM: A Platform for Benchmarking Arabic Large Language Models

Jul 30, 2025

Rawan Al-Matham, Kareem Darwish, Raghad Al-Rasheed, Waad Alshammari, Muneera Alhoshan, Amal Almazrua, Asma Al Wazrah, Mais Alheraki, Firoj Alam, Preslav Nakov(+33 more)

Abstract:The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, due to data scarcity, linguistic diversity of Arabic and its dialects, morphological complexity, etc. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.

Via

Access Paper or Ask Questions

Creating Arabic LLM Prompts at Scale

Aug 12, 2024

Abdelrahman El-Sheikh, Ahmed Elmogtaba, Kareem Darwish, Muhammad Elmallah, Ashraf Elneima, Hassan Sawaf

Figure 1 for Creating Arabic LLM Prompts at Scale

Abstract:The debut of chatGPT and BARD has popularized instruction following text generation using LLMs, where a user can interrogate an LLM using natural language requests and obtain natural language answers that matches their requests. Training LLMs to respond in this manner requires a large number of worked out examples of user requests (aka prompts) with corresponding gold responses. In this paper, we introduce two methods for creating such prompts for Arabic cheaply and quickly. The first methods entails automatically translating existing prompt datasets from English, such as PromptSource and Super-NaturalInstructions, and then using machine translation quality estimation to retain high quality translations only. The second method involves creating natural language prompts on top of existing Arabic NLP datasets. Using these two methods we were able to create more than 67.4 million Arabic prompts that cover a variety of tasks including summarization, headline generation, grammar checking, open/closed question answering, creative writing, etc. We show that fine tuning an open 7 billion parameter large language model, namely base Qwen2 7B, enables it to outperform a state-of-the-art 70 billion parameter instruction tuned model, namely Llama3 70B, in handling Arabic prompts.

Via

Access Paper or Ask Questions

An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation

Feb 26, 2024

Ahmet Gunduz, Kamer Ali Yuksel, Kareem Darwish, Golara Javadi, Fabio Minazzi, Nicola Sobieski, Sebastien Bratieres

Figure 1 for An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation

Figure 2 for An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation

Figure 3 for An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation

Figure 4 for An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation

Abstract:Data availability is crucial for advancing artificial intelligence applications, including voice-based technologies. As content creation, particularly in social media, experiences increasing demand, translation and text-to-speech (TTS) technologies have become essential tools. Notably, the performance of these TTS technologies is highly dependent on the quality of the training data, emphasizing the mutual dependence of data availability and technological progress. This paper introduces an end-to-end tool to generate high-quality datasets for text-to-speech (TTS) models to address this critical need for high-quality data. The contributions of this work are manifold and include: the integration of language-specific phoneme distribution into sample selection, automation of the recording process, automated and human-in-the-loop quality assurance of recordings, and processing of recordings to meet specified formats. The proposed application aims to streamline the dataset creation process for TTS models through these features, thereby facilitating advancements in voice-based technologies.

* 9 Pages, 6 Figures, 4 Tables, LREC-COLING 2024

Via

Access Paper or Ask Questions

NatiQ: An End-to-end Text-to-Speech System for Arabic

Jun 15, 2022

Ahmed Abdelali, Nadir Durrani, Cenk Demiroglu, Fahim Dalvi, Hamdy Mubarak, Kareem Darwish

Figure 1 for NatiQ: An End-to-end Text-to-Speech System for Arabic

Figure 2 for NatiQ: An End-to-end Text-to-Speech System for Arabic

Figure 3 for NatiQ: An End-to-end Text-to-Speech System for Arabic

Figure 4 for NatiQ: An End-to-end Text-to-Speech System for Arabic

Abstract:NatiQ is end-to-end text-to-speech system for Arabic. Our speech synthesizer uses an encoder-decoder architecture with attention. We used both tacotron-based models (tacotron-1 and tacotron-2) and the faster transformer model for generating mel-spectrograms from characters. We concatenated Tacotron1 with the WaveRNN vocoder, Tacotron2 with the WaveGlow vocoder and ESPnet transformer with the parallel wavegan vocoder to synthesize waveforms from the spectrograms. We used in-house speech data for two voices: 1) neutral male "Hamza"- narrating general content and news, and 2) expressive female "Amina"- narrating children story books to train our models. Our best systems achieve an average Mean Opinion Score (MOS) of 4.21 and 4.40 for Amina and Hamza respectively. The objective evaluation of the systems using word and character error rate (WER and CER) as well as the response time measured by real-time factor favored the end-to-end architecture ESPnet. NatiQ demo is available on-line at https://tts.qcri.org

Via

Access Paper or Ask Questions

Automatic Expansion and Retargeting of Arabic Offensive Language Training

Nov 18, 2021

Hamdy Mubarak, Ahmed Abdelali, Kareem Darwish, Younes Samih

Figure 1 for Automatic Expansion and Retargeting of Arabic Offensive Language Training

Figure 2 for Automatic Expansion and Retargeting of Arabic Offensive Language Training

Figure 3 for Automatic Expansion and Retargeting of Arabic Offensive Language Training

Abstract:Rampant use of offensive language on social media led to recent efforts on automatic identification of such language. Though offensive language has general characteristics, attacks on specific entities may exhibit distinct phenomena such as malicious alterations in the spelling of names. In this paper, we present a method for identifying entity specific offensive language. We employ two key insights, namely that replies on Twitter often imply opposition and some accounts are persistent in their offensiveness towards specific targets. Using our methodology, we are able to collect thousands of targeted offensive tweets. We show the efficacy of the approach on Arabic tweets with 13% and 79% relative F1-measure improvement in entity specific offensive language detection when using deep-learning based and support vector machine based classifiers respectively. Further, expanding the training set with automatically identified offensive tweets directed at multiple entities can improve F1-measure by 48%.

Via

Access Paper or Ask Questions

Cross-lingual Emotion Detection

Jun 10, 2021

Sabit Hassan, Shaden Shaar, Kareem Darwish

Figure 1 for Cross-lingual Emotion Detection

Figure 2 for Cross-lingual Emotion Detection

Figure 3 for Cross-lingual Emotion Detection

Figure 4 for Cross-lingual Emotion Detection

Abstract:Emotion detection is of great importance for understanding humans. Constructing annotated datasets to train automated models can be expensive. We explore the efficacy of cross-lingual approaches that would use data from a source language to build models for emotion detection in a target language. We compare three approaches, namely: i) using inherently multilingual models; ii) translating training data into the target language; and iii) using an automatically tagged parallel corpus. In our study, we consider English as the source language with Arabic and Spanish as target languages. We study the effectiveness of different classification models such as BERT and SVMs trained with different features. Our BERT-based monolingual models that are trained on target language data surpass state-of-the-art (SOTA) by 4% and 5% absolute Jaccard score for Arabic and Spanish respectively. Next, we show that using cross-lingual approaches with English data alone, we can achieve more than 90% and 80% relative effectiveness of the Arabic and Spanish BERT models respectively. Lastly, we use LIME to interpret the differences between models.

Via

Access Paper or Ask Questions

Pre-Training BERT on Arabic Tweets: Practical Considerations

Feb 21, 2021

Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish, Younes Samih

Figure 1 for Pre-Training BERT on Arabic Tweets: Practical Considerations

Figure 2 for Pre-Training BERT on Arabic Tweets: Practical Considerations

Figure 3 for Pre-Training BERT on Arabic Tweets: Practical Considerations

Figure 4 for Pre-Training BERT on Arabic Tweets: Practical Considerations

Abstract:Pretraining Bidirectional Encoder Representations from Transformers (BERT) for downstream NLP tasks is a non-trival task. We pretrained 5 BERT models that differ in the size of their training sets, mixture of formal and informal Arabic, and linguistic preprocessing. All are intended to support Arabic dialects and social media. The experiments highlight the centrality of data diversity and the efficacy of linguistically aware segmentation. They also highlight that more data or more training step do not necessitate better models. Our new models achieve new state-of-the-art results on several downstream tasks. The resulting models are released to the community under the name QARiB.

* 6 pages, 5 figures

Via

Access Paper or Ask Questions

BERT Transformer model for Detecting Arabic GPT2 Auto-Generated Tweets

Jan 22, 2021

Fouzi Harrag, Maria Debbah, Kareem Darwish, Ahmed Abdelali

Figure 1 for BERT Transformer model for Detecting Arabic GPT2 Auto-Generated Tweets

Figure 2 for BERT Transformer model for Detecting Arabic GPT2 Auto-Generated Tweets

Figure 3 for BERT Transformer model for Detecting Arabic GPT2 Auto-Generated Tweets

Figure 4 for BERT Transformer model for Detecting Arabic GPT2 Auto-Generated Tweets

Abstract:During the last two decades, we have progressively turned to the Internet and social media to find news, entertain conversations and share opinion. Recently, OpenAI has developed a ma-chine learning system called GPT-2 for Generative Pre-trained Transformer-2, which can pro-duce deepfake texts. It can generate blocks of text based on brief writing prompts that look like they were written by humans, facilitating the spread false or auto-generated text. In line with this progress, and in order to counteract potential dangers, several methods have been pro-posed for detecting text written by these language models. In this paper, we propose a transfer learning based model that will be able to detect if an Arabic sentence is written by humans or automatically generated by bots. Our dataset is based on tweets from a previous work, which we have crawled and extended using the Twitter API. We used GPT2-Small-Arabic to generate fake Arabic Sentences. For evaluation, we compared different recurrent neural network (RNN) word embeddings based baseline models, namely: LSTM, BI-LSTM, GRU and BI-GRU, with a transformer-based model. Our new transfer-learning model has obtained an accuracy up to 98%. To the best of our knowledge, this work is the first study where ARABERT and GPT2 were combined to detect and classify the Arabic auto-generated texts.

* Proceedings of the Fifth Arabic Natural Language Processing Workshop (WANLP @ COLING 2020)

Via

Access Paper or Ask Questions

A Panoramic Survey of Natural Language Processing in the Arab World

Nov 26, 2020

Kareem Darwish, Nizar Habash, Mourad Abbas, Hend Al-Khalifa, Huseein T. Al-Natsheh, Samhaa R. El-Beltagy, Houda Bouamor, Karim Bouzoubaa, Violetta Cavalli-Sforza, Wassim El-Hajj(+2 more)

Abstract:The term natural language refers to any system of symbolic communication (spoken, signed or written) without intentional human planning and design. This distinguishes natural languages such as Arabic and Japanese from artificially constructed languages such as Esperanto or Python. Natural language processing (NLP) is the sub-field of artificial intelligence (AI) focused on modeling natural languages to build applications such as speech recognition and synthesis, machine translation, optical character recognition (OCR), sentiment analysis (SA), question answering, dialogue systems, etc. NLP is a highly interdisciplinary field with connections to computer science, linguistics, cognitive science, psychology, mathematics and others. Some of the earliest AI applications were in NLP (e.g., machine translation); and the last decade (2010-2020) in particular has witnessed an incredible increase in quality, matched with a rise in public awareness, use, and expectations of what may have seemed like science fiction in the past. NLP researchers pride themselves on developing language independent models and tools that can be applied to all human languages, e.g. machine translation systems can be built for a variety of languages using the same basic mechanisms and models. However, the reality is that some languages do get more attention (e.g., English and Chinese) than others (e.g., Hindi and Swahili). Arabic, the primary language of the Arab world and the religious language of millions of non-Arab Muslims is somewhere in the middle of this continuum. Though Arabic NLP has many challenges, it has seen many successes and developments. Next we discuss Arabic's main challenges as a necessary background, and we present a brief history of Arabic NLP. We then survey a number of its research areas, and close with a critical discussion of the future of Arabic NLP.

Via

Access Paper or Ask Questions

Political Framing: US COVID19 Blame Game

Jul 19, 2020

Chereen Shurafa, Kareem Darwish, Wajdi Zaghouani

Figure 1 for Political Framing: US COVID19 Blame Game

Figure 2 for Political Framing: US COVID19 Blame Game

Figure 3 for Political Framing: US COVID19 Blame Game

Figure 4 for Political Framing: US COVID19 Blame Game

Abstract:Through the use of Twitter, framing has become a prominent presidential campaign tool for politically active users. Framing is used to influence thoughts by evoking a particular perspective on an event. In this paper, we show that the COVID19 pandemic rather than being viewed as a public health issue, political rhetoric surrounding it is mostly shaped through a blame frame (blame Trump, China, or conspiracies) and a support frame (support candidates) backing the agenda of Republican and Democratic users in the lead up to the 2020 presidential campaign. We elucidate the divergences between supporters of both parties on Twitter via the use of frames. Additionally, we show how framing is used to positively or negatively reinforce users' thoughts. We look at how Twitter can efficiently be used to identify frames for topics through a reproducible pipeline.

* Social Informatics 2020 (SocInfo2020)

Via

Access Paper or Ask Questions