Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Walid Magdy

Revisiting Common Assumptions about Arabic Dialects in NLP

May 27, 2025

Amr Keleg, Sharon Goldwater, Walid Magdy

Figure 1 for Revisiting Common Assumptions about Arabic Dialects in NLP

Figure 2 for Revisiting Common Assumptions about Arabic Dialects in NLP

Figure 3 for Revisiting Common Assumptions about Arabic Dialects in NLP

Figure 4 for Revisiting Common Assumptions about Arabic Dialects in NLP

Abstract:Arabic has diverse dialects, where one dialect can be substantially different from the others. In the NLP literature, some assumptions about these dialects are widely adopted (e.g., ``Arabic dialects can be grouped into distinguishable regional dialects") and are manifested in different computational tasks such as Arabic Dialect Identification (ADI). However, these assumptions are not quantitatively verified. We identify four of these assumptions and examine them by extending and analyzing a multi-label dataset, where the validity of each sentence in 11 different country-level dialects is manually assessed by speakers of these dialects. Our analysis indicates that the four assumptions oversimplify reality, and some of them are not always accurate. This in turn might be hindering further progress in different Arabic NLP tasks.

* Accepted to ACL 2025

Via

Access Paper or Ask Questions

NADI 2024: The Fifth Nuanced Arabic Dialect Identification Shared Task

Jul 06, 2024

Muhammad Abdul-Mageed, Amr Keleg, AbdelRahim Elmadany, Chiyu Zhang, Injy Hamed, Walid Magdy, Houda Bouamor, Nizar Habash

Abstract:We describe the findings of the fifth Nuanced Arabic Dialect Identification Shared Task (NADI 2024). NADI's objective is to help advance SoTA Arabic NLP by providing guidance, datasets, modeling opportunities, and standardized evaluation conditions that allow researchers to collaboratively compete on pre-specified tasks. NADI 2024 targeted both dialect identification cast as a multi-label task (Subtask~1), identification of the Arabic level of dialectness (Subtask~2), and dialect-to-MSA machine translation (Subtask~3). A total of 51 unique teams registered for the shared task, of whom 12 teams have participated (with 76 valid submissions during the test phase). Among these, three teams participated in Subtask~1, three in Subtask~2, and eight in Subtask~3. The winning teams achieved 50.57 F\textsubscript{1} on Subtask~1, 0.1403 RMSE for Subtask~2, and 20.44 BLEU in Subtask~3, respectively. Results show that Arabic dialect processing tasks such as dialect identification and machine translation remain challenging. We describe the methods employed by the participating teams and briefly offer an outlook for NADI.

* Accepted by The Second Arabic Natural Language Processing Conference

Via

Access Paper or Ask Questions

Estimating the Level of Dialectness Predicts Interannotator Agreement in Multi-dialect Arabic Datasets

May 18, 2024

Amr Keleg, Walid Magdy, Sharon Goldwater

Figure 1 for Estimating the Level of Dialectness Predicts Interannotator Agreement in Multi-dialect Arabic Datasets

Figure 2 for Estimating the Level of Dialectness Predicts Interannotator Agreement in Multi-dialect Arabic Datasets

Figure 3 for Estimating the Level of Dialectness Predicts Interannotator Agreement in Multi-dialect Arabic Datasets

Abstract:On annotating multi-dialect Arabic datasets, it is common to randomly assign the samples across a pool of native Arabic speakers. Recent analyses recommended routing dialectal samples to native speakers of their respective dialects to build higher-quality datasets. However, automatically identifying the dialect of samples is hard. Moreover, the pool of annotators who are native speakers of specific Arabic dialects might be scarce. Arabic Level of Dialectness (ALDi) was recently introduced as a quantitative variable that measures how sentences diverge from Standard Arabic. On randomly assigning samples to annotators, we hypothesize that samples of higher ALDi scores are harder to label especially if they are written in dialects that the annotators do not speak. We test this by analyzing the relation between ALDi scores and the annotators' agreement, on 15 public datasets having raw individual sample annotations for various sentence-classification tasks. We find strong evidence supporting our hypothesis for 11 of them. Consequently, we recommend prioritizing routing samples of high ALDi scores to native speakers of each sample's dialect, for which the dialect could be automatically identified at higher accuracies.

* Accepted to ACL 2024 (Main)

Via

Access Paper or Ask Questions

ALDi: Quantifying the Arabic Level of Dialectness of Text

Oct 20, 2023

Amr Keleg, Sharon Goldwater, Walid Magdy

Figure 1 for ALDi: Quantifying the Arabic Level of Dialectness of Text

Figure 2 for ALDi: Quantifying the Arabic Level of Dialectness of Text

Figure 3 for ALDi: Quantifying the Arabic Level of Dialectness of Text

Figure 4 for ALDi: Quantifying the Arabic Level of Dialectness of Text

Abstract:Transcribed speech and user-generated text in Arabic typically contain a mixture of Modern Standard Arabic (MSA), the standardized language taught in schools, and Dialectal Arabic (DA), used in daily communications. To handle this variation, previous work in Arabic NLP has focused on Dialect Identification (DI) on the sentence or the token level. However, DI treats the task as binary, whereas we argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi), a continuous linguistic variable. We introduce the AOC-ALDi dataset (derived from the AOC dataset), containing 127,835 sentences (17% from news articles and 83% from user comments on those articles) which are manually labeled with their level of dialectness. We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora (including dialects and genres not included in AOC-ALDi), providing a more nuanced picture than traditional DI systems. Through case studies, we illustrate how ALDi can reveal Arabic speakers' stylistic choices in different situations, a useful property for sociolinguistic analyses.

* Accepted to EMNLP 2023

Via

Access Paper or Ask Questions

Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification

Oct 20, 2023

Amr Keleg, Walid Magdy

Figure 1 for Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification

Figure 2 for Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification

Figure 3 for Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification

Figure 4 for Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification

Abstract:Automatic Arabic Dialect Identification (ADI) of text has gained great popularity since it was introduced in the early 2010s. Multiple datasets were developed, and yearly shared tasks have been running since 2018. However, ADI systems are reported to fail in distinguishing between the micro-dialects of Arabic. We argue that the currently adopted framing of the ADI task as a single-label classification problem is one of the main reasons for that. We highlight the limitation of the incompleteness of the Dialect labels and demonstrate how it impacts the evaluation of ADI systems. A manual error analysis for the predictions of an ADI, performed by 7 native speakers of different Arabic dialects, revealed that $\approx$ 66% of the validated errors are not true errors. Consequently, we propose framing ADI as a multi-label classification task and give recommendations for designing new ADI datasets.

* Accepted to the ArabicNLP 2023 conference co-located with EMNLP 2023

Via

Access Paper or Ask Questions

DLAMA: A Framework for Curating Culturally Diverse Facts for Probing the Knowledge of Pretrained Language Models

Jun 08, 2023

Amr Keleg, Walid Magdy

Figure 1 for DLAMA: A Framework for Curating Culturally Diverse Facts for Probing the Knowledge of Pretrained Language Models

Figure 2 for DLAMA: A Framework for Curating Culturally Diverse Facts for Probing the Knowledge of Pretrained Language Models

Figure 3 for DLAMA: A Framework for Curating Culturally Diverse Facts for Probing the Knowledge of Pretrained Language Models

Figure 4 for DLAMA: A Framework for Curating Culturally Diverse Facts for Probing the Knowledge of Pretrained Language Models

Abstract:A few benchmarking datasets have been released to evaluate the factual knowledge of pretrained language models. These benchmarks (e.g., LAMA, and ParaRel) are mainly developed in English and later are translated to form new multilingual versions (e.g., mLAMA, and mParaRel). Results on these multilingual benchmarks suggest that using English prompts to recall the facts from multilingual models usually yields significantly better and more consistent performance than using non-English prompts. Our analysis shows that mLAMA is biased toward facts from Western countries, which might affect the fairness of probing models. We propose a new framework for curating factual triples from Wikidata that are culturally diverse. A new benchmark DLAMA-v1 is built of factual triples from three pairs of contrasting cultures having a total of 78,259 triples from 20 relation predicates. The three pairs comprise facts representing the (Arab and Western), (Asian and Western), and (South American and Western) countries respectively. Having a more balanced benchmark (DLAMA-v1) supports that mBERT performs better on Western facts than non-Western ones, while monolingual Arabic, English, and Korean models tend to perform better on their culturally proximate facts. Moreover, both monolingual and multilingual models tend to make a prediction that is culturally or geographically relevant to the correct label, even if the prediction is wrong.

* Accepted to ACL 2023 (Findings)

Via

Access Paper or Ask Questions

AX-MABSA: A Framework for Extremely Weakly Supervised Multi-label Aspect Based Sentiment Analysis

Nov 07, 2022

Sabyasachi Kamila, Walid Magdy, Sourav Dutta, MingXue Wang

Abstract:Aspect Based Sentiment Analysis is a dominant research area with potential applications in social media analytics, business, finance, and health. Prior works in this area are primarily based on supervised methods, with a few techniques using weak supervision limited to predicting a single aspect category per review sentence. In this paper, we present an extremely weakly supervised multi-label Aspect Category Sentiment Analysis framework which does not use any labelled data. We only rely on a single word per class as an initial indicative information. We further propose an automatic word selection technique to choose these seed categories and sentiment words. We explore unsupervised language model post-training to improve the overall performance, and propose a multi-label generator model to generate multiple aspect category-sentiment pairs per review sentence. Experiments conducted on four benchmark datasets showcase our method to outperform other weakly supervised baselines by a significant margin.

* to be published in EMNLP 2022

Via

Access Paper or Ask Questions

Don't Take it Personally: Analyzing Gender and Age Differences in Ratings of Online Humor

Aug 23, 2022

J. A. Meaney, Steven R. Wilson, Luis Chiruzzo, Walid Magdy

Figure 1 for Don't Take it Personally: Analyzing Gender and Age Differences in Ratings of Online Humor

Figure 2 for Don't Take it Personally: Analyzing Gender and Age Differences in Ratings of Online Humor

Figure 3 for Don't Take it Personally: Analyzing Gender and Age Differences in Ratings of Online Humor

Figure 4 for Don't Take it Personally: Analyzing Gender and Age Differences in Ratings of Online Humor

Abstract:Computational humor detection systems rarely model the subjectivity of humor responses, or consider alternative reactions to humor - namely offense. We analyzed a large dataset of humor and offense ratings by male and female annotators of different age groups. We find that women link these two concepts more strongly than men, and they tend to give lower humor ratings and higher offense scores. We also find that the correlation between humor and offense increases with age. Although there were no gender or age differences in humor detection, women and older annotators signalled that they did not understand joke texts more often than men. We discuss implications for computational humor detection and downstream tasks.

Via

Access Paper or Ask Questions

Black or White but never neutral: How readers perceive identity from yellow or skin-toned emoji

May 12, 2021

Alexander Robertson, Walid Magdy, Sharon Goldwater

Abstract:Research in sociology and linguistics shows that people use language not only to express their own identity but to understand the identity of others. Recent work established a connection between expression of identity and emoji usage on social media, through use of emoji skin tone modifiers. Motivated by that finding, this work asks if, as with language, readers are sensitive to such acts of self-expression and use them to understand the identity of authors. In behavioral experiments (n=488), where text and emoji content of social media posts were carefully controlled before being presented to participants, we find in the affirmative -- emoji are a salient signal of author identity. That signal is distinct from, and complementary to, the one encoded in language. Participant groups (based on self-identified ethnicity) showed no differences in how they perceive this signal, except in the case of the default yellow emoji. While both groups associate this with a White identity, the effect was stronger in White participants. Our finding that emoji can index social variables will have experimental applications for researchers but also implications for designers: supposedly ``neutral`` defaults may be more representative of some users than others.

Via

Access Paper or Ask Questions

Identity Signals in Emoji Do not Influence Perception of Factual Truth on Twitter

May 07, 2021

Alexander Robertson, Walid Magdy, Sharon Goldwater

Figure 1 for Identity Signals in Emoji Do not Influence Perception of Factual Truth on Twitter

Figure 2 for Identity Signals in Emoji Do not Influence Perception of Factual Truth on Twitter

Figure 3 for Identity Signals in Emoji Do not Influence Perception of Factual Truth on Twitter

Figure 4 for Identity Signals in Emoji Do not Influence Perception of Factual Truth on Twitter

Abstract:Prior work has shown that Twitter users use skin-toned emoji as an act of self-representation to express their racial/ethnic identity. We test whether this signal of identity can influence readers' perceptions about the content of a post containing that signal. In a large scale (n=944) pre-registered controlled experiment, we manipulate the presence of skin-toned emoji and profile photos in a task where readers rate obscure trivia facts (presented as tweets) as true or false. Using a Bayesian statistical analysis, we find that neither emoji nor profile photo has an effect on how readers rate these facts. This result will be of some comfort to anyone concerned about the manipulation of online users through the crafting of fake profiles.

Via

Access Paper or Ask Questions