Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Badr M. Abdullah

Voice Conversion Improves Cross-Domain Robustness for Spoken Arabic Dialect Identification

May 30, 2025

Badr M. Abdullah, Matthew Baas, Bernd Möbius, Dietrich Klakow

Abstract:Arabic dialect identification (ADI) systems are essential for large-scale data collection pipelines that enable the development of inclusive speech technologies for Arabic language varieties. However, the reliability of current ADI systems is limited by poor generalization to out-of-domain speech. In this paper, we present an effective approach based on voice conversion for training ADI models that achieves state-of-the-art performance and significantly improves robustness in cross-domain scenarios. Evaluated on a newly collected real-world test set spanning four different domains, our approach yields consistent improvements of up to +34.1% in accuracy across domains. Furthermore, we present an analysis of our approach and demonstrate that voice conversion helps mitigate the speaker bias in the ADI dataset. We release our robust ADI model and cross-domain evaluation dataset to support the development of inclusive speech technologies for Arabic.

* Accepted in Interspeech 2025

Via

Access Paper or Ask Questions

Attention on Multiword Expressions: A Multilingual Study of BERT-based Models with Regard to Idiomaticity and Microsyntax

May 09, 2025

Iuliia Zaitova, Vitalii Hirak, Badr M. Abdullah, Dietrich Klakow, Bernd Möbius, Tania Avgustinova

Abstract:This study analyzes the attention patterns of fine-tuned encoder-only models based on the BERT architecture (BERT-based models) towards two distinct types of Multiword Expressions (MWEs): idioms and microsyntactic units (MSUs). Idioms present challenges in semantic non-compositionality, whereas MSUs demonstrate unconventional syntactic behavior that does not conform to standard grammatical categorizations. We aim to understand whether fine-tuning BERT-based models on specific tasks influences their attention to MWEs, and how this attention differs between semantic and syntactic tasks. We examine attention scores to MWEs in both pre-trained and fine-tuned BERT-based models. We utilize monolingual models and datasets in six Indo-European languages - English, German, Dutch, Polish, Russian, and Ukrainian. Our results show that fine-tuning significantly influences how models allocate attention to MWEs. Specifically, models fine-tuned on semantic tasks tend to distribute attention to idiomatic expressions more evenly across layers. Models fine-tuned on syntactic tasks show an increase in attention to MSUs in the lower layers, corresponding with syntactic processing requirements.

* In Findings of the Association for Computational Linguistics: NAACL 2025, pages 4083–4092, Albuquerque, New Mexico https://aclanthology.org/2025.findings-naacl.228/
* 10 pages, 3 figures. Findings 2025

Via

Access Paper or Ask Questions

On the Encoding of Gender in Transformer-based ASR Representations

Jun 14, 2024

Aravind Krishnan, Badr M. Abdullah, Dietrich Klakow

Abstract:While existing literature relies on performance differences to uncover gender biases in ASR models, a deeper analysis is essential to understand how gender is encoded and utilized during transcript generation. This work investigates the encoding and utilization of gender in the latent representations of two transformer-based ASR models, Wav2Vec2 and HuBERT. Using linear erasure, we demonstrate the feasibility of removing gender information from each layer of an ASR model and show that such an intervention has minimal impacts on the ASR performance. Additionally, our analysis reveals a concentration of gender information within the first and last frames in the final layers, explaining the ease of erasing gender in these layers. Our findings suggest the prospect of creating gender-neutral embeddings that can be integrated into ASR frameworks without compromising their efficacy.

* Accepted at Interspeech 2024

Via

Access Paper or Ask Questions

Self-supervised Adaptive Pre-training of Multilingual Speech Models for Language and Dialect Identification

Dec 12, 2023

Mohammed Maqsood Shaik, Dietrich Klakow, Badr M. Abdullah

Abstract:Pre-trained Transformer-based speech models have shown striking performance when fine-tuned on various downstream tasks such as automatic speech recognition and spoken language identification (SLID). However, the problem of domain mismatch remains a challenge in this area, where the domain of the pre-training data might differ from that of the downstream labeled data used for fine-tuning. In multilingual tasks such as SLID, the pre-trained speech model may not support all the languages in the downstream task. To address this challenge, we propose self-supervised adaptive pre-training (SAPT) to adapt the pre-trained model to the target domain and languages of the downstream task. We apply SAPT to the XLSR-128 model and investigate the effectiveness of this approach for the SLID task. First, we demonstrate that SAPT improves XLSR performance on the FLEURS benchmark with substantial gains up to 40.1% for under-represented languages. Second, we apply SAPT on four different datasets in a few-shot learning setting, showing that our approach improves the sample efficiency of XLSR during fine-tuning. Our experiments provide strong empirical evidence that continual adaptation via self-supervision improves downstream performance for multilingual speech models.

* Submitted to ICASSP 2024

Via

Access Paper or Ask Questions

An Information-Theoretic Analysis of Self-supervised Discrete Representations of Speech

Jun 04, 2023

Badr M. Abdullah, Mohammed Maqsood Shaik, Bernd Möbius, Dietrich Klakow

Abstract:Self-supervised representation learning for speech often involves a quantization step that transforms the acoustic input into discrete units. However, it remains unclear how to characterize the relationship between these discrete units and abstract phonetic categories such as phonemes. In this paper, we develop an information-theoretic framework whereby we represent each phonetic category as a distribution over discrete units. We then apply our framework to two different self-supervised models (namely wav2vec 2.0 and XLSR) and use American English speech as a case study. Our study demonstrates that the entropy of phonetic distributions reflects the variability of the underlying speech sounds, with phonetically similar sounds exhibiting similar distributions. While our study confirms the lack of direct, one-to-one correspondence, we find an intriguing, indirect relationship between phonetic categories and discrete units.

* Accepted in Interspeech 2023

Via

Access Paper or Ask Questions

Developing the Reliable Shallow Supervised Learning for Thermal Comfort using ASHRAE RP-884 and ASHRAE Global Thermal Comfort Database II

Mar 03, 2023

Kanisius Karyono, Badr M. Abdullah, Alison J. Cotgrave, Ana Bras, Jeff Cullen

Figure 1 for Developing the Reliable Shallow Supervised Learning for Thermal Comfort using ASHRAE RP-884 and ASHRAE Global Thermal Comfort Database II

Figure 2 for Developing the Reliable Shallow Supervised Learning for Thermal Comfort using ASHRAE RP-884 and ASHRAE Global Thermal Comfort Database II

Figure 3 for Developing the Reliable Shallow Supervised Learning for Thermal Comfort using ASHRAE RP-884 and ASHRAE Global Thermal Comfort Database II

Figure 4 for Developing the Reliable Shallow Supervised Learning for Thermal Comfort using ASHRAE RP-884 and ASHRAE Global Thermal Comfort Database II

Abstract:The artificial intelligence (AI) system designer for thermal comfort faces insufficient data recorded from the current user or overfitting due to unreliable training data. This work introduces the reliable data set for training the AI subsystem for thermal comfort. This paper presents the control algorithm based on shallow supervised learning, which is simple enough to be implemented in the Internet of Things (IoT) system for residential usage using ASHRAE RP-884 and ASHRAE Global Thermal Comfort Database II. No training data for thermal comfort is available as reliable as this dataset, but the direct use of this data can lead to overfitting. This work offers the algorithm for data filtering and semantic data augmentation for the ASHRAE database for the supervised learning process. Overfitting always becomes a problem due to the psychological aspect involved in the thermal comfort decision. The method to check the AI system based on the psychrometric chart against overfitting is presented. This paper also assesses the most important parameters needed to achieve human thermal comfort. This method can support the development of reinforced learning for thermal comfort.

* 15 pages with Appendix

Via

Access Paper or Ask Questions

Analyzing the Representational Geometry of Acoustic Word Embeddings

Jan 08, 2023

Badr M. Abdullah, Dietrich Klakow

Abstract:Acoustic word embeddings (AWEs) are vector representations such that different acoustic exemplars of the same word are projected nearby in the embedding space. In addition to their use in speech technology applications such as spoken term discovery and keyword spotting, AWE models have been adopted as models of spoken-word processing in several cognitively motivated studies and have been shown to exhibit human-like performance in some auditory processing tasks. Nevertheless, the representational geometry of AWEs remains an under-explored topic that has not been studied in the literature. In this paper, we take a closer analytical look at AWEs learned from English speech and study how the choice of the learning objective and the architecture shapes their representational profile. To this end, we employ a set of analytic techniques from machine learning and neuroscience in three different analyses: embedding space uniformity, word discriminability, and representational consistency. Our main findings highlight the prominent role of the learning objective on shaping the representation profile compared to the model architecture.

* In BlackboxNLP workshop, EMNLP 2022 [ oral presentation ]

Via

Access Paper or Ask Questions

Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word Embeddings

Sep 18, 2022

Badr M. Abdullah, Bernd Möbius, Dietrich Klakow

Figure 1 for Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word Embeddings

Figure 2 for Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word Embeddings

Figure 3 for Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word Embeddings

Figure 4 for Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word Embeddings

Abstract:Models of acoustic word embeddings (AWEs) learn to map variable-length spoken word segments onto fixed-dimensionality vector representations such that different acoustic exemplars of the same word are projected nearby in the embedding space. In addition to their speech technology applications, AWE models have been shown to predict human performance on a variety of auditory lexical processing tasks. Current AWE models are based on neural networks and trained in a bottom-up approach that integrates acoustic cues to build up a word representation given an acoustic or symbolic supervision signal. Therefore, these models do not leverage or capture high-level lexical knowledge during the learning process. In this paper, we propose a multi-task learning model that incorporates top-down lexical knowledge into the training procedure of AWEs. Our model learns a mapping between the acoustic input and a lexical representation that encodes high-level information such as word semantics in addition to bottom-up form-based supervision. We experiment with three languages and demonstrate that incorporating lexical knowledge improves the embedding space discriminability and encourages the model to better separate lexical categories.

* Accepted in INTERSPEECH 2022

Via

Access Paper or Ask Questions

How Familiar Does That Sound? Cross-Lingual Representational Similarity Analysis of Acoustic Word Embeddings

Sep 21, 2021

Badr M. Abdullah, Iuliia Zaitova, Tania Avgustinova, Bernd Möbius, Dietrich Klakow

Figure 1 for How Familiar Does That Sound? Cross-Lingual Representational Similarity Analysis of Acoustic Word Embeddings

Figure 2 for How Familiar Does That Sound? Cross-Lingual Representational Similarity Analysis of Acoustic Word Embeddings

Figure 3 for How Familiar Does That Sound? Cross-Lingual Representational Similarity Analysis of Acoustic Word Embeddings

Figure 4 for How Familiar Does That Sound? Cross-Lingual Representational Similarity Analysis of Acoustic Word Embeddings

Abstract:How do neural networks "perceive" speech sounds from unknown languages? Does the typological similarity between the model's training language (L1) and an unknown language (L2) have an impact on the model representations of L2 speech signals? To answer these questions, we present a novel experimental design based on representational similarity analysis (RSA) to analyze acoustic word embeddings (AWEs) -- vector representations of variable-duration spoken-word segments. First, we train monolingual AWE models on seven Indo-European languages with various degrees of typological similarity. We then employ RSA to quantify the cross-lingual similarity by simulating native and non-native spoken-word processing using AWEs. Our experiments show that typological similarity indeed affects the representational similarity of the models in our study. We further discuss the implications of our work on modeling speech processing and language similarity with neural networks.

* BlackboxNLP 2021

Via

Access Paper or Ask Questions

Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study

Jun 16, 2021

Badr M. Abdullah, Marius Mosbach, Iuliia Zaitova, Bernd Möbius, Dietrich Klakow

Figure 1 for Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study

Figure 2 for Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study

Figure 3 for Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study

Figure 4 for Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study

Abstract:Several variants of deep neural networks have been successfully employed for building parametric models that project variable-duration spoken word segments onto fixed-size vector representations, or acoustic word embeddings (AWEs). However, it remains unclear to what degree we can rely on the distance in the emerging AWE space as an estimate of word-form similarity. In this paper, we ask: does the distance in the acoustic embedding space correlate with phonological dissimilarity? To answer this question, we empirically investigate the performance of supervised approaches for AWEs with different neural architectures and learning objectives. We train AWE models in controlled settings for two languages (German and Czech) and evaluate the embeddings on two tasks: word discrimination and phonological similarity. Our experiments show that (1) the distance in the embedding space in the best cases only moderately correlates with phonological distance, and (2) improving the performance on the word discrimination task does not necessarily yield models that better reflect word phonological similarity. Our findings highlight the necessity to rethink the current intrinsic evaluations for AWEs.

* Accepted in Interspeech 2021

Via

Access Paper or Ask Questions