Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Amir Hussein

Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization

Sep 27, 2023

Amir Hussein, Brian Yan, Antonios Anastasopoulos, Shinji Watanabe, Sanjeev Khudanpur

Figure 1 for Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization

Figure 2 for Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization

Figure 3 for Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization

Figure 4 for Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization

Abstract:Incorporating longer context has been shown to benefit machine translation, but the inclusion of context in end-to-end speech translation (E2E-ST) remains under-studied. To bridge this gap, we introduce target language context in E2E-ST, enhancing coherence and overcoming memory constraints of extended audio segments. Additionally, we propose context dropout to ensure robustness to the absence of context, and further improve performance by adding speaker information. Our proposed contextual E2E-ST outperforms the isolated utterance-based E2E-ST approach. Lastly, we demonstrate that in conversational speech, contextual information primarily contributes to capturing context style, as well as resolving anaphora and named entities.

Via

Access Paper or Ask Questions

Speech collage: code-switched audio generation by collaging monolingual corpora

Sep 27, 2023

Amir Hussein, Dorsa Zeinali, Ondřej Klejch, Matthew Wiesner, Brian Yan, Shammur Chowdhury, Ahmed Ali, Shinji Watanabe, Sanjeev Khudanpur

Figure 1 for Speech collage: code-switched audio generation by collaging monolingual corpora

Figure 2 for Speech collage: code-switched audio generation by collaging monolingual corpora

Figure 3 for Speech collage: code-switched audio generation by collaging monolingual corpora

Figure 4 for Speech collage: code-switched audio generation by collaging monolingual corpora

Abstract:Designing effective automatic speech recognition (ASR) systems for Code-Switching (CS) often depends on the availability of the transcribed CS resources. To address data scarcity, this paper introduces Speech Collage, a method that synthesizes CS data from monolingual corpora by splicing audio segments. We further improve the smoothness quality of audio generation using an overlap-add approach. We investigate the impact of generated data on speech recognition in two scenarios: using in-domain CS text and a zero-shot approach with synthesized CS text. Empirical results highlight up to 34.4% and 16.2% relative reductions in Mixed-Error Rate and Word-Error Rate for in-domain and zero-shot scenarios, respectively. Lastly, we demonstrate that CS augmentation bolsters the model's code-switching inclination and reduces its monolingual bias.

Via

Access Paper or Ask Questions

Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition

Nov 22, 2022

Injy Hamed, Amir Hussein, Oumnia Chellah, Shammur Chowdhury, Hamdy Mubarak, Sunayana Sitaram, Nizar Habash, Ahmed Ali

Abstract:Code-switching poses a number of challenges and opportunities for multilingual automatic speech recognition. In this paper, we focus on the question of robust and fair evaluation metrics. To that end, we develop a reference benchmark data set of code-switching speech recognition hypotheses with human judgments. We define clear guidelines for minimal editing of automatic hypotheses. We validate the guidelines using 4-way inter-annotator agreement. We evaluate a large number of metrics in terms of correlation with human judgments. The metrics we consider vary in terms of representation (orthographic, phonological, semantic), directness (intrinsic vs extrinsic), granularity (e.g. word, character), and similarity computation method. The highest correlation to human judgment is achieved using transliteration followed by text normalization. We release the first corpus for human acceptance of code-switching speech recognition results in dialectal Arabic/English conversation speech.

* Accepted to SLT 2022

Via

Access Paper or Ask Questions

Code-Switching Text Augmentation for Multilingual Speech Processing

Jan 07, 2022

Amir Hussein, Shammur Absar Chowdhury, Ahmed Abdelali, Najim Dehak, Ahmed Ali

Figure 1 for Code-Switching Text Augmentation for Multilingual Speech Processing

Figure 2 for Code-Switching Text Augmentation for Multilingual Speech Processing

Figure 3 for Code-Switching Text Augmentation for Multilingual Speech Processing

Figure 4 for Code-Switching Text Augmentation for Multilingual Speech Processing

Abstract:The pervasiveness of intra-utterance Code-switching (CS) in spoken content has enforced ASR systems to handle mixed input. Yet, designing a CS-ASR has many challenges, mainly due to the data scarcity, grammatical structure complexity, and mismatch along with unbalanced language usage distribution. Recent ASR studies showed the predominance of E2E-ASR using multilingual data to handle CS phenomena with little CS data. However, the dependency on the CS data still remains. In this work, we propose a methodology to augment the monolingual data for artificially generating spoken CS text to improve different speech modules. We based our approach on Equivalence Constraint theory while exploiting aligned translation pairs, to generate grammatically valid CS content. Our empirical results show a relative gain of 29-34 % in perplexity and around 2% in WER for two ecological and noisy CS test sets. Finally, the human evaluation suggests that 83.8% of the generated data is acceptable to humans.

Via

Access Paper or Ask Questions

Arabic Code-Switching Speech Recognition using Monolingual Data

Jul 04, 2021

Ahmed Ali, Shammur Chowdhury, Amir Hussein, Yasser Hifny

Figure 1 for Arabic Code-Switching Speech Recognition using Monolingual Data

Figure 2 for Arabic Code-Switching Speech Recognition using Monolingual Data

Figure 3 for Arabic Code-Switching Speech Recognition using Monolingual Data

Figure 4 for Arabic Code-Switching Speech Recognition using Monolingual Data

Abstract:Code-switching in automatic speech recognition (ASR) is an important challenge due to globalization. Recent research in multilingual ASR shows potential improvement over monolingual systems. We study key issues related to multilingual modeling for ASR through a series of large-scale ASR experiments. Our innovative framework deploys a multi-graph approach in the weighted finite state transducers (WFST) framework. We compare our WFST decoding strategies with a transformer sequence to sequence system trained on the same data. Given a code-switching scenario between Arabic and English languages, our results show that the WFST decoding approaches were more suitable for the intersentential code-switching datasets. In addition, the transformer system performed better for intrasentential code-switching task. With this study, we release an artificially generated development and test sets, along with ecological code-switching test set, to benchmark the ASR performance.

* Accepted in Interspeech 2021, speech recognition, code-switching, ASR, transformer, WFST, graph approach

Via

Access Paper or Ask Questions

QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus

Jun 24, 2021

Hamdy Mubarak, Amir Hussein, Shammur Absar Chowdhury, Ahmed Ali

Figure 1 for QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus

Figure 2 for QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus

Figure 3 for QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus

Figure 4 for QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus

Abstract:We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain. This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16kHz crawled from Aljazeera news channel. The dataset is released with lightly supervised transcriptions, aligned with the audio segments. Unlike previous datasets, QASR contains linguistically motivated segmentation, punctuation, speaker information among others. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics- based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other NLP modules for spoken data. In addition to QASR transcription, we release a dataset of 130M words to aid in designing and training a better language model. We show that end-to-end automatic speech recognition trained on QASR reports a competitive word error rate compared to the previous MGB-2 corpus. We report baseline results for downstream natural language processing tasks such as named entity recognition using speech transcript. We also report the first baseline for Arabic punctuation restoration. We make the corpus available for the research community.

* Speech Corpus, Spoken Conversation, ASR, Dialect Identification, Punctuation Restoration, Speaker Verification, NER, Named Entity, Arabic, Speaker gender, Turn-taking Accepted in ACL 2021

Via

Access Paper or Ask Questions

KARI: KAnari/QCRI's End-to-End systems for the INTERSPEECH 2021 Indian Languages Code-Switching Challenge

Jun 10, 2021

Amir Hussein, Shammur Chowdhury, Ahmed Ali

Figure 1 for KARI: KAnari/QCRI's End-to-End systems for the INTERSPEECH 2021 Indian Languages Code-Switching Challenge

Figure 2 for KARI: KAnari/QCRI's End-to-End systems for the INTERSPEECH 2021 Indian Languages Code-Switching Challenge

Figure 3 for KARI: KAnari/QCRI's End-to-End systems for the INTERSPEECH 2021 Indian Languages Code-Switching Challenge

Figure 4 for KARI: KAnari/QCRI's End-to-End systems for the INTERSPEECH 2021 Indian Languages Code-Switching Challenge

Abstract:In this paper, we present the Kanari/QCRI (KARI) system and the modeling strategies used to participate in the Interspeech 2021 Code-switching (CS) challenge for low-resource Indian languages. The subtask involved developing a speech recognition system for two CS datasets: Hindi-English and Bengali-English, collected in a real-life scenario. To tackle the CS challenges, we use transfer learning for incorporating the publicly available monolingual Hindi, Bengali, and English speech data. In this work, we study the effectiveness of two steps transfer learning protocol for low-resourced CS data: monolingual pretraining, followed by fine-tuning. For acoustic modeling, we develop an end-to-end convolution-augmented transformer (Conformer). We show that selecting the percentage of each monolingual data affects model biases towards using one language character set over the other in a CS scenario. The models pretrained on well-aligned and accurate monolingual data showed robustness against misalignment between the segments and the transcription. Finally, we develop word-level n-gram language models (LM) to rescore ASR recognition.

Via

Access Paper or Ask Questions

Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR

May 31, 2021

Shammur Absar Chowdhury, Amir Hussein, Ahmed Abdelali, Ahmed Ali

Figure 1 for Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR

Figure 2 for Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR

Figure 3 for Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR

Figure 4 for Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR

Abstract:With the advent of globalization, there is an increasing demand for multilingual automatic speech recognition (ASR), handling language and dialectal variation of spoken content. Recent studies show its efficacy over monolingual systems. In this study, we design a large multilingual end-to-end ASR using self-attention based conformer architecture. We trained the system using Arabic (Ar), English (En) and French (Fr) languages. We evaluate the system performance handling: (i) monolingual (Ar, En and Fr); (ii) multi-dialectal (Modern Standard Arabic, along with dialectal variation such as Egyptian and Moroccan); (iii) code-switching -- cross-lingual (Ar-En/Fr) and dialectal (MSA-Egyptian dialect) test cases, and compare with current state-of-the-art systems. Furthermore, we investigate the influence of different embedding/character representations including character vs word-piece; shared vs distinct input symbol per language. Our findings demonstrate the strength of such a model by outperforming state-of-the-art monolingual dialectal Arabic and code-switching Arabic ASR.

* Submitted to INTERSPEECH 2021, Multilingual ASR, Multi-dialectal ASR, Code-Switching ASR, Arabic ASR, Conformer, Transformer, E2E ASR, Speech Recognition, ASR, Arabic, English, French

Via

Access Paper or Ask Questions

Arabic Speech Recognition by End-to-End, Modular Systems and Human

Jan 21, 2021

Amir Hussein, Shinji Watanabe, Ahmed Ali

Figure 1 for Arabic Speech Recognition by End-to-End, Modular Systems and Human

Figure 2 for Arabic Speech Recognition by End-to-End, Modular Systems and Human

Figure 3 for Arabic Speech Recognition by End-to-End, Modular Systems and Human

Figure 4 for Arabic Speech Recognition by End-to-End, Modular Systems and Human

Abstract:Recent advances in automatic speech recognition (ASR) have achieved accuracy levels comparable to human transcribers, which led researchers to debate if the machine has reached human performance. Previous work focused on the English language and modular hidden Markov model-deep neural network (HMM-DNN) systems. In this paper, we perform a comprehensive benchmarking for end-to-end transformer ASR, modular HMM-DNN ASR, and human speech recognition (HSR) on the Arabic language and its dialects. For the HSR, we evaluate linguist performance and lay-native speaker performance on a new dataset collected as a part of this study. For ASR the end-to-end work led to 12.5%, 27.5%, 33.8% WER; a new performance milestone for the MGB2, MGB3, and MGB5 challenges respectively. Our results suggest that human performance in the Arabic language is still considerably better than the machine with an absolute WER gap of 3.6% on average.

Via

Access Paper or Ask Questions