Abstract:Multilingual pretraining for transfer learning significantly boosts the robustness of low-resource monolingual ASR models. This study systematically investigates three main aspects: (a) the impact of transfer learning on model performance during initial training or fine-tuning, (b) the influence of transfer learning across dataset domains and languages, and (c) the effect on rare-word recognition compared to non-rare words. Our findings suggest that RNNT-loss pretraining, followed by monolingual fine-tuning with Minimum Word Error Rate (MinWER) loss, consistently reduces Word Error Rates (WER) across languages such as Italian and French. WER reductions (WERR) reach 36.2% and 42.8% over monolingual baselines on the MLS and in-house datasets, respectively. Out-of-domain pretraining leads to a 28% higher WERR than in-domain pretraining. Both rare and non-rare words benefit, with rare words showing greater improvements under out-of-domain pretraining, and non-rare words under in-domain pretraining.
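As a rough illustration of the two-stage recipe this abstract describes, the sketch below pairs an RNN-T pretraining step (using torchaudio's rnnt_loss) with an expected-word-error (MWER-style) fine-tuning objective computed over an N-best list. The model interface, tensor shapes, and the exact MWER formulation are assumptions for illustration only, not the authors' implementation.

```python
# Hypothetical sketch: (1) multilingual pretraining with the RNN-T loss,
# (2) monolingual fine-tuning with a minimum word error rate objective.
# Only the loss plumbing is shown; model, data, and decoding are placeholders.
import torch
import torchaudio

def rnnt_pretrain_step(model, feats, feat_lens, targets, target_lens, optimizer):
    """One multilingual pretraining step with the RNN-T loss.

    Assumed model API: returns joint-network logits of shape (B, T, U+1, V).
    targets, feat_lens, target_lens are expected as int32 tensors.
    """
    logits = model(feats, targets)
    loss = torchaudio.functional.rnnt_loss(
        logits, targets, feat_lens, target_lens, blank=0, reduction="mean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def mwer_loss(nbest_log_probs, nbest_word_errors):
    """Expected word error over an N-best list (one common MWER form).

    nbest_log_probs:   (B, N) hypothesis log-probabilities from the decoder
    nbest_word_errors: (B, N) edit distances of each hypothesis vs. the reference
    """
    probs = torch.softmax(nbest_log_probs, dim=-1)           # renormalize over the N-best
    avg_err = nbest_word_errors.mean(dim=-1, keepdim=True)   # variance-reducing baseline
    return (probs * (nbest_word_errors - avg_err)).sum(dim=-1).mean()
```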
Abstract:Inverse text normalization (ITN) is used to convert the spoken-form output of an automatic speech recognition (ASR) system to a written form. Traditional handcrafted ITN rules can be complex to transcribe and maintain. Meanwhile, neural modeling approaches require large-scale, high-quality spoken-written pair examples in the same or a similar domain as the ASR system (in-domain data) for training. Both approaches require costly and complex annotation. In this paper, we present a data augmentation technique that effectively generates rich spoken-written numeric pairs from out-of-domain textual data with minimal human annotation. We empirically demonstrate that an ITN model trained with our data augmentation technique consistently outperforms an ITN model trained using only in-domain data across all numeric surfaces, such as cardinal, currency, and fraction, with an overall accuracy improvement of 14.44%.
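A minimal sketch of the kind of spoken-written pair mining this abstract describes, assuming a toy rule-based verbalizer that only covers small cardinals; a real system would handle currency, fractions, and larger numbers with a fuller grammar.

```python
# Hypothetical sketch: written numeric surfaces found in out-of-domain text
# are expanded to spoken form, yielding (spoken, written) training pairs
# for an ITN model. Coverage here is intentionally tiny and illustrative.
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def verbalize_cardinal(n: int) -> str:
    """Spoken form for 0-99 (toy coverage only)."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def mine_pairs(text: str):
    """Yield (spoken, written) pairs for one- and two-digit cardinals in raw text."""
    for match in re.finditer(r"\b\d{1,2}\b", text):
        written = match.group()
        yield verbalize_cardinal(int(written)), written

print(list(mine_pairs("The meeting has 25 attendees and lasts 3 hours.")))
# [('twenty five', '25'), ('three', '3')]
```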
Abstract:Speech sounds of spoken language are produced by varying the configuration of the articulators surrounding the vocal tract. They contain abundant information that can be utilized to better understand the underlying mechanisms of human speech production. We propose a novel deep neural network-based learning framework that understands acoustic information in the variable-length sequence of vocal tract shaping during speech production, captured by real-time magnetic resonance imaging (rtMRI), and translates it into text. The proposed framework comprises spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification (CTC) loss, trained entirely end-to-end. On the USC-TIMIT corpus, the model achieved a sentence-level phoneme error rate (PER) of 40.6%, substantially better than existing models. To the best of our knowledge, this is the first study to demonstrate the recognition of entire spoken sentences based on an individual's articulatory motions captured by rtMRI video. We also analyzed variations in the geometry of articulation in each sub-region of the vocal tract (i.e., the pharyngeal, velar and dorsal, hard palate, and labial constriction regions) with respect to different emotions and genders. Results suggest that the distortion of each sub-region is affected by both emotion and gender.
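A minimal PyTorch sketch of an architecture in the spirit of the one described here (3-D convolutions over rtMRI frames, a recurrent layer, and CTC training); layer sizes, frame dimensions, and the phoneme inventory are placeholder assumptions rather than the authors' configuration.

```python
# Hypothetical sketch: spatiotemporal convolutions over rtMRI video frames,
# a bidirectional GRU, and CTC training on dummy data.
import torch
import torch.nn as nn

class ArticulatoryCTC(nn.Module):
    def __init__(self, n_phones: int = 40):
        super().__init__()
        self.conv = nn.Sequential(                       # input: (B, 1, T, H, W) video
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),          # keep time axis, pool space
        )
        self.rnn = nn.GRU(32 * 4 * 4, 128, bidirectional=True, batch_first=True)
        self.out = nn.Linear(256, n_phones + 1)          # +1 for the CTC blank symbol

    def forward(self, video):                            # video: (B, 1, T, H, W)
        x = self.conv(video)                             # (B, 32, T, 4, 4)
        x = x.permute(0, 2, 1, 3, 4).flatten(2)          # (B, T, 32*4*4)
        x, _ = self.rnn(x)                               # (B, T, 256)
        return self.out(x).log_softmax(-1)               # (B, T, n_phones+1)

# CTC training step on dummy data (2 clips of 50 frames, 68x68 pixels)
model = ArticulatoryCTC()
video = torch.randn(2, 1, 50, 68, 68)
log_probs = model(video).transpose(0, 1)                 # CTCLoss expects (T, B, C)
targets = torch.randint(1, 41, (2, 12))                  # phoneme labels, blank=0 excluded
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((2,), 50),
                           target_lengths=torch.full((2,), 12))
loss.backward()
```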