Abstract: Recent advances in machine learning have demonstrated that multi-modal pre-training can improve automatic speech recognition (ASR) performance compared to randomly initialized models, even when the models are fine-tuned on uni-modal tasks. Existing multi-modal pre-training methods for ASR have primarily focused on single-stage pre-training, in which a single unsupervised task is used for pre-training, followed by fine-tuning on the downstream task. In this work, we introduce a novel method that combines multi-modal and multi-task unsupervised pre-training with a translation-based supervised mid-training stage. We empirically demonstrate that this multi-stage approach yields relative word error rate (WER) improvements of up to 38.45% over baselines on both LibriSpeech and SUPERB. Additionally, we share several important findings for choosing pre-training methods and datasets.
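To make the multi-stage recipe concrete, here is a minimal PyTorch sketch of a pre-train / mid-train / fine-tune schedule over a shared encoder. The encoder, head sizes, and stand-in objectives (reconstruction, translation cross-entropy, ASR cross-entropy on dummy targets) are illustrative assumptions, not the paper's actual models or losses.

```python
# Hypothetical sketch of a pre-train -> mid-train -> fine-tune schedule.
# Stand-in losses only; the actual multi-modal/multi-task objectives differ.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    def __init__(self, dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True)

    def forward(self, x):
        out, _ = self.rnn(x)
        return out  # (batch, time, hidden)

encoder = ToyEncoder()
recon_head = nn.Linear(256, 80)     # unsupervised reconstruction head (assumed)
trans_head = nn.Linear(256, 1000)   # mid-training translation head (assumed vocab)
asr_head = nn.Linear(256, 500)      # downstream ASR head (assumed vocab)

def run_stage(heads, loss_fn, steps=10):
    params = list(encoder.parameters()) + [p for h in heads for p in h.parameters()]
    opt = torch.optim.Adam(params, lr=1e-4)
    for _ in range(steps):
        feats = torch.randn(8, 100, 80)   # dummy batch of acoustic features
        loss = loss_fn(encoder(feats), feats)
        opt.zero_grad(); loss.backward(); opt.step()

# Stage 1: unsupervised pre-training (reconstruction as a stand-in task).
run_stage([recon_head], lambda h, x: ((recon_head(h) - x) ** 2).mean())
# Stage 2: supervised "mid-training" (translation-style targets, random here).
run_stage([trans_head], lambda h, x: nn.functional.cross_entropy(
    trans_head(h).flatten(0, 1), torch.randint(0, 1000, (8 * 100,))))
# Stage 3: fine-tuning on the downstream ASR task (random targets here).
run_stage([asr_head], lambda h, x: nn.functional.cross_entropy(
    asr_head(h).flatten(0, 1), torch.randint(0, 500, (8 * 100,))))
```

Only the shared encoder carries over between stages; the stage-specific heads are discarded or replaced, which is the essence of the multi-stage setup described above.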
Abstract: We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms single-modality baseline models. We also develop a novel multi-task instruction fine-tuning strategy to further benefit from LLM-encoded knowledge for understanding the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combining LLMs and acoustic models for more natural and conversational interaction between humans and speech-enabled AI agents.
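A minimal sketch of the fusion idea, assuming late fusion of per-frame acoustic states with a single dialogue-context vector. The context vector here is random and merely stands in for an LLM's encoding of the conversation history, and the three output labels (hold / take-turn / backchannel) are an assumed label set, not necessarily the paper's.

```python
# Hypothetical late-fusion sketch: per-frame acoustic features are combined with a
# context vector standing in for an LLM encoding of the dialogue history.
import torch
import torch.nn as nn

class TurnTakingFusion(nn.Module):
    def __init__(self, feat_dim=80, llm_dim=512, hidden=256, n_classes=3):
        super().__init__()
        self.acoustic = nn.GRU(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden + llm_dim, n_classes)

    def forward(self, frames, llm_context):
        # frames: (batch, time, feat_dim); llm_context: (batch, llm_dim)
        acoustic, _ = self.acoustic(frames)
        context = llm_context.unsqueeze(1).expand(-1, acoustic.size(1), -1)
        fused = torch.cat([acoustic, context], dim=-1)
        # Per-frame logits over assumed labels: hold / take-turn / backchannel.
        return self.classifier(fused)

model = TurnTakingFusion()
frames = torch.randn(4, 200, 80)     # dummy acoustic features
llm_context = torch.randn(4, 512)    # in practice: an LLM state over the transcript so far
logits = model(frames, llm_context)  # (4, 200, 3)
```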
Abstract: There has been a significant rise in the use of Community Question Answering sites (CQAs) over the last decade, owing primarily to their ability to leverage the wisdom of the crowd. Duplicate questions have a crippling effect on the quality of these sites. Tackling duplicate questions is therefore an important step towards improving the quality of CQAs. In this regard, we propose two neural-network-based architectures for duplicate question detection on Stack Overflow. We also propose explicitly modeling the code present in questions to achieve results that surpass the state of the art.
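As one way to picture "explicitly modeling the code", here is a hedged sketch of a siamese-style classifier in which each question's natural-language tokens and code tokens are encoded by separate bag-of-embeddings encoders before the pair is scored. The encoders, dimensions, and pair features are illustrative assumptions rather than the proposed architectures.

```python
# Hypothetical siamese duplicate-question sketch with separate text and code encoders.
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    def __init__(self, vocab=30000, dim=128):
        super().__init__()
        self.text_emb = nn.EmbeddingBag(vocab, dim)  # bag-of-words text encoder (stand-in)
        self.code_emb = nn.EmbeddingBag(vocab, dim)  # separate encoder for code tokens

    def forward(self, text_ids, code_ids):
        return torch.cat([self.text_emb(text_ids), self.code_emb(code_ids)], dim=-1)

class DuplicateClassifier(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.encoder = QuestionEncoder(dim=dim)
        self.scorer = nn.Linear(4 * dim, 1)

    def forward(self, q1, q2):
        e1, e2 = self.encoder(*q1), self.encoder(*q2)
        pair = torch.cat([e1, e2], dim=-1)    # could also use |e1 - e2|, e1 * e2, etc.
        return self.scorer(pair).squeeze(-1)  # duplicate logit per pair

model = DuplicateClassifier()
q1 = (torch.randint(0, 30000, (2, 40)), torch.randint(0, 30000, (2, 60)))
q2 = (torch.randint(0, 30000, (2, 35)), torch.randint(0, 30000, (2, 50)))
print(model(q1, q2).shape)  # torch.Size([2])
```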
Abstract: As with other forms of AI, speech recognition has recently been examined with respect to performance disparities across different user cohorts. One approach to achieving fairness in speech recognition is to (1) identify speaker cohorts that suffer from subpar performance and (2) apply fairness mitigation measures targeting the cohorts discovered. In this paper, we report initial findings on both the discovery and the mitigation of performance disparities, using data from a product-scale AI assistant speech recognition system. We compare cohort discovery based on geographic and demographic information to a more scalable method that groups speakers without human labels, using speaker embedding technology. For fairness mitigation, we find that oversampling underrepresented cohorts, as well as modeling speaker cohort membership through additional input variables, reduces the gap between top- and bottom-performing cohorts without degrading overall recognition accuracy.
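A small sketch of the label-free cohort workflow described above, assuming k-means over speaker embeddings, per-cluster WER aggregation, and naive oversampling of the worst cluster. The embedding dimension, number of clusters, and oversampling factor are arbitrary placeholders.

```python
# Hypothetical sketch of unsupervised cohort discovery: cluster speaker embeddings,
# compare per-cluster error rates, and oversample utterances from the worst cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 192))          # speaker embeddings; dummy values here
utterance_wer = rng.uniform(0.0, 0.3, size=1000)   # per-utterance error rates; dummy values here

# 1) Discover cohorts without human labels.
cohorts = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(embeddings)

# 2) Measure the gap between top- and bottom-performing cohorts.
cohort_wer = np.array([utterance_wer[cohorts == c].mean() for c in range(8)])
worst = int(cohort_wer.argmax())
print(f"WER gap: {cohort_wer.max() - cohort_wer.min():.3f}, worst cohort: {worst}")

# 3) Mitigation stand-in: oversample the underperforming cohort in the training set.
worst_idx = np.flatnonzero(cohorts == worst)
oversampled = np.concatenate([np.arange(1000), rng.choice(worst_idx, size=2 * len(worst_idx))])
```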
Abstract: We propose an end-to-end trained spoken language understanding (SLU) system that extracts transcripts, intents, and slots from an input speech utterance. It consists of a streaming recurrent neural network transducer (RNNT) based automatic speech recognition (ASR) model connected to a neural natural language understanding (NLU) model through a neural interface. This interface allows for end-to-end training using multi-task RNNT and NLU losses. Additionally, we introduce semantic sequence loss training for the joint RNNT-NLU system, which allows direct optimization of non-differentiable SLU metrics. This end-to-end SLU model paradigm can leverage state-of-the-art advancements and pre-trained models from both the ASR and NLU research communities, outperforming recently proposed direct speech-to-semantics models and conventional pipelined ASR and NLU systems. We show that this method improves both ASR and NLU metrics on public SLU datasets and on large proprietary datasets.
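A minimal sketch of the joint layout, with an ASR encoder feeding an NLU branch through a learned neural interface and a summed multi-task loss. CTC stands in for the RNNT branch purely to keep the example short, and the dimensions, heads, and intent-only NLU output are simplifying assumptions.

```python
# Hypothetical sketch of the joint ASR -> interface -> NLU layout with a multi-task loss.
# CTC is used as a stand-in for the RNNT branch to keep the example compact.
import torch
import torch.nn as nn

class JointSLU(nn.Module):
    def __init__(self, feat=80, hidden=256, vocab=100, n_intents=20):
        super().__init__()
        self.asr_encoder = nn.GRU(feat, hidden, batch_first=True)
        self.asr_head = nn.Linear(hidden, vocab)    # token logits (an RNNT joint network in the paper)
        self.interface = nn.Linear(hidden, hidden)  # neural interface instead of text
        self.nlu = nn.GRU(hidden, hidden, batch_first=True)
        self.intent_head = nn.Linear(hidden, n_intents)

    def forward(self, feats):
        enc, _ = self.asr_encoder(feats)
        token_logits = self.asr_head(enc)
        nlu_out, _ = self.nlu(torch.relu(self.interface(enc)))
        intent_logits = self.intent_head(nlu_out[:, -1])  # utterance-level intent
        return token_logits, intent_logits

model = JointSLU()
feats = torch.randn(4, 120, 80)
targets = torch.randint(1, 100, (4, 20))
token_logits, intent_logits = model(feats)

ctc = nn.CTCLoss(blank=0)
asr_loss = ctc(token_logits.log_softmax(-1).transpose(0, 1), targets,
               torch.full((4,), 120, dtype=torch.long),
               torch.full((4,), 20, dtype=torch.long))
nlu_loss = nn.functional.cross_entropy(intent_logits, torch.randint(0, 20, (4,)))
loss = asr_loss + nlu_loss  # multi-task objective; loss weights omitted
```

Because the interface passes hidden states rather than text, gradients from the NLU loss reach the ASR encoder, which is what enables the end-to-end multi-task training described above.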
Abstract: Spoken language understanding (SLU) systems extract transcriptions, as well as semantics such as intents or named entities, from speech, and are essential components of voice-activated systems. SLU models, which either extract semantics directly from audio or are composed of pipelined automatic speech recognition (ASR) and natural language understanding (NLU) models, are typically trained via differentiable cross-entropy losses, even when the relevant performance metrics of interest are word or semantic error rates. In this work, we propose non-differentiable sequence losses based on SLU metrics as a proxy for semantic error and use the REINFORCE trick to train ASR and SLU models with this loss. We show that custom sequence loss training achieves state-of-the-art results on open SLU datasets and leads to 6% relative improvement in both ASR and NLU performance metrics on large proprietary datasets. We also demonstrate how the semantic sequence loss training paradigm can be used to update ASR and SLU models without transcripts, using semantic feedback alone.
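The REINFORCE idea can be illustrated in a few lines: sample a hypothesis from the model, score it with a non-differentiable metric, and weight the hypothesis log-probability by the baseline-subtracted reward. The toy token-match reward and the single-linear-layer "decoder" below are stand-ins, not the SLU metrics or models used in the work.

```python
# Hypothetical REINFORCE sketch for a non-differentiable sequence metric.
import torch
import torch.nn as nn

vocab, seq_len, batch = 50, 12, 4
policy = nn.Linear(64, vocab)                    # stand-in for an ASR/SLU decoder
state = torch.randn(batch, seq_len, 64)          # dummy decoder states
reference = torch.randint(0, vocab, (batch, seq_len))

dist = torch.distributions.Categorical(logits=policy(state))
hypothesis = dist.sample()                       # (batch, seq_len); sampling is non-differentiable
log_prob = dist.log_prob(hypothesis).sum(dim=1)  # log p(hypothesis)

# Non-differentiable "semantic" reward: fraction of tokens matching the reference.
reward = (hypothesis == reference).float().mean(dim=1)
baseline = reward.mean()                         # simple variance-reduction baseline

loss = -((reward - baseline).detach() * log_prob).mean()
loss.backward()                                  # gradients flow only through log_prob
```

Since the reward enters only as a detached scalar weight, any metric (word error, semantic error, or downstream feedback without transcripts) can be plugged in, which is the property the transcript-free update in the abstract relies on.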
Abstract: We consider the spoken language understanding (SLU) problem of extracting natural language intents and associated slot arguments or named entities from speech directed primarily at voice assistants. Such a system subsumes both automatic speech recognition (ASR) and natural language understanding (NLU). An end-to-end joint SLU model can be built to a required specification, opening up the opportunity to deploy in hardware-constrained scenarios such as on-device voice assistants that work offline in a privacy-preserving manner, while also reducing server costs. We first present models that extract utterance intent directly from speech without intermediate text output. We then present a compositional model, which generates the transcript using the Listen Attend Spell ASR system and then extracts the interpretation using a neural NLU model. Finally, we contrast these methods with a jointly trained end-to-end SLU model, consisting of ASR and NLU subsystems connected by a neural-network-based interface instead of text, that produces transcripts as well as the NLU interpretation. We show that the jointly trained model improves ASR by incorporating semantic information from NLU, and also improves NLU by exposing it to the ASR confusion encoded in the hidden layer.
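For contrast with the joint model sketched earlier, here is a hedged sketch of the first variant, a direct speech-to-intent classifier that never produces a transcript. The pooling strategy, dimensions, and intent inventory are illustrative assumptions.

```python
# Hypothetical sketch of the "direct" variant: classify the utterance intent straight
# from acoustic features, with no intermediate transcript.
import torch
import torch.nn as nn

class DirectSpeechToIntent(nn.Module):
    def __init__(self, feat=80, hidden=256, n_intents=20):
        super().__init__()
        self.encoder = nn.GRU(feat, hidden, batch_first=True)
        self.intent_head = nn.Linear(hidden, n_intents)

    def forward(self, feats):
        enc, _ = self.encoder(feats)
        return self.intent_head(enc.mean(dim=1))  # pool over time, then classify

model = DirectSpeechToIntent()
logits = model(torch.randn(4, 150, 80))  # (4, 20) intent logits
```

The compositional and joint variants in the abstract add a transcript (or a neural interface in place of text) between the encoder and the NLU stage, which is the main architectural difference from this direct model.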