Sepand Mavandadi

A Deliberation-based Joint Acoustic and Text Decoder

Mar 23, 2023

Sepand Mavandadi, Tara N. Sainath, Ke Hu, Zelin Wu

Abstract: We propose a new two-pass E2E speech recognition model that improves ASR performance by training on a combination of paired data and unpaired text data. Previously, the joint acoustic and text decoder (JATD) has shown promising results through the use of text data during model training, and the recently introduced deliberation architecture has reduced recognition errors by leveraging first-pass decoding results. Our method, dubbed Deliberation-JATD, combines the spelling-correcting abilities of deliberation with JATD's use of unpaired text data to further improve performance. The proposed model produces substantial gains across multiple test sets, especially those focused on rare words, where it reduces word error rate (WER) by between 12% and 22.5% relative. This is done without increasing model size or requiring multi-stage training, making Deliberation-JATD an efficient candidate for on-device applications.

* Interspeech 2021
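
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how word error rate and a relative WER reduction are commonly computed. The function names and the example sentences are illustrative assumptions, not taken from the paper, and the paper's own scoring pipeline may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: one substitution in a four-word reference.
example_wer = word_error_rate("play some jazz music", "play some chess music")
```

Under this definition, a drop from a WER of 10.0 to 8.0 is a 20% relative reduction, which is the sense in which the abstract reports its 12% to 22.5% gains.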

Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification

Sep 13, 2022

Chao Zhang, Bo Li, Tara Sainath, Trevor Strohman, Sepand Mavandadi, Shuo-yiin Chang, Parisa Haghani

Abstract: Language identification is critical for many downstream tasks in automatic speech recognition (ASR), and is beneficial to integrate into multilingual end-to-end ASR as an additional task. In this paper, we propose to modify the structure of the cascaded-encoder-based recurrent neural network transducer (RNN-T) model by integrating a per-frame language identifier (LID) predictor. RNN-T with cascaded encoders can achieve streaming ASR with low latency using first-pass decoding with no right-context, and achieve lower word error rates (WERs) using second-pass decoding with longer right-context. By leveraging such differences in the right-contexts and a streaming implementation of statistics pooling, the proposed method can achieve accurate streaming LID prediction with little extra test-time cost. Experimental results on a voice search dataset with 9 language locales show that the proposed method achieves an average of 96.2% LID prediction accuracy and the same second-pass WER as that obtained by including oracle LID in the input.
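
The abstract does not spell out its streaming statistics pooling, so the following is only a sketch of one common way to do it: keep running sums per feature dimension and emit the mean and standard deviation of all frames seen so far at every step. The class name `StreamingStatsPooling` and the two-dimensional toy frames are assumptions made for illustration.

```python
import math


class StreamingStatsPooling:
    """Running mean and standard deviation over the frames seen so far."""

    def __init__(self, dim: int) -> None:
        self.count = 0
        self.total = [0.0] * dim      # running sum of each feature
        self.total_sq = [0.0] * dim   # running sum of squared features

    def update(self, frame: list[float]) -> list[float]:
        """Consume one frame and return [mean..., std...] up to this frame."""
        self.count += 1
        for k, value in enumerate(frame):
            self.total[k] += value
            self.total_sq[k] += value * value
        mean = [s / self.count for s in self.total]
        # Clamp at zero to guard against tiny negative values from rounding.
        var = [max(sq / self.count - m * m, 0.0)
               for sq, m in zip(self.total_sq, mean)]
        return mean + [math.sqrt(v) for v in var]


pooler = StreamingStatsPooling(dim=2)
first = pooler.update([1.0, 2.0])
second = pooler.update([3.0, 4.0])
```

Because each update touches only the running sums, the per-frame cost is constant in the utterance length, which is consistent with the abstract's claim of little extra test-time cost.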

Improving Deliberation by Text-Only and Semi-Supervised Training

Jun 29, 2022

Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, Weiran Wang

Abstract: Text-only training and semi-supervised training based on audio-only data have gained popularity recently due to the wide availability of unlabeled text and speech data. In this work, we propose incorporating text-only and semi-supervised training into an attention-based deliberation model. By using text-only data to train a bidirectional encoder representations from transformers (BERT) model for the deliberation text encoder, and by incorporating large-scale text-to-speech and audio-only utterances through the joint acoustic and text decoder (JATD) and semi-supervised training, we achieved 4%-12% WER reduction for various tasks compared to the baseline deliberation. Compared to a state-of-the-art language model (LM) rescoring method, the deliberation model reduces the Google Voice Search WER by 11% relative. We show that the deliberation model also achieves a positive human side-by-side evaluation compared to the state-of-the-art LM rescorer with reasonable endpointer latencies.

* Accepted by Interspeech 2022

Improving Rare Word Recognition with LM-aware MWER Training

Apr 15, 2022

Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prabhavalkar, Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser(+3 more)

Abstract: Language models (LMs) significantly improve the recognition accuracy of end-to-end (E2E) models on words rarely seen during training, when used in either the shallow fusion or the rescoring setups. In this work, we introduce LMs in the learning of hybrid autoregressive transducer (HAT) models in the discriminative training framework, to mitigate the training versus inference gap regarding the use of LMs. For the shallow fusion setup, we use LMs during both hypothesis generation and loss computation, and the LM-aware MWER-trained model achieves 10% relative improvement over the model trained with standard MWER on voice search test sets containing rare words. For the rescoring setup, we learn a small neural module to generate per-token fusion weights in a data-dependent manner. This model achieves the same rescoring WER as a regular MWER-trained model, but without the need for sweeping fusion weights.

* In submission to INTERSPEECH 2022
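
As a rough illustration of the idea in the abstract above, the sketch below computes an MWER-style loss over an n-best list in which the hypothesis scores already include a weighted LM term. It follows the commonly used form of MWER (expected word errors under probabilities renormalized over the n-best list, with the mean error subtracted as a baseline); the function name, the fixed `lm_weight`, and the toy numbers are assumptions, not the paper's implementation.

```python
import math


def lm_aware_mwer_loss(asr_log_probs, lm_log_probs, word_errors, lm_weight):
    """Expected relative word errors over an n-best list with fused scores."""
    # Fuse ASR and LM scores so the loss sees the same ranking as inference.
    fused = [a + lm_weight * l for a, l in zip(asr_log_probs, lm_log_probs)]
    # Softmax over the n-best list, shifted by the max for numerical stability.
    top = max(fused)
    weights = [math.exp(score - top) for score in fused]
    norm = sum(weights)
    probs = [w / norm for w in weights]
    # Subtract the mean error so hypotheses better than average lower the loss.
    mean_errors = sum(word_errors) / len(word_errors)
    return sum(p * (e - mean_errors) for p, e in zip(probs, word_errors))


# Toy 2-best list: the first hypothesis has no errors, the second has two.
loss_without_lm = lm_aware_mwer_loss(
    [math.log(0.5), math.log(0.5)], [0.0, 0.0], [0, 2], lm_weight=0.0)
loss_with_lm = lm_aware_mwer_loss(
    [math.log(0.5), math.log(0.5)],
    [math.log(0.75), math.log(0.25)], [0, 2], lm_weight=1.0)
```

In this toy case the LM shifts probability mass toward the error-free hypothesis, so the fused loss is lower than the ASR-only loss. In actual training the loss would be differentiated with respect to the model parameters, which this plain-Python sketch does not do.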

Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus

Aug 25, 2020

Cal Peyser, Sepand Mavandadi, Tara N. Sainath, James Apfel, Ruoming Pang, Shankar Kumar

Abstract: End-to-end (E2E) automatic speech recognition (ASR) systems lack the distinct language model (LM) component that characterizes traditional speech systems. While this simplifies the model architecture, it complicates the task of incorporating text-only data into training, which is important to the recognition of tail words that do not occur often in audio-text pairs. While shallow fusion has been proposed as a method for incorporating a pre-trained LM into an E2E model at inference time, it has not yet been explored for very large text corpora, and it has been shown to be very sensitive to hyperparameter settings in the beam search. In this work, we apply shallow fusion to incorporate a very large text corpus into a state-of-the-art E2E ASR model. We explore the impact of model size and show that intelligent pruning of the training set can be more effective than increasing the parameter count. Additionally, we show that incorporating the LM in minimum word error rate (MWER) fine-tuning makes shallow fusion far less dependent on optimal hyperparameter settings, reducing the difficulty of that tuning problem.
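
Shallow fusion, as generally described in the literature, adds a weighted LM log-probability to the E2E model's log-probability when ranking hypotheses. The sketch below applies that rule to whole hypotheses for simplicity; in a real decoder it is applied at each step of beam search. The candidate strings, scores, and the name `pick_best` are hypothetical and chosen only to show how the LM weight can change the winner.

```python
def shallow_fusion_score(asr_log_prob: float, lm_log_prob: float,
                         lm_weight: float) -> float:
    """Interpolate E2E and LM log-probabilities with a tunable weight."""
    return asr_log_prob + lm_weight * lm_log_prob


def pick_best(candidates, lm_weight: float) -> str:
    """Return the text of the candidate with the highest fused score.

    Each candidate is a (text, asr_log_prob, lm_log_prob) tuple.
    """
    best = max(candidates,
               key=lambda c: shallow_fusion_score(c[1], c[2], lm_weight))
    return best[0]


# Hypothetical 2-best list: the acoustically preferred hypothesis is
# implausible to the LM, while the runner-up contains a rare word.
candidates = [
    ("call so pond", -1.0, -6.0),
    ("call sepand", -1.2, -2.0),
]
without_lm = pick_best(candidates, lm_weight=0.0)
with_lm = pick_best(candidates, lm_weight=0.3)
```

The outcome flips as `lm_weight` moves from 0.0 to 0.3, which illustrates the kind of hyperparameter sensitivity the abstract refers to.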
