Jinru Su

Contextual Adapters for Personalized Speech Recognition in Neural Transducers

May 26, 2022

Kanthashree Mysore Sathyendra, Thejaswi Muniyappa, Feng-Ju Chang, Jing Liu, Jinru Su, Grant P. Strimel, Athanasios Mouchtaris, Siegfried Kunzmann

Abstract: Personal rare word recognition in end-to-end Automatic Speech Recognition (E2E ASR) models is a challenge due to the lack of training data. A standard way to address this issue is with shallow fusion methods at inference time. However, due to their dependence on external language models and the deterministic approach to weight boosting, their performance is limited. In this paper, we propose training neural contextual adapters for personalization in neural-transducer-based ASR models. Our approach not only biases towards user-defined words but also has the flexibility to work with pretrained ASR models. Using an in-house dataset, we demonstrate that contextual adapters can be applied to any general-purpose pretrained ASR model to improve personalization. Our method outperforms shallow fusion while retaining the functionality of the pretrained models by not altering any of the model weights. We further show that adapter-style training is superior to full fine-tuning of the ASR models on datasets with user-defined content.

* Accepted at ICASSP 2022
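
The abstract does not spell out the adapter's internals, so the following is only a minimal sketch of one plausible way to bias a frozen model's output towards user-defined words: attention over a catalog of word embeddings, with the result added to the untouched encoder output. The function name `biasing_adapter`, the dot-product attention, the zero "no-bias" entry, and the absence of learned projections are all assumptions made for illustration, not the paper's design.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def biasing_adapter(encoder_frame, catalog_embeddings):
    """Hypothetical adapter: attend over catalog entries and add a bias vector.

    encoder_frame: output of the frozen (pretrained) encoder for one frame.
    catalog_embeddings: one embedding per user-defined word (assumed given).
    The pretrained model is never modified; only its output is offset.
    """
    # Assumed "no-bias" entry: a zero vector that lets the adapter abstain.
    entries = [[0.0] * len(encoder_frame)] + list(catalog_embeddings)
    weights = softmax([dot(encoder_frame, entry) for entry in entries])
    bias = [
        sum(w * entry[i] for w, entry in zip(weights, entries))
        for i in range(len(encoder_frame))
    ]
    adapted = [h + b for h, b in zip(encoder_frame, bias)]
    return adapted, weights


frame = [1.0, 0.0]
catalog = [[2.0, 0.0], [0.0, 2.0]]  # toy embeddings for two user-defined words
adapted, weights = biasing_adapter(frame, catalog)
```

With an empty catalog the only entry is the zero vector, so the adapted frame equals the original encoder output. This mirrors the abstract's point that the pretrained model's behaviour is retained when no personal content applies.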


Amortized Neural Networks for Low-Latency Speech Recognition

Aug 03, 2021

Jonathan Macoskey, Grant P. Strimel, Jinru Su, Ariya Rastrow

Abstract: We introduce Amortized Neural Networks (AmNets), a compute cost- and latency-aware network architecture particularly well-suited for sequence modeling tasks. We apply AmNets to the Recurrent Neural Network Transducer (RNN-T) to reduce compute cost and latency for an automatic speech recognition (ASR) task. The AmNets RNN-T architecture enables the network to dynamically switch between encoder branches on a frame-by-frame basis. Branches are constructed with variable levels of compute cost and model capacity. Here, we achieve variable compute for two well-known candidate techniques: one using sparse pruning and the other using matrix factorization. Frame-by-frame switching is determined by an arbitrator network that requires negligible compute overhead. We present results using both architectures on LibriSpeech data and show that our proposed architecture can reduce inference cost by up to 45% and latency to nearly real-time without incurring a loss in accuracy.

* Accepted at Interspeech 2021
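
The frame-by-frame switching described in the abstract can be illustrated with a toy sketch. It assumes a single layer with two branches (a full weight matrix and a low-rank factorized one) and a tiny linear arbitrator that picks a branch per frame by comparing two scores. The class name `AmortizedLayer`, the random weights, and the hard score comparison are illustrative assumptions; the abstract does not specify how the arbitrator is built or trained.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def make_matrix(rows, cols, rng):
    """Random matrix standing in for trained weights (assumption)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class AmortizedLayer:
    """Toy layer with an expensive branch, a cheap branch, and an arbitrator."""

    def __init__(self, dim, rank, seed=0):
        rng = random.Random(seed)
        self.dim, self.rank = dim, rank
        self.full = make_matrix(dim, dim, rng)      # expensive: dim * dim multiplies
        self.down = make_matrix(rank, dim, rng)     # cheap: factorized as up @ down
        self.up = make_matrix(dim, rank, rng)
        self.arbitrator = make_matrix(2, dim, rng)  # one score per branch

    def forward(self, frame):
        # The arbitrator costs only 2 * dim multiplies per frame.
        scores = matvec(self.arbitrator, frame)
        branch = 0 if scores[0] >= scores[1] else 1
        if branch == 0:
            output = matvec(self.full, frame)
            cost = self.dim * self.dim
        else:
            output = matvec(self.up, matvec(self.down, frame))
            cost = 2 * self.dim * self.rank
        return output, branch, cost


def run(frames, layer):
    """Process frames one by one, tracking the branch and multiply count."""
    outputs, branches, total_cost = [], [], 0
    for frame in frames:
        output, branch, cost = layer.forward(frame)
        outputs.append(output)
        branches.append(branch)
        total_cost += cost
    return outputs, branches, total_cost


rng = random.Random(1)
layer = AmortizedLayer(dim=8, rank=2)
frames = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(5)]
outputs, branches, total_cost = run(frames, layer)
```

In this toy setting the full branch costs 64 multiplies per frame and the factorized branch 32, so every frame routed to the cheap branch halves that frame's cost. The sparse-pruning variant mentioned in the abstract is not shown.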
