Abstract: Retrieval is a widely adopted approach for improving language models by leveraging external information. As the field moves towards multi-modal large language models, it is important to extend purely text-based retrieval methods to incorporate other modalities as well, for applications across the wide spectrum of machine learning tasks and data types. In this work, we propose multi-modal retrieval with two approaches: kNN-LM and cross-attention techniques. We demonstrate the effectiveness of our retrieval approaches empirically by applying them to automatic speech recognition tasks with access to external information. Under this setting, we show that speech-based multi-modal retrieval outperforms text-based retrieval, and yields up to a 50% improvement in word error rate over the multi-modal language model baseline. Furthermore, we achieve state-of-the-art recognition results on the Spoken-SQuAD question answering dataset.
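To make the kNN-LM side of the abstract concrete, below is a minimal sketch of the standard kNN-LM interpolation step: the model's next-token distribution is mixed with a distribution built from nearest neighbours retrieved from an external datastore of (context embedding, next token) pairs. This is an illustrative sketch under assumed names and a fixed interpolation weight, not the paper's implementation; in the multi-modal setting the query embedding could come from either a text or a speech encoder.

```python
# Hedged kNN-LM sketch: all function and variable names are illustrative.
import numpy as np

def knn_lm_probs(lm_probs, query, keys, values, vocab_size,
                 k=8, temperature=1.0, lam=0.25):
    """Interpolate model probabilities with a k-nearest-neighbour distribution.

    lm_probs : (vocab_size,) next-token distribution from the language model
    query    : (d,) embedding of the current context (text or speech)
    keys     : (N, d) datastore context embeddings built from external data
    values   : (N,) next-token ids stored alongside each key
    """
    # Squared L2 distance between the query and every datastore key.
    dists = np.sum((keys - query) ** 2, axis=1)
    nn = np.argsort(dists)[:k]

    # Turn negative distances into a distribution over the retrieved tokens.
    weights = np.exp(-dists[nn] / temperature)
    weights /= weights.sum()

    knn_probs = np.zeros(vocab_size)
    np.add.at(knn_probs, values[nn], weights)

    # Final distribution: fixed-weight interpolation of LM and kNN probabilities.
    return (1.0 - lam) * lm_probs + lam * knn_probs
```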
Abstract: End-to-End (E2E) automatic speech recognition (ASR) systems used in voice assistants often have difficulties recognizing infrequent words personalized to the user, such as names and places. Rare words often have non-trivial pronunciations, and in such cases, human knowledge in the form of a pronunciation lexicon can be useful. We propose a PROnunCiation-aware conTextual adaptER (PROCTER) that dynamically injects lexicon knowledge into an RNN-T model by adding a phonemic embedding along with a textual embedding. The experimental results show that the proposed PROCTER architecture outperforms the baseline RNN-T model by improving the word error rate (WER) by 44% and 57% when measured on personalized entities and personalized rare entities, respectively, while increasing the model size (number of trainable parameters) by only 1%. Furthermore, when evaluated in a zero-shot setting to recognize personalized device names, we observe a 7% WER improvement with PROCTER, compared to only a 1% WER improvement with text-only contextual attention.
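As a rough illustration of the adapter idea described above (a contextual biasing module that combines a phonemic embedding with a textual embedding for each catalog entry), here is a minimal PyTorch sketch. The module names, dimensions, fusion by summation, and mean-pooling are assumptions for illustration only, not the PROCTER architecture as published.

```python
# Hedged sketch of a pronunciation-aware contextual adapter; all names,
# shapes, and design choices below are illustrative assumptions.
import torch
import torch.nn as nn

class PronunciationAwareAdapter(nn.Module):
    def __init__(self, enc_dim=512, ctx_dim=256, vocab_size=1000, phone_vocab=128):
        super().__init__()
        # Separate embeddings for the written form and the lexicon pronunciation.
        self.text_emb = nn.Embedding(vocab_size, ctx_dim)
        self.phone_emb = nn.Embedding(phone_vocab, ctx_dim)
        # Cross-attention: acoustic encoder frames query the catalog entries.
        self.attn = nn.MultiheadAttention(enc_dim, num_heads=4,
                                          kdim=ctx_dim, vdim=ctx_dim,
                                          batch_first=True)

    def forward(self, enc_out, text_ids, phone_ids):
        """enc_out: (B, T, enc_dim) RNN-T encoder output.
        text_ids / phone_ids: (B, E, L) grapheme / phoneme ids per catalog entry."""
        # Pool token ids into one vector per entry, then fuse text + pronunciation.
        text_vec = self.text_emb(text_ids).mean(dim=2)      # (B, E, ctx_dim)
        phone_vec = self.phone_emb(phone_ids).mean(dim=2)   # (B, E, ctx_dim)
        entries = text_vec + phone_vec                       # phonemic + textual embedding

        # Per-frame biasing vector, added back to the encoder output.
        bias, _ = self.attn(enc_out, entries, entries)
        return enc_out + bias
```

The intent of the sketch is simply to show where the phonemic embedding enters: each personalized entity contributes both a grapheme-based and a phoneme-based vector, and the fused entry representations bias the acoustic encoder through attention.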