Ernest Pusateri

Retrieval Augmented Correction of Named Entity Speech Recognition Errors

Sep 09, 2024

Ernest Pusateri, Anmol Walia, Anirudh Kashi, Bortik Bandyopadhyay, Nadia Hyder, Sayantan Mahinder, Raviteja Anantha, Daben Liu, Sashank Gondala

Abstract: In recent years, end-to-end automatic speech recognition (ASR) systems have proven themselves remarkably accurate and performant, but these systems still have a significant error rate for entity names which appear infrequently in their training data. In parallel to the rise of end-to-end ASR systems, large language models (LLMs) have proven to be a versatile tool for various natural language processing (NLP) tasks. In NLP tasks where a database of relevant knowledge is available, retrieval augmented generation (RAG) has achieved impressive results when used with LLMs. In this work, we propose a RAG-like technique for correcting speech recognition entity name errors. Our approach uses a vector database to index a set of relevant entities. At runtime, database queries are generated from possibly errorful textual ASR hypotheses, and the entities retrieved using these queries are fed, along with the ASR hypotheses, to an LLM which has been adapted to correct ASR errors. Overall, our best system achieves 33%-39% relative word error rate reductions on synthetic test sets focused on voice assistant queries of rare music entities without regressing on the STOP test set, a publicly available voice assistant test set covering many domains.

* Submitted to ICASSP 2025
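
The retrieve-then-correct flow described in this abstract can be sketched roughly as follows. This is a toy illustration and not the authors' system: the character n-gram "embedding", the `ToyEntityIndex` class, the `build_prompt` helper, the entity list, and the example hypothesis are all hypothetical stand-ins, and the adapted LLM itself is not called.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Bag of character n-grams: a toy stand-in for a learned text embedding."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyEntityIndex:
    """Hypothetical in-memory replacement for the vector database."""

    def __init__(self, entities):
        self.entries = [(name, char_ngrams(name)) for name in entities]

    def retrieve(self, query, k=1):
        # Rank every indexed entity by similarity to the query and keep the top k.
        query_vec = char_ngrams(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
        return [name for name, _ in ranked[:k]]


def build_prompt(hypothesis, candidates):
    """Combine the ASR hypothesis and retrieved entities into LLM input text."""
    listed = "\n".join(f"- {name}" for name in candidates)
    return f"ASR hypothesis: {hypothesis}\nCandidate entities:\n{listed}\nCorrected transcript:"


index = ToyEntityIndex(["Khruangbin", "Bon Iver", "Tame Impala"])
hypothesis = "play bonny bear"  # made-up, possibly errorful ASR output
candidates = index.retrieve(hypothesis, k=1)
prompt = build_prompt(hypothesis, candidates)
```

Here the whole hypothesis is used as the query for simplicity; the abstract only says that queries are generated from the hypotheses, so the query construction is an assumption of this sketch.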

Personalization of CTC-based End-to-End Speech Recognition Using Pronunciation-Driven Subword Tokenization

Oct 16, 2023

Zhihong Lei, Ernest Pusateri, Shiyi Han, Leo Liu, Mingbin Xu, Tim Ng, Ruchir Travadi, Youyuan Zhang, Mirko Hannemann, Man-Hung Siu (+1 more)

Abstract: Recent advances in deep learning and automatic speech recognition have improved the accuracy of end-to-end speech recognition systems, but recognition of personal content such as contact names remains a challenge. In this work, we describe our personalization solution for an end-to-end speech recognition system based on connectionist temporal classification. Building on previous work, we present a novel method for generating additional subword tokenizations for personal entities from their pronunciations. We show that using this technique in combination with two established techniques, contextual biasing and wordpiece prior normalization, we are able to achieve personal named entity accuracy on par with a competitive hybrid system.

Acoustic Model Fusion for End-to-end Speech Recognition

Oct 10, 2023

Zhihong Lei, Mingbin Xu, Shiyi Han, Leo Liu, Zhen Huang, Tim Ng, Yuanyuan Zhang, Ernest Pusateri, Mirko Hannemann, Yaqiao Deng (+1 more)

Abstract: Recent advances in deep learning and automatic speech recognition (ASR) have enabled the end-to-end (E2E) ASR system and boosted the accuracy to a new level. The E2E systems implicitly model all conventional ASR components, such as the acoustic model (AM) and the language model (LM), in a single network trained on audio-text pairs. Despite this simpler system architecture, fusing a separate LM, trained exclusively on text corpora, into the E2E system has proven to be beneficial. However, the application of LM fusion presents certain drawbacks, such as its inability to address the domain mismatch issue inherent to the internal AM. Drawing inspiration from the concept of LM fusion, we propose the integration of an external AM into the E2E system to better address the domain mismatch. By implementing this novel approach, we have achieved a significant reduction in the word error rate, with an impressive drop of up to 14.3% across varied test sets. We also discovered that this AM fusion approach is particularly beneficial in enhancing named entity recognition.

Neural Language Model Pruning for Automatic Speech Recognition

Oct 05, 2023

Leonardo Emili, Thiago Fraga-Silva, Ernest Pusateri, Markus Nußbaum-Thom, Youssef Oualil

Abstract: We study model pruning methods applied to Transformer-based neural network language models for automatic speech recognition. We explore three aspects of the pruning framework, namely criterion, method and scheduler, analyzing their contribution in terms of accuracy and inference speed. To the best of our knowledge, such in-depth analyses on large-scale recognition systems have not been reported in the literature. In addition, we propose a variant of low-rank approximation suitable for incrementally compressing models, and delivering multiple models with varied target sizes. Among other results, we show that a) data-driven pruning outperforms magnitude-driven in several scenarios; b) incremental pruning achieves higher accuracy compared to one-shot pruning, especially when targeting smaller sizes; and c) low-rank approximation presents the best trade-off between size reduction and inference speed-up for moderate compression.

* 8 pages, 3 figures
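
As a rough illustration of why low-rank approximation reduces model size, the arithmetic below compares a dense weight matrix with a generic two-factor decomposition. It shows the textbook factorization only, not the incremental variant the authors propose; the function names and the 1024 x 4096 layer shape are assumptions chosen for the example.

```python
def dense_params(rows, cols):
    """Parameter count of a dense rows x cols weight matrix."""
    return rows * cols


def low_rank_params(rows, cols, rank):
    """Parameter count of two factors: (rows x rank) and (rank x cols)."""
    return rank * (rows + cols)


def compression_ratio(rows, cols, rank):
    """How many times smaller the factored form is than the dense matrix."""
    return dense_params(rows, cols) / low_rank_params(rows, cols, rank)


# Hypothetical feed-forward layer shape, factored at rank 256.
original = dense_params(1024, 4096)            # 4,194,304 parameters
factored = low_rank_params(1024, 4096, 256)    # 1,310,720 parameters
ratio = compression_ratio(1024, 4096, 256)     # about 3.2x smaller
```

The factored form is smaller only when the rank is below rows * cols / (rows + cols), which is 819.2 for this shape.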

Space-Efficient Representation of Entity-centric Query Language Models

Jun 29, 2022

Christophe Van Gysel, Mirko Hannemann, Ernest Pusateri, Youssef Oualil, Ilya Oparin

Abstract: Virtual assistants make use of automatic speech recognition (ASR) to help users answer entity-centric queries. However, spoken entity recognition is a difficult problem, due to the large number of frequently-changing named entities. In addition, resources available for recognition are constrained when ASR is performed on-device. In this work, we investigate the use of probabilistic grammars as language models within the finite-state transducer (FST) framework. We introduce a deterministic approximation to probabilistic grammars that avoids the explicit expansion of non-terminals at model creation time, integrates directly with the FST framework, and is complementary to n-gram models. We obtain a 10% relative word error rate improvement on long-tail entity queries compared to when a similarly-sized n-gram model is used without our method.

* Interspeech '22

A Discriminative Entity-Aware Language Model for Virtual Assistants

Jun 21, 2021

Mandana Saebi, Ernest Pusateri, Aaksha Meghawat, Christophe Van Gysel

Abstract: High-quality automatic speech recognition (ASR) is essential for virtual assistants (VAs) to work well. However, ASR often performs poorly on VA requests containing named entities. In this work, we start from the observation that many ASR errors on named entities are inconsistent with real-world knowledge. We extend previous discriminative n-gram language modeling approaches to incorporate real-world knowledge from a Knowledge Graph (KG), using features that capture entity type-entity and entity-entity relationships. We apply our model through an efficient lattice rescoring process, achieving relative sentence error rate reductions of more than 25% on some synthesized test sets covering less popular entities, with minimal degradation on a uniformly sampled VA test set.

* To appear in Interspeech 2021

Error-driven Pruning of Language Models for Virtual Assistants

Feb 14, 2021

Sashank Gondala, Lyan Verwimp, Ernest Pusateri, Manos Tsagkias, Christophe Van Gysel

Abstract: Language models (LMs) for virtual assistants (VAs) are typically trained on large amounts of data, resulting in prohibitively large models which require excessive memory and/or cannot be used to serve user requests in real time. Entropy pruning results in smaller models but with significant degradation of effectiveness in the tail of the user request distribution. We customize entropy pruning by allowing for a keep list of infrequent n-grams that require a more relaxed pruning threshold, and propose three methods to construct the keep list. Each method has its own advantages and disadvantages with respect to LM size, ASR accuracy and cost of constructing the keep list. Our best LM gives 8% average Word Error Rate (WER) reduction on a targeted test set, but is 3 times larger than the baseline. We also propose discriminative methods to reduce the size of the LM while retaining the majority of the WER gains achieved by the largest LM.

* ICASSP '21. The 46th International IEEE Conference on Acoustics, Speech, and Signal Processing
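
The keep-list idea in this abstract can be illustrated with a minimal sketch. It assumes each n-gram already has a pruning score (standing in for the entropy change that entropy pruning would compute) and simply applies a lower threshold to keep-list entries. The function name, the scores, and both thresholds are hypothetical, and the three keep-list construction methods from the paper are not shown.

```python
def prune_with_keep_list(scores, keep_list, threshold, relaxed_threshold):
    """Return the set of n-grams that survive pruning.

    scores: maps an n-gram (tuple of words) to a pruning score, where a
            higher score means removing the n-gram would hurt the model more.
    keep_list: n-grams judged against the lower, relaxed threshold.
    """
    survivors = set()
    for ngram, score in scores.items():
        limit = relaxed_threshold if ngram in keep_list else threshold
        if score >= limit:
            survivors.add(ngram)
    return survivors


# Made-up scores: the two low-scoring n-grams are equally "cheap" to prune.
scores = {
    ("play", "music"): 0.9,
    ("play", "khruangbin"): 0.2,
    ("uh", "play"): 0.2,
}
keep_list = {("play", "khruangbin")}  # infrequent but important tail n-gram

survivors = prune_with_keep_list(scores, keep_list, threshold=0.5, relaxed_threshold=0.1)
```

With these values the tail n-gram on the keep list is retained, while the equally scored n-gram outside the list is pruned.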

Predicting Entity Popularity to Improve Spoken Entity Recognition by Virtual Assistants

May 26, 2020

Christophe Van Gysel, Manos Tsagkias, Ernest Pusateri, Ilya Oparin

Abstract: We focus on improving the effectiveness of a Virtual Assistant (VA) in recognizing emerging entities in spoken queries. We introduce a method that uses historical user interactions to forecast which entities will gain in popularity and become trending, and it subsequently integrates the predictions within the Automated Speech Recognition (ASR) component of the VA. Experiments show that our proposed approach results in a 20% relative reduction in errors on emerging entity name utterances without degrading the overall recognition quality of the system.

* SIGIR '20. The 43rd International ACM SIGIR Conference on Research & Development in Information Retrieval

Connecting and Comparing Language Model Interpolation Techniques

Aug 26, 2019

Ernest Pusateri, Christophe Van Gysel, Rami Botros, Sameer Badaskar, Mirko Hannemann, Youssef Oualil, Ilya Oparin

Abstract: In this work, we uncover a theoretical connection between two language model interpolation techniques, count merging and Bayesian interpolation. We compare these techniques as well as linear interpolation in three scenarios with abundant training data per component model. Consistent with prior work, we show that both count merging and Bayesian interpolation outperform linear interpolation. We include the first (to our knowledge) published comparison of count merging and Bayesian interpolation, showing that the two techniques perform similarly. Finally, we argue that other considerations will make Bayesian interpolation the preferred approach in most circumstances.
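
For orientation, linear interpolation, the baseline technique named in this abstract, mixes the component models' probabilities with fixed weights that sum to one. The sketch below covers only that baseline for a single fixed history; count merging and Bayesian interpolation are not implemented, and the function name, the two toy distributions, and the weights are assumptions made for the example.

```python
def linear_interpolation(distributions, weights):
    """Mix next-word distributions: P(w) = sum_i weights[i] * P_i(w).

    distributions: list of dicts mapping a word to its probability under one
                   component model, all for the same history.
    weights: one non-negative weight per component, summing to 1.
    """
    if len(distributions) != len(weights):
        raise ValueError("need exactly one weight per component model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    vocabulary = set().union(*distributions)
    return {
        word: sum(lam * dist.get(word, 0.0) for lam, dist in zip(weights, distributions))
        for word in vocabulary
    }


# Two made-up component models for the same history.
general_lm = {"play": 0.6, "call": 0.4}
music_lm = {"play": 0.9, "call": 0.1}

mixed = linear_interpolation([general_lm, music_lm], weights=[0.75, 0.25])
# mixed["play"] is 0.75 * 0.6 + 0.25 * 0.9, and the result still sums to 1.
```

The weights here are constant across histories, which is the defining simplification of linear interpolation.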
