Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Vladimir Bataev

NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding

May 28, 2025

Vladimir Bataev, Andrei Andrusenko, Lilit Grigoryan, Aleksandr Laptev, Vitaly Lavrukhin, Boris Ginsburg

Abstract:Statistical n-gram language models are widely used for context-biasing tasks in Automatic Speech Recognition (ASR). However, existing implementations lack computational efficiency due to poor parallelization, making context-biasing less appealing for industrial use. This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types - including transducers, attention encoder-decoder models, and CTC - with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search. The implementation of the proposed NGPU-LM is open-sourced.

* Accepted to Interspeech 2025

Via

Access Paper or Ask Questions

WIND: Accelerated RNN-T Decoding with Windowed Inference for Non-blank Detection

May 19, 2025

Hainan Xu, Vladimir Bataev, Lilit Grigoryan, Boris Ginsburg

Abstract:We propose Windowed Inference for Non-blank Detection (WIND), a novel strategy that significantly accelerates RNN-T inference without compromising model accuracy. During model inference, instead of processing frames sequentially, WIND processes multiple frames simultaneously within a window in parallel, allowing the model to quickly locate non-blank predictions during decoding, resulting in significant speed-ups. We implement WIND for greedy decoding, batched greedy decoding with label-looping techniques, and also propose a novel beam-search decoding method. Experiments on multiple datasets with different conditions show that our method, when operating in greedy modes, speeds up as much as 2.4X compared to the baseline sequential approach while maintaining identical Word Error Rate (WER) performance. Our beam-search algorithm achieves slightly better accuracy than alternative methods, with significantly improved speed. We will open-source our WIND implementation.

Via

Access Paper or Ask Questions

RNN-Transducer-based Losses for Speech Recognition on Noisy Targets

Apr 09, 2025

Vladimir Bataev

Abstract:Training speech recognition systems on noisy transcripts is a significant challenge in industrial pipelines, where datasets are enormous and ensuring accurate transcription for every instance is difficult. In this work, we introduce novel loss functions to mitigate the impact of transcription errors in RNN-Transducer models. Our Star-Transducer loss addresses deletion errors by incorporating "skip frame" transitions in the loss lattice, restoring over 90% of the system's performance compared to models trained with accurate transcripts. The Bypass-Transducer loss uses "skip token" transitions to tackle insertion errors, recovering more than 60% of the quality. Finally, the Target-Robust Transducer loss merges these approaches, offering robust performance against arbitrary errors. Experimental results demonstrate that the Target-Robust Transducer loss significantly improves RNN-T performance on noisy data by restoring over 70% of the quality compared to well-transcribed data.

* Final Project Report, Bachelor's Degree in Computer Science, University of London, March 2024

Via

Access Paper or Ask Questions

TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer

Jan 10, 2025

Vladimir Bataev, Subhankar Ghosh, Vitaly Lavrukhin, Jason Li

Abstract:This work introduces TTS-Transducer - a novel architecture for text-to-speech, leveraging the strengths of audio codec models and neural transducers. Transducers, renowned for their superior quality and robustness in speech recognition, are employed to learn monotonic alignments and allow for avoiding using explicit duration predictors. Neural audio codecs efficiently compress audio into discrete codes, revealing the possibility of applying text modeling approaches to speech generation. However, the complexity of predicting multiple tokens per frame from several codebooks, as necessitated by audio codec models with residual quantizers, poses a significant challenge. The proposed system first uses a transducer architecture to learn monotonic alignments between tokenized text and speech codec tokens for the first codebook. Next, a non-autoregressive Transformer predicts the remaining codes using the alignment extracted from transducer loss. The proposed system is trained end-to-end. We show that TTS-Transducer is a competitive and robust alternative to contemporary TTS systems.

* Accepted by ICASSP 2025

Via

Access Paper or Ask Questions

Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR

Oct 03, 2024

Hainan Xu, Travis M. Bartley, Vladimir Bataev, Boris Ginsburg

Abstract:We present \textbf{H}ybrid-\textbf{A}utoregressive \textbf{IN}ference Tr\textbf{AN}sducers (HAINAN), a novel architecture for speech recognition that extends the Token-and-Duration Transducer (TDT) model. Trained with randomly masked predictor network outputs, HAINAN supports both autoregressive inference with all network components and non-autoregressive inference without the predictor. Additionally, we propose a novel semi-autoregressive inference paradigm that first generates an initial hypothesis using non-autoregressive inference, followed by refinement steps where each token prediction is regenerated using parallelized autoregression on the initial hypothesis. Experiments on multiple datasets across different languages demonstrate that HAINAN achieves efficiency parity with CTC in non-autoregressive mode and with TDT in autoregressive mode. In terms of accuracy, autoregressive HAINAN outperforms TDT and RNN-T, while non-autoregressive HAINAN significantly outperforms CTC. Semi-autoregressive inference further enhances the model's accuracy with minimal computational overhead, and even outperforms TDT results in some cases. These results highlight HAINAN's flexibility in balancing accuracy and speed, positioning it as a strong candidate for real-world speech recognition applications.

Via

Access Paper or Ask Questions

Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter

Jun 11, 2024

Andrei Andrusenko, Aleksandr Laptev, Vladimir Bataev, Vitaly Lavrukhin, Boris Ginsburg

Figure 1 for Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter

Figure 2 for Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter

Figure 3 for Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter

Figure 4 for Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter

Abstract:Accurate recognition of rare and new words remains a pressing problem for contextualized Automatic Speech Recognition (ASR) systems. Most context-biasing methods involve modification of the ASR model or the beam-search decoding algorithm, complicating model reuse and slowing down inference. This work presents a new approach to fast context-biasing with CTC-based Word Spotter (CTC-WS) for CTC and Transducer (RNN-T) ASR models. The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates. The valid candidates then replace their greedy recognition counterparts in corresponding frame intervals. A Hybrid Transducer-CTC model enables the CTC-WS application for the Transducer model. The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER compared to baseline methods. The proposed method is publicly available in the NVIDIA NeMo toolkit.

* Accepted by Interspeech 2024

Via

Access Paper or Ask Questions

Label-Looping: Highly Efficient Decoding for Transducers

Jun 10, 2024

Vladimir Bataev, Hainan Xu, Daniel Galvez, Vitaly Lavrukhin, Boris Ginsburg

Figure 1 for Label-Looping: Highly Efficient Decoding for Transducers

Figure 2 for Label-Looping: Highly Efficient Decoding for Transducers

Figure 3 for Label-Looping: Highly Efficient Decoding for Transducers

Figure 4 for Label-Looping: Highly Efficient Decoding for Transducers

Abstract:This paper introduces a highly efficient greedy decoding algorithm for Transducer inference. We propose a novel data structure using CUDA tensors to represent partial hypotheses in a batch that supports parallelized hypothesis manipulations. During decoding, our algorithm maximizes GPU parallelism by adopting a nested-loop design, where the inner loop consumes all blank predictions, while non-blank predictions are handled in the outer loop. Our algorithm is general-purpose and can work with both conventional Transducers and Token-and-Duration Transducers. Experiments show that the label-looping algorithm can bring a speedup up to 2.0X compared to conventional batched decoding algorithms when using batch size 32, and can be combined with other compiler or GPU call-related techniques to bring more speedup. We will open-source our implementation to benefit the research community.

Via

Access Paper or Ask Questions

Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU

Jun 06, 2024

Daniel Galvez, Vladimir Bataev, Hainan Xu, Tim Kaldewey

Figure 1 for Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU

Figure 2 for Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU

Figure 3 for Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU

Figure 4 for Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU

Abstract:The vast majority of inference time for RNN Transducer (RNN-T) models today is spent on decoding. Current state-of-the-art RNN-T decoding implementations leave the GPU idle ~80% of the time. Leveraging a new CUDA 12.4 feature, CUDA graph conditional nodes, we present an exact GPU-based implementation of greedy decoding for RNN-T models that eliminates this idle time. Our optimizations speed up a 1.1 billion parameter RNN-T model end-to-end by a factor of 2.5x. This technique can applied to the "label looping" alternative greedy decoding algorithm as well, achieving 1.7x and 1.4x end-to-end speedups when applied to 1.1 billion parameter RNN-T and Token and Duration Transducer models respectively. This work enables a 1.1 billion parameter RNN-T model to run only 16% slower than a similarly sized CTC model, contradicting the common belief that RNN-T models are not suitable for high throughput inference. The implementation is available in NVIDIA NeMo.

* Interspeech 2024 Proceedings

Via

Access Paper or Ask Questions

Powerful and Extensible WFST Framework for RNN-Transducer Losses

Mar 18, 2023

Aleksandr Laptev, Vladimir Bataev, Igor Gitman, Boris Ginsburg

Abstract:This paper presents a framework based on Weighted Finite-State Transducers (WFST) to simplify the development of modifications for RNN-Transducer (RNN-T) loss. Existing implementations of RNN-T use CUDA-related code, which is hard to extend and debug. WFSTs are easy to construct and extend, and allow debugging through visualization. We introduce two WFST-powered RNN-T implementations: (1) "Compose-Transducer", based on a composition of the WFST graphs from acoustic and textual schema -- computationally competitive and easy to modify; (2) "Grid-Transducer", which constructs the lattice directly for further computations -- most compact, and computationally efficient. We illustrate the ease of extensibility through introduction of a new W-Transducer loss -- the adaptation of the Connectionist Temporal Classification with Wild Cards. W-Transducer (W-RNNT) consistently outperforms the standard RNN-T in a weakly-supervised data setup with missing parts of transcriptions at the beginning and end of utterances. All RNN-T losses are implemented with the k2 framework and are available in the NeMo toolkit.

* To appear in Proc. ICASSP 2023, June 04-10, 2023, Rhodes island, Greece. 5 pages, 5 figures, 3 tables

Via

Access Paper or Ask Questions

Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator

Feb 27, 2023

Vladimir Bataev, Roman Korostik, Evgeny Shabalin, Vitaly Lavrukhin, Boris Ginsburg

Figure 1 for Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator

Figure 2 for Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator

Figure 3 for Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator

Figure 4 for Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator

Abstract:We propose an end-to-end ASR system that can be trained on transcribed speech data, text data, or a mixture of both. For text-only training, our extended ASR model uses an integrated auxiliary TTS block that creates mel spectrograms from the text. This block contains a conventional non-autoregressive text-to-mel-spectrogram generator augmented with a GAN enhancer to improve the spectrogram quality. The proposed system can improve the accuracy of the ASR model on a new domain by using text-only data, and allows to significantly surpass conventional audio-text training by using large text corpora.

Via

Access Paper or Ask Questions