Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Cal Peyser

Text Injection for Neural Contextual Biasing

Jun 05, 2024

Zhong Meng, Zelin Wu, Rohit Prabhavalkar, Cal Peyser, Weiran Wang, Nanxin Chen, Tara N. Sainath, Bhuvana Ramabhadran

Figure 1 for Text Injection for Neural Contextual Biasing

Figure 2 for Text Injection for Neural Contextual Biasing

Figure 3 for Text Injection for Neural Contextual Biasing

Figure 4 for Text Injection for Neural Contextual Biasing

Abstract:Neural contextual biasing effectively improves automatic speech recognition (ASR) for crucial phrases within a speaker's context, particularly those that are infrequent in the training data. This work proposes contextual text injection (CTI) to enhance contextual ASR. CTI leverages not only the paired speech-text data, but also a much larger corpus of unpaired text to optimize the ASR model and its biasing component. Unpaired text is converted into speech-like representations and used to guide the model's attention towards relevant bias phrases. Moreover, we introduce a contextual text-injected (CTI) minimum word error rate (MWER) training, which minimizes the expected WER caused by contextual biasing when unpaired text is injected into the model. Experiments show that CTI with 100 billion text sentences can achieve up to 43.3% relative WER reduction from a strong neural biasing model. CTI-MWER provides a further relative improvement of 23.5%.

* Interspeech 2024, Kos Island, Greece
* 5 pages, 1 figure

Via

Access Paper or Ask Questions

Improving Joint Speech-Text Representations Without Alignment

Aug 11, 2023

Cal Peyser, Zhong Meng, Ke Hu, Rohit Prabhavalkar, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho

Figure 1 for Improving Joint Speech-Text Representations Without Alignment

Figure 2 for Improving Joint Speech-Text Representations Without Alignment

Figure 3 for Improving Joint Speech-Text Representations Without Alignment

Figure 4 for Improving Joint Speech-Text Representations Without Alignment

Abstract:The last year has seen astonishing progress in text-prompted image generation premised on the idea of a cross-modal representation space in which the text and image domains are represented jointly. In ASR, this idea has found application as joint speech-text encoders that can scale to the capacities of very large parameter models by being trained on both unpaired speech and text. While these methods show promise, they have required special treatment of the sequence-length mismatch inherent in speech and text, either by up-sampling heuristics or an explicit alignment model. In this work, we offer evidence that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length, and argue that consistency losses could forgive length differences and simply assume the best alignment. We show that such a loss improves downstream WER in both a large-parameter monolingual and multilingual system.

* INTERSPEECH 2023

Via

Access Paper or Ask Questions

A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale

Apr 19, 2023

Cal Peyser, Michael Picheny, Kyunghyun Cho, Rohit Prabhavalkar, Ronny Huang, Tara Sainath

Figure 1 for A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale

Figure 2 for A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale

Figure 3 for A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale

Figure 4 for A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale

Abstract:Unpaired text and audio injection have emerged as dominant methods for improving ASR performance in the absence of a large labeled corpus. However, little guidance exists on deploying these methods to improve production ASR systems that are trained on very large supervised corpora and with realistic requirements like a constrained model size and CPU budget, streaming capability, and a rich lattice for rescoring and for downstream NLU tasks. In this work, we compare three state-of-the-art semi-supervised methods encompassing both unpaired text and audio as well as several of their combinations in a controlled setting using joint training. We find that in our setting these methods offer many improvements beyond raw WER, including substantial gains in tail-word WER, decoder computation during inference, and lattice density.

* 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Via

Access Paper or Ask Questions

Dual Learning for Large Vocabulary On-Device ASR

Jan 11, 2023

Cal Peyser, Ronny Huang, Tara Sainath, Rohit Prabhavalkar, Michael Picheny, Kyunghyun Cho

Abstract:Dual learning is a paradigm for semi-supervised machine learning that seeks to leverage unsupervised data by solving two opposite tasks at once. In this scheme, each model is used to generate pseudo-labels for unlabeled examples that are used to train the other model. Dual learning has seen some use in speech processing by pairing ASR and TTS as dual tasks. However, these results mostly address only the case of using unpaired examples to compensate for very small supervised datasets, and mostly on large, non-streaming models. Dual learning has not yet been proven effective for using unsupervised data to improve realistic on-device streaming models that are already trained on large supervised corpora. We provide this missing piece though an analysis of an on-device-sized streaming conformer trained on the entirety of Librispeech, showing relative WER improvements of 10.7%/5.2% without an LM and 11.7%/16.4% with an LM.

Via

Access Paper or Ask Questions

E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

Nov 28, 2022

W. Ronny Huang, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, David Rybach, Robert David, Rohit Prabhavalkar, Cyril Allauzen, Cal Peyser, Trevor D. Strohman

Figure 1 for E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

Figure 2 for E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

Figure 3 for E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

Figure 4 for E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

Abstract:We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model. A key challenge is allowing the segmenter (which runs in real-time, synchronously with the decoder) to finalize the 2nd pass (which runs 900 ms behind real-time) without introducing user-perceived latency or deletion errors during inference. We propose a design where the neural segmenter is integrated with the causal 1st pass decoder to emit a end-of-segment (EOS) signal in real-time. The EOS signal is then used to finalize the non-causal 2nd pass. We experiment with different ways to finalize the 2nd pass, and find that a novel dummy frame injection strategy allows for simultaneous high quality 2nd pass results and low finalization latency. On a real-world long-form captioning task (YouTube), we achieve 2.4% relative WER and 140 ms EOS latency gains over a baseline VAD-based segmenter with the same cascaded encoder.

Via

Access Paper or Ask Questions

Towards Disentangled Speech Representations

Aug 28, 2022

Cal Peyser, Ronny Huang Andrew Rosenberg Tara N. Sainath, Michael Picheny, Kyunghyun Cho

Figure 1 for Towards Disentangled Speech Representations

Figure 2 for Towards Disentangled Speech Representations

Figure 3 for Towards Disentangled Speech Representations

Figure 4 for Towards Disentangled Speech Representations

Abstract:The careful construction of audio representations has become a dominant feature in the design of approaches to many speech tasks. Increasingly, such approaches have emphasized "disentanglement", where a representation contains only parts of the speech signal relevant to transcription while discarding irrelevant information. In this paper, we construct a representation learning task based on joint modeling of ASR and TTS, and seek to learn a representation of audio that disentangles that part of the speech signal that is relevant to transcription from that part which is not. We present empirical evidence that successfully finding such a representation is tied to the randomness inherent in training. We then make the observation that these desired, disentangled solutions to the optimization problem possess unique statistical properties. Finally, we show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task. These observations motivate a novel approach to learning effective audio representations.

Via

Access Paper or Ask Questions

E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

Apr 22, 2022

W. Ronny Huang, Shuo-yiin Chang, David Rybach, Rohit Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu

Figure 1 for E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

Figure 2 for E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

Figure 3 for E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

Figure 4 for E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

Abstract:Improving the performance of end-to-end ASR models on long utterances ranging from minutes to hours in length is an ongoing challenge in speech recognition. A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundary locations based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for real-world speech where, e.g., a complete sentence that should be taken as a whole may contain hesitations in the middle ("set an alarm for... 5 o'clock"). We propose to replace the VAD with an end-to-end ASR model capable of predicting segment boundaries in a streaming fashion, allowing the segmentation decision to be conditioned not only on better acoustic features but also on semantic features from the decoded text with negligible extra computation. In experiments on real world long-form audio (YouTube) with lengths of up to 30 minutes, we demonstrate 8.5% relative WER improvement and 250 ms reduction in median end-of-segment latency compared to the VAD segmenter baseline on a state-of-the-art Conformer RNN-T model.

Via

Access Paper or Ask Questions

Improving Rare Word Recognition with LM-aware MWER Training

Apr 15, 2022

Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prabhavalkar, Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser(+3 more)

Figure 1 for Improving Rare Word Recognition with LM-aware MWER Training

Figure 2 for Improving Rare Word Recognition with LM-aware MWER Training

Figure 3 for Improving Rare Word Recognition with LM-aware MWER Training

Figure 4 for Improving Rare Word Recognition with LM-aware MWER Training

Abstract:Language models (LMs) significantly improve the recognition accuracy of end-to-end (E2E) models on words rarely seen during training, when used in either the shallow fusion or the rescoring setups. In this work, we introduce LMs in the learning of hybrid autoregressive transducer (HAT) models in the discriminative training framework, to mitigate the training versus inference gap regarding the use of LMs. For the shallow fusion setup, we use LMs during both hypotheses generation and loss computation, and the LM-aware MWER-trained model achieves 10\% relative improvement over the model trained with standard MWER on voice search test sets containing rare words. For the rescoring setup, we learn a small neural module to generate per-token fusion weights in a data-dependent manner. This model achieves the same rescoring WER as regular MWER-trained model, but without the need for sweeping fusion weights.

* In submission to INTERSPEECH 2022

Via

Access Paper or Ask Questions

Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

Mar 09, 2022

W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor Strohman, Shankar Kumar

Figure 1 for Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

Figure 2 for Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

Figure 3 for Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

Figure 4 for Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

Abstract:Language model fusion helps smart assistants recognize words which are rare in acoustic data but abundant in text-only corpora (typed search logs). However, such corpora have properties that hinder downstream performance, including being (1) too large, (2) beset with domain-mismatched content, and (3) heavy-headed rather than heavy-tailed (excessively many duplicate search queries such as "weather"). We show that three simple strategies for selecting language modeling data can dramatically improve rare-word recognition without harming overall performance. First, to address the heavy-headedness, we downsample the data according to a soft log function, which tunably reduces high frequency (head) sentences. Second, to encourage rare-word exposure, we explicitly filter for words rare in the acoustic data. Finally, we tackle domain-mismatch via perplexity-based contrastive selection, filtering for examples matched to the target domain. We down-select a large corpus of web search queries by a factor of 53x and achieve better LM perplexities than without down-selection. When shallow-fused with a state-of-the-art, production speech engine, our LM achieves WER reductions of up to 24% relative on rare-word sentences (without changing overall WER) compared to a baseline LM trained on the raw corpus. These gains are further validated through favorable side-by-side evaluations on live voice search traffic.

Via

Access Paper or Ask Questions

Lookup-Table Recurrent Language Models for Long Tail Speech Recognition

Apr 09, 2021

W. Ronny Huang, Tara N. Sainath, Cal Peyser, Shankar Kumar, David Rybach, Trevor Strohman

Figure 1 for Lookup-Table Recurrent Language Models for Long Tail Speech Recognition

Figure 2 for Lookup-Table Recurrent Language Models for Long Tail Speech Recognition

Figure 3 for Lookup-Table Recurrent Language Models for Long Tail Speech Recognition

Figure 4 for Lookup-Table Recurrent Language Models for Long Tail Speech Recognition

Abstract:We introduce Lookup-Table Language Models (LookupLM), a method for scaling up the size of RNN language models with only a constant increase in the floating point operations, by increasing the expressivity of the embedding table. In particular, we instantiate an (additional) embedding table which embeds the previous n-gram token sequence, rather than a single token. This allows the embedding table to be scaled up arbitrarily -- with a commensurate increase in performance -- without changing the token vocabulary. Since embeddings are sparsely retrieved from the table via a lookup; increasing the size of the table adds neither extra operations to each forward pass nor extra parameters that need to be stored on limited GPU/TPU memory. We explore scaling n-gram embedding tables up to nearly a billion parameters. When trained on a 3-billion sentence corpus, we find that LookupLM improves long tail log perplexity by 2.44 and long tail WER by 23.4% on a downstream speech recognition task over a standard RNN language model baseline, an improvement comparable to a scaling up the baseline by 6.2x the number of floating point operations.

* Submitted to Interspeech 2021

Via

Access Paper or Ask Questions