Natalie Delworth

Rev.com

Reverb: Open-Source ASR and Diarization from Rev

Oct 04, 2024

Nishchal Bhandari, Danny Chen, Miguel Ángel del Río Fernández, Natalie Delworth, Jennifer Drexler Fox, Migüel Jetté, Quinten McNamara, Corey Miller, Ondřej Novotný, Ján Profant (+3 more)

Abstract: Today, we are open-sourcing our core speech recognition and diarization models for non-commercial use. We are releasing both a full production pipeline for developers and pared-down research models for experimentation. Rev hopes that these releases will spur research and innovation in the fast-moving domain of voice technology. The speech recognition models released today outperform all existing open-source speech recognition models across a variety of long-form speech recognition domains.

Updated Corpora and Benchmarks for Long-Form Speech Recognition

Sep 26, 2023

Jennifer Drexler Fox, Desh Raj, Natalie Delworth, Quinn McNamara, Corey Miller, Migüel Jetté

Abstract: The vast majority of ASR research uses corpora in which both the training and test data have been pre-segmented into utterances. In most real-world ASR use-cases, however, test audio is not segmented, leading to a mismatch between inference-time conditions and models trained on segmented utterances. In this paper, we re-release three standard ASR corpora - TED-LIUM 3, GigaSpeech, and VoxPopuli-en - with updated transcription and alignments to enable their use for long-form ASR research. We use these reconstituted corpora to study the train-test mismatch problem for transducers and attention-based encoder-decoders (AEDs), confirming that AEDs are more susceptible to this issue. Finally, we benchmark a simple long-form training for these models, showing its efficacy for model robustness under this domain shift.

* Submitted to ICASSP 2024

Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model

Sep 02, 2022

Jennifer Drexler Fox, Natalie Delworth

Abstract: Contextual ASR, which takes a list of bias terms as input along with audio, has drawn recent interest as ASR use becomes more widespread. We are releasing contextual biasing lists to accompany the Earnings21 dataset, creating a public benchmark for this task. We present baseline results on this benchmark using a pretrained end-to-end ASR model from the WeNet toolkit. We show results for shallow fusion contextual biasing applied to two different decoding algorithms. Our baseline results confirm observations that end-to-end models struggle in particular with words that are rarely or never seen during training, and that existing shallow fusion techniques do not adequately address this problem. We propose an alternate spelling prediction model that improves recall of rare words by 34.7% relative and of out-of-vocabulary words by 97.2% relative, compared to contextual biasing without alternate spellings. This model is conceptually similar to ones used in prior work, but is simpler to implement as it does not rely on either a pronunciation dictionary or an existing text-to-speech system.

Earnings-21: A Practical Benchmark for ASR in the Wild

Apr 28, 2021

Miguel Del Rio, Natalie Delworth, Ryan Westerman, Michelle Huang, Nishchal Bhandari, Joseph Palakapilly, Quinten McNamara, Joshua Dong, Piotr Zelasko, Miguel Jette

Abstract: Commonly used speech corpora inadequately challenge academic and commercial ASR systems. In particular, speech corpora lack metadata needed for detailed analysis and WER measurement. In response, we present Earnings-21, a 39-hour corpus of earnings calls containing entity-dense speech from nine different financial sectors. This corpus is intended to benchmark ASR systems in the wild with special attention towards named entity recognition. We benchmark four commercial ASR models, two internal models built with open-source tools, and an open-source LibriSpeech model and discuss their differences in performance on Earnings-21. Using our recently released fstalign tool, we provide a candid analysis of each model's recognition capabilities under different partitions. Our analysis finds that ASR accuracy for certain NER categories is poor, presenting a significant impediment to transcript comprehension and usage. Earnings-21 bridges academic and commercial ASR system evaluation and enables further research on entity modeling and WER on real-world audio.

* submitted to INTERSPEECH 2021. Update April 28th, 2021: We found and resolved an issue in our experimental evaluation that scored the LibriSpeech model at ~20% worse relative WER than the actual WER. The updated results do not affect our conclusions.

Accented Speech Recognition: A Survey

Apr 21, 2021

Arthur Hinsvark, Natalie Delworth, Miguel Del Rio, Quinten McNamara, Joshua Dong, Ryan Westerman, Michelle Huang, Joseph Palakapilly, Jennifer Drexler, Ilya Pirkin (+2 more)

Abstract: Automatic Speech Recognition (ASR) systems generalize poorly on accented speech. The phonetic and linguistic variability of accents presents hard challenges for ASR systems today in both data collection and modeling strategies. The resulting bias in ASR performance across accents comes at a cost to both users and providers of ASR. We present a survey of current promising approaches to accented speech recognition and highlight the key challenges in the space. Approaches mostly focus on single model generalization and accent feature engineering. Among the challenges, lack of a standard benchmark makes research and comparison especially difficult.

* submitted to INTERSPEECH 2021
