Shuo-Yiin Chang

Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study

Jan 23, 2024

W. Ronny Huang, Cyril Allauzen, Tongzhou Chen, Kilol Gupta, Ke Hu, James Qin, Yu Zhang, Yongqiang Wang, Shuo-Yiin Chang, Tara N. Sainath

Abstract: In the era of large models, the autoregressive nature of decoding often results in latency serving as a significant bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) and the PaLM 2 language model in per-segment scoring mode, achieving an average relative WER improvement across all languages of 10.8% on FLEURS and 3.6% on YouTube captioning. Furthermore, our comprehensive ablation study analyzes key parameters such as LLM size, context length, vocabulary size, and fusion methodology. For instance, we explore the impact of LLM size ranging from 128M to 340B parameters on ASR performance. This study provides valuable insights into the factors influencing the effectiveness of practical large-scale LM-fused speech recognition systems.

* ICASSP 2024
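
For readers unfamiliar with LM fusion by scoring, the following is a minimal sketch of rescoring one segment's candidate transcripts. It assumes a generic log-linear combination of an ASR score and an LM score; the abstract does not state the paper's exact fusion formula, and the names `rescore_segment`, `lm_log_prob`, `lm_weight`, and the toy scores are all hypothetical.

```python
from typing import Callable, List, Tuple


def rescore_segment(
    hypotheses: List[Tuple[str, float]],
    lm_log_prob: Callable[[str], float],
    lm_weight: float = 0.3,
) -> str:
    """Pick the best hypothesis for one segment.

    hypotheses: (text, asr_log_score) pairs for a single audio segment.
    lm_log_prob: hypothetical callable returning an LM log-probability for a
        whole candidate string. Each candidate is scored independently, so
        the calls could be batched in parallel.
    lm_weight: assumed interpolation weight; not a value from the paper.
    """
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    # Log-linear combination: ASR score plus weighted LM score.
    return max(
        hypotheses,
        key=lambda h: h[1] + lm_weight * lm_log_prob(h[0]),
    )[0]


# Toy stand-in for an LM: a lookup table of made-up log-probabilities.
TOY_LM = {"recognize speech": -2.0, "wreck a nice beach": -9.0}


def toy_lm(text: str) -> float:
    return TOY_LM[text]


candidates = [("recognize speech", -1.2), ("wreck a nice beach", -1.0)]
best = rescore_segment(candidates, toy_lm, lm_weight=0.3)
print(best)
```

With `lm_weight=0.0` the ASR score alone decides, so the second candidate would win in this toy case; a positive weight lets the LM overturn it.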

E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

Nov 28, 2022

W. Ronny Huang, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, David Rybach, Robert David, Rohit Prabhavalkar, Cyril Allauzen, Cal Peyser, Trevor D. Strohman

Abstract: We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model. A key challenge is allowing the segmenter (which runs in real-time, synchronously with the decoder) to finalize the 2nd pass (which runs 900 ms behind real-time) without introducing user-perceived latency or deletion errors during inference. We propose a design where the neural segmenter is integrated with the causal 1st pass decoder to emit an end-of-segment (EOS) signal in real-time. The EOS signal is then used to finalize the non-causal 2nd pass. We experiment with different ways to finalize the 2nd pass, and find that a novel dummy frame injection strategy allows for simultaneous high quality 2nd pass results and low finalization latency. On a real-world long-form captioning task (YouTube), we achieve 2.4% relative WER and 140 ms EOS latency gains over a baseline VAD-based segmenter with the same cascaded encoder.
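
As a reading aid, a relative WER gain is usually computed as the reduction divided by the baseline WER. The sketch below shows that arithmetic; the function name and the absolute WER values are hypothetical and chosen only so that the result comes out to 2.4%, since the abstract does not report absolute WERs.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical absolute WERs (in percent), not figures from the paper.
gain = relative_wer_reduction(baseline_wer=10.0, new_wer=9.76)
print(f"{gain:.1%}")  # prints "2.4%"
```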

On Neural Phone Recognition of Mixed-Source ECoG Signals

Dec 12, 2019

Ahmed Hussen Abdelaziz, Shuo-Yiin Chang, Nelson Morgan, Erik Edwards, Dorothea Kolossa, Dan Ellis, David A. Moses, Edward F. Chang

Abstract: The emerging field of neural speech recognition (NSR) using electrocorticography has recently attracted remarkable research interest for studying how human brains recognize speech in quiet and noisy surroundings. In this study, we demonstrate the utility of NSR systems to objectively prove the ability of human beings to attend to a single speech source while suppressing the interfering signals in a simulated cocktail party scenario. The experimental results show that the relative degradation of the NSR system performance when tested in a mixed-source scenario is significantly lower than that of automatic speech recognition (ASR). In this paper, we have significantly enhanced the performance of our recently published framework by using manual alignments for initialization instead of the flat start technique. We have also improved the NSR system performance by accounting for the possible transcription mismatch between the acoustic and neural signals.

* 5 pages, showing algorithms, results and references from our collaboration during a 2017 postdoc stay of the first author
