Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Sankaran Panchapagesan

A Conformer-based Waveform-domain Neural Acoustic Echo Canceller Optimized for ASR Accuracy

May 06, 2022

Sankaran Panchapagesan, Arun Narayanan, Turaj Zakizadeh Shabestary, Shuai Shao, Nathan Howard, Alex Park, James Walker, Alexander Gruenstein

Figure 1 for A Conformer-based Waveform-domain Neural Acoustic Echo Canceller Optimized for ASR Accuracy

Figure 2 for A Conformer-based Waveform-domain Neural Acoustic Echo Canceller Optimized for ASR Accuracy

Figure 3 for A Conformer-based Waveform-domain Neural Acoustic Echo Canceller Optimized for ASR Accuracy

Figure 4 for A Conformer-based Waveform-domain Neural Acoustic Echo Canceller Optimized for ASR Accuracy

Abstract:Acoustic Echo Cancellation (AEC) is essential for accurate recognition of queries spoken to a smart speaker that is playing out audio. Previous work has shown that a neural AEC model operating on log-mel spectral features (denoted "logmel" hereafter) can greatly improve Automatic Speech Recognition (ASR) accuracy when optimized with an auxiliary loss utilizing a pre-trained ASR model encoder. In this paper, we develop a conformer-based waveform-domain neural AEC model inspired by the "TasNet" architecture. The model is trained by jointly optimizing Negative Scale-Invariant SNR (SISNR) and ASR losses on a large speech dataset. On a realistic rerecorded test set, we find that cascading a linear adaptive AEC and a waveform-domain neural AEC is very effective, giving 56-59% word error rate (WER) reduction over the linear AEC alone. On this test set, the 1.6M parameter waveform-domain neural AEC also improves over a larger 6.5M parameter logmel-domain neural AEC model by 20-29% in easy to moderate conditions. By operating on smaller frames, the waveform neural model is able to perform better at smaller sizes and is better suited for applications where memory is limited.

* Submitted to Interspeech 2022

Via

Access Paper or Ask Questions

Mask scalar prediction for improving robust automatic speech recognition

Apr 26, 2022

Arun Narayanan, James Walker, Sankaran Panchapagesan, Nathan Howard, Yuma Koizumi

Figure 1 for Mask scalar prediction for improving robust automatic speech recognition

Figure 2 for Mask scalar prediction for improving robust automatic speech recognition

Figure 3 for Mask scalar prediction for improving robust automatic speech recognition

Figure 4 for Mask scalar prediction for improving robust automatic speech recognition

Abstract:Using neural network based acoustic frontends for improving robustness of streaming automatic speech recognition (ASR) systems is challenging because of the causality constraints and the resulting distortion that the frontend processing introduces in speech. Time-frequency masking based approaches have been shown to work well, but they need additional hyper-parameters to scale the mask to limit speech distortion. Such mask scalars are typically hand-tuned and chosen conservatively. In this work, we present a technique to predict mask scalars using an ASR-based loss in an end-to-end fashion, with minimal increase in the overall model size and complexity. We evaluate the approach on two robust ASR tasks: multichannel enhancement in the presence of speech and non-speech noise, and acoustic echo cancellation (AEC). Results show that the presented algorithm consistently improves word error rate (WER) without the need for any additional tuning over strong baselines that use hand-tuned hyper-parameters: up to 16% for multichannel enhancement in noisy conditions, and up to 7% for AEC.

* Submitted to Interspeech 2022

Via

Access Paper or Ask Questions

SNRi Target Training for Joint Speech Enhancement and Recognition

Nov 01, 2021

Yuma Koizumi, Shigeki Karita, Arun Narayanan, Sankaran Panchapagesan, Michiel Bacchiani

Figure 1 for SNRi Target Training for Joint Speech Enhancement and Recognition

Figure 2 for SNRi Target Training for Joint Speech Enhancement and Recognition

Figure 3 for SNRi Target Training for Joint Speech Enhancement and Recognition

Figure 4 for SNRi Target Training for Joint Speech Enhancement and Recognition

Abstract:This study aims to improve the performance of automatic speech recognition (ASR) under noisy conditions. The use of a speech enhancement (SE) frontend has been widely studied for noise robust ASR. However, most single-channel SE models introduce processing artifacts in the enhanced speech resulting in degraded ASR performance. To overcome this problem, we propose Signal-to-Noise Ratio improvement (SNRi) target training; the SE frontend automatically controls its noise reduction level to avoid degrading the ASR performance due to artifacts. The SE frontend uses an auxiliary scalar input which represents the target SNRi of the output signal. The target SNRi value is estimated by the SNRi prediction network, which is trained to minimize the ASR loss. Experiments using 55,027 hours of noisy speech training data show that SNRi target training enables control of the SNRi of the output signal, and the joint training reduces word error rate by 12% compared to a state-of-the-art Conformer-based ASR model.

* Submitted to ICASSP 2022

Via

Access Paper or Ask Questions

Data Augmentation for Robust Keyword Spotting under Playback Interference

Aug 01, 2018

Anirudh Raju, Sankaran Panchapagesan, Xing Liu, Arindam Mandal, Nikko Strom

Figure 1 for Data Augmentation for Robust Keyword Spotting under Playback Interference

Figure 2 for Data Augmentation for Robust Keyword Spotting under Playback Interference

Figure 3 for Data Augmentation for Robust Keyword Spotting under Playback Interference

Figure 4 for Data Augmentation for Robust Keyword Spotting under Playback Interference

Abstract:Accurate on-device keyword spotting (KWS) with low false accept and false reject rate is crucial to customer experience for far-field voice control of conversational agents. It is particularly challenging to maintain low false reject rate in real world conditions where there is (a) ambient noise from external sources such as TV, household appliances, or other speech that is not directed at the device (b) imperfect cancellation of the audio playback from the device, resulting in residual echo, after being processed by the Acoustic Echo Cancellation (AEC) system. In this paper, we propose a data augmentation strategy to improve keyword spotting performance under these challenging conditions. The training set audio is artificially corrupted by mixing in music and TV/movie audio, at different signal to interference ratios. Our results show that we get around 30-45% relative reduction in false reject rates, at a range of false alarm rates, under audio playback from such devices.

Via

Access Paper or Ask Questions

Max-Pooling Loss Training of Long Short-Term Memory Networks for Small-Footprint Keyword Spotting

May 05, 2017

Ming Sun, Anirudh Raju, George Tucker, Sankaran Panchapagesan, Gengshen Fu, Arindam Mandal, Spyros Matsoukas, Nikko Strom, Shiv Vitaladevuni

Figure 1 for Max-Pooling Loss Training of Long Short-Term Memory Networks for Small-Footprint Keyword Spotting

Figure 2 for Max-Pooling Loss Training of Long Short-Term Memory Networks for Small-Footprint Keyword Spotting

Figure 3 for Max-Pooling Loss Training of Long Short-Term Memory Networks for Small-Footprint Keyword Spotting

Figure 4 for Max-Pooling Loss Training of Long Short-Term Memory Networks for Small-Footprint Keyword Spotting

Abstract:We propose a max-pooling based loss function for training Long Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low CPU, memory, and latency requirements. The max-pooling loss training can be further guided by initializing with a cross-entropy loss trained network. A posterior smoothing based evaluation approach is employed to measure keyword spotting performance. Our experimental results show that LSTM models trained using cross-entropy loss or max-pooling loss outperform a cross-entropy loss trained baseline feed-forward Deep Neural Network (DNN). In addition, max-pooling loss trained LSTM with randomly initialized network performs better compared to cross-entropy loss trained LSTM. Finally, the max-pooling loss trained LSTM initialized with a cross-entropy pre-trained network shows the best performance, which yields $67.6\%$ relative reduction compared to baseline feed-forward DNN in Area Under the Curve (AUC) measure.

* Spoken Language Technology Workshop (SLT), 2016 IEEE (pp. 474-480). IEEE

Via

Access Paper or Ask Questions