Zhaofeng Lin

Uncovering the Visual Contribution in Audio-Visual Speech Recognition

Dec 22, 2024

Zhaofeng Lin, Naomi Harte

Abstract: Audio-Visual Speech Recognition (AVSR) combines auditory and visual speech cues to enhance the accuracy and robustness of speech recognition systems. Recent advancements in AVSR have improved performance in noisy environments compared to audio-only counterparts. However, the true extent of the visual contribution, and whether AVSR systems fully exploit the available cues in the visual domain, remains unclear. This paper assesses AVSR systems from a different perspective, by considering human speech perception. We use three systems: Auto-AVSR, AVEC and AV-RelScore. We first quantify the visual contribution using effective SNR gains at 0 dB and then investigate the use of visual information in terms of its temporal distribution and word-level informativeness. We show that low WER does not guarantee high SNR gains. Our results suggest that current methods do not fully exploit visual information, and we recommend future research to report effective SNR gains alongside WERs.

* 5 pages, 2 figures. Accepted to ICASSP 2025
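
As a rough illustration of the "effective SNR gain" idea mentioned in the abstract, the sketch below assumes one common reading of the term: the audio-only SNR at which the audio-only system reaches the same WER as the audio-visual system at a reference SNR, minus that reference SNR. This is an assumption, not necessarily the paper's exact procedure. The function name `effective_snr_gain` and all numbers are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def effective_snr_gain(
    audio_curve: List[Tuple[float, float]],
    av_wer: float,
    ref_snr_db: float = 0.0,
) -> float:
    """Estimate an effective SNR gain in dB (illustrative sketch only).

    audio_curve: (snr_db, wer_percent) points for an audio-only system.
    av_wer:      WER of the audio-visual system measured at ref_snr_db.

    The audio-only curve is linearly interpolated to find the SNR at which
    the audio-only WER equals av_wer; the gain is that SNR minus ref_snr_db.
    """
    points = sorted(audio_curve)  # order the points by SNR
    for (s0, w0), (s1, w1) in zip(points, points[1:]):
        low, high = min(w0, w1), max(w0, w1)
        if w0 != w1 and low <= av_wer <= high:
            # Linear interpolation within the segment that brackets av_wer.
            matched_snr = s0 + (av_wer - w0) * (s1 - s0) / (w1 - w0)
            return matched_snr - ref_snr_db
    raise ValueError("av_wer lies outside the range of the audio-only curve")


# Hypothetical audio-only WERs (percent) at several SNRs (dB).
audio_only = [(-5.0, 60.0), (0.0, 40.0), (5.0, 20.0), (10.0, 10.0)]

# Hypothetical audio-visual WER at 0 dB.
gain_db = effective_snr_gain(audio_only, av_wer=25.0)
print(f"Effective SNR gain: {gain_db:.2f} dB")
```

Under this reading, two systems with similar WERs can still differ in gain, because the gain depends on how each system's audio-only curve behaves around the reference SNR.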

Improving Whispered Speech Recognition Performance using Pseudo-whispered based Data Augmentation

Nov 09, 2023

Zhaofeng Lin, Tanvina Patel, Odette Scharenborg

Abstract: Whispering is a distinct form of speech known for its soft, breathy, and hushed characteristics, often used for private communication. The acoustic characteristics of whispered speech differ substantially from normally phonated speech and the scarcity of adequate training data leads to low automatic speech recognition (ASR) performance. To address the data scarcity issue, we use a signal processing-based technique that transforms the spectral characteristics of normal speech to those of pseudo-whispered speech. We augment an End-to-End ASR with pseudo-whispered speech and achieve an 18.2% relative reduction in word error rate for whispered speech compared to the baseline. Results for the individual speaker groups in the wTIMIT database show the best results for US English. Further investigation showed that the lack of glottal information in whispered speech has the largest impact on whispered speech ASR performance.

* Accepted to ASRU 2023
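
For readers unfamiliar with the "relative reduction in word error rate" figure quoted in the abstract, a minimal sketch of the standard calculation follows. The helper name `relative_wer_reduction` and the example WER values are hypothetical; the abstract reports only the relative figure, not the underlying absolute WERs.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Computed as (baseline - new) / baseline * 100, so a positive value
    means the new system makes proportionally fewer word errors.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# Hypothetical absolute WERs (percent), chosen only for illustration.
baseline = 22.0
augmented = 18.0
print(f"Relative WER reduction: {relative_wer_reduction(baseline, augmented):.1f}%")
```

A relative reduction is measured against the baseline, so it is larger than the absolute drop in percentage points (4.0 points in this hypothetical case).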
