Title: Conformer-based Target-Speaker Automatic Speech Recognition for Single-Channel Audio

Aug 09, 2023

Yang Zhang, Krishna C. Puvvada, Vitaly Lavrukhin, Boris Ginsburg


Abstract: We propose CONF-TSASR, a non-autoregressive end-to-end time-frequency domain architecture for single-channel target-speaker automatic speech recognition (TS-ASR). The model consists of a TitaNet-based speaker embedding module, a Conformer-based masking module, and a Conformer-based ASR module. These modules are jointly optimized to transcribe a target speaker while ignoring speech from other speakers. For training, we use Connectionist Temporal Classification (CTC) loss and introduce a scale-invariant spectrogram reconstruction loss to encourage the model to better separate the target speaker's spectrogram from the mixture. We obtain state-of-the-art target-speaker word error rate (TS-WER) on WSJ0-2mix-extr (4.2%). Further, we report for the first time TS-WER on the WSJ0-3mix-extr (12.4%), LibriSpeech2Mix (4.2%) and LibriSpeech3Mix (7.6%) datasets, establishing new benchmarks for TS-ASR. The proposed model will be open-sourced through the NVIDIA NeMo toolkit.
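
The abstract names a scale-invariant spectrogram reconstruction loss but does not give its formula. As a rough illustration only, the sketch below adapts the familiar SI-SDR idea to magnitude spectrograms: the reference is rescaled by its optimal projection coefficient before the residual is measured, so multiplying the estimate by a constant leaves the loss essentially unchanged. The function name `si_spectrogram_loss`, the `eps` constant, and the exact formulation are assumptions for illustration, not the paper's definition.

```python
import numpy as np


def si_spectrogram_loss(est, ref, eps=1e-8):
    """SI-SDR-style loss between two spectrograms (hypothetical formulation).

    est, ref: arrays of the same shape, e.g. (freq_bins, frames).
    Returns the negative scale-invariant ratio in dB (lower is better).
    """
    est = np.asarray(est, dtype=np.float64).ravel()
    ref = np.asarray(ref, dtype=np.float64).ravel()

    # Optimal scaling of the reference onto the estimate (least squares).
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref

    # Whatever the scaled reference does not explain counts as error.
    residual = est - target

    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return -10.0 * np.log10(ratio)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = np.abs(rng.normal(size=(80, 50)))           # stand-in target spectrogram
    close = ref + 0.01 * rng.normal(size=ref.shape)   # good separation
    far = ref + 1.0 * rng.normal(size=ref.shape)      # poor separation

    print("close estimate:", si_spectrogram_loss(close, ref))
    print("far estimate:  ", si_spectrogram_loss(far, ref))
    print("rescaled close:", si_spectrogram_loss(3.0 * close, ref))
```

In a training setup like the one the abstract describes, a term of this kind would be combined with the CTC loss; the weighting between the two is not stated in the abstract and is left out here.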
