Title: XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models

Jul 05, 2024

Shashi Kumar, Srikanth Madikeri, Juan Zuluaga-Gomez, Esaú Villatoro-Tello, Iuliia Nigmatulina, Petr Motlicek, Manjunath K E, Aravind Ganapathiraju



Abstract: Self-supervised pretrained models exhibit competitive performance in automatic speech recognition when finetuned, even with limited in-domain supervised training data. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce XLSR-Transducer, where the XLSR-53 model is used as the encoder in a transducer setup. Our experiments on the AMI dataset reveal that the XLSR-Transducer achieves a 4% absolute WER improvement over Whisper large-v2 and 8% over a Zipformer transducer model trained from scratch. To enable streaming capabilities, we investigate different attention masking patterns in the self-attention computation of the transformer layers within the XLSR-53 model. We validate XLSR-Transducer on AMI and on 5 languages from CommonVoice under low-resource scenarios. Finally, with the introduction of attention sinks, we reduce the left context by half while achieving a relative 12% improvement in WER.

* 5 pages, double column
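
The abstract mentions attention masking patterns and attention sinks without spelling them out. The following is a minimal sketch of one plausible chunked masking scheme, not the authors' implementation. The function name `streaming_attention_mask`, its parameters, the fixed-size chunking, and the choice to treat the first frames of the utterance as sinks are all assumptions made for illustration.

```python
def streaming_attention_mask(num_frames, chunk_size, left_chunks, num_sinks=0):
    """Build a boolean mask where mask[q][k] is True if query frame q
    may attend to key frame k.

    Assumptions (hypothetical, not taken from the paper):
    - frames are grouped into fixed-size chunks;
    - a query sees its whole current chunk plus `left_chunks` earlier chunks;
    - the first `num_sinks` frames of the utterance stay visible to every
      query (the "attention sink" idea), even outside the left context.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if left_chunks < 0 or num_sinks < 0:
        raise ValueError("left_chunks and num_sinks must be non-negative")

    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        # First visible frame: start of the oldest allowed left-context chunk.
        first_visible = max(0, (q_chunk - left_chunks) * chunk_size)
        # One past the last visible frame: end of the query's own chunk.
        last_visible = min(num_frames, (q_chunk + 1) * chunk_size)
        row = [
            (first_visible <= k < last_visible) or (k < num_sinks)
            for k in range(num_frames)
        ]
        mask.append(row)
    return mask


def render(mask):
    """Render the mask as text: '#' = may attend, '.' = masked out."""
    return "\n".join("".join("#" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    # 8 frames, chunks of 2 frames, 1 chunk of left context, 1 sink frame.
    print(render(streaming_attention_mask(8, 2, 1, num_sinks=1)))
```

In an actual encoder, a mask like this would typically be applied by setting the masked attention scores to negative infinity before the softmax. Halving `left_chunks` while keeping a few sink frames visible mirrors the trade-off described in the abstract, though the exact configuration used in the paper is not given here.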
