Title: Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition

Jan 24, 2021

Cheng Yi, Shiyu Zhou, Bo Xu

Abstract: End-to-end models have achieved impressive results on automatic speech recognition (ASR). For low-resource ASR tasks, however, the available labeled data can hardly satisfy the demands of end-to-end models. Self-supervised acoustic pre-training has already shown strong ASR performance, but the limited transcriptions remain inadequate for language modeling in end-to-end models. In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model. The fused model only needs to learn the transfer from speech to language during fine-tuning on limited labeled data. The lengths of the two modalities are matched by a monotonic attention mechanism that introduces no additional parameters. In addition, a fully connected layer is introduced to map hidden representations between the modalities. We further propose a scheduled fine-tuning strategy to preserve and utilize the text context modeling ability of the pre-trained linguistic encoder. Experiments show that the model makes effective use of the pre-trained modules. Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
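
The abstract does not spell out the length-matching mechanism, so the following is only a minimal sketch of one parameter-free, monotonic way to shrink a long sequence of acoustic frames to a shorter token-level sequence, followed by a fully connected mapping. It assumes an integrate-and-fire style accumulation in which each frame already carries a weight in [0, 1]; how those weights are derived from the acoustic encoder is left open here. The names `integrate_and_fire` and `linear`, the `threshold` parameter, and all numbers are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def integrate_and_fire(frames: Sequence[Sequence[float]],
                       weights: Sequence[float],
                       threshold: float = 1.0) -> List[Vector]:
    """Aggregate acoustic frames into token-level vectors, left to right.

    Assumption: each weight lies in [0, 1] and is supplied by the caller.
    The aggregation itself has no trainable parameters.
    """
    if len(frames) != len(weights):
        raise ValueError("frames and weights must have the same length")
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    dim = len(frames[0]) if frames else 0
    outputs: List[Vector] = []
    acc_weight = 0.0
    acc_vector = [0.0] * dim
    for frame, weight in zip(frames, weights):
        remaining = weight
        # A frame may complete the current token; its leftover weight
        # then starts the next one, which keeps the alignment monotonic.
        while acc_weight + remaining >= threshold:
            used = threshold - acc_weight
            outputs.append([a + used * x for a, x in zip(acc_vector, frame)])
            remaining -= used
            acc_weight = 0.0
            acc_vector = [0.0] * dim
        acc_weight += remaining
        acc_vector = [a + remaining * x for a, x in zip(acc_vector, frame)]
    # Weight left over at the end (below the threshold) emits no token.
    return outputs


def linear(vector: Sequence[float],
           weight: Sequence[Sequence[float]],
           bias: Sequence[float]) -> Vector:
    """Fully connected layer: maps an acoustic vector to the text hidden size."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight, bias)]


# Five 2-dimensional acoustic frames with hypothetical weights.
frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0], [1.0, 1.0]]
weights = [0.5, 0.5, 0.25, 0.75, 0.5]
tokens = integrate_and_fire(frames, weights)

# Project each token-level vector from dimension 2 to dimension 3.
projection = [[1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
mapped = [linear(t, projection, [0.0, 0.0, 0.0]) for t in tokens]
```

In this toy run the five frames collapse into two token-level vectors, and the trailing half unit of weight is discarded. In the fused model described above, the mapped vectors would be what the linguistic encoder consumes; the projection here is fixed only for illustration, whereas a real fully connected layer would be learned during fine-tuning.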
