Title: SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

Sep 30, 2022

Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, Lirong Dai, Jinyu Li (+1 more)

Abstract: How to boost speech pre-training with textual data remains an unsolved problem because speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, a phoneme-unit tokenizer and a hidden-unit tokenizer, both of which can be trained using a small amount of paired speech-text data. Based on the trained tokenizers, we convert the unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify the speech and the text into the same discrete semantic space with a unified Transformer network. Leveraging only 10K text sentences, our SpeechLM achieves a 16% relative WER reduction over the best base model performance (from 6.8 to 5.7) on the public LibriSpeech ASR benchmark. Moreover, SpeechLM with fewer parameters even outperforms previous SOTA models on CoVoST-2 speech translation tasks. We also evaluate our SpeechLM on various spoken language processing tasks under the universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Our code and models are available at https://aka.ms/SpeechLM.
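
As a quick check on the headline number, the relative WER reduction quoted in the abstract can be recomputed from the two absolute WER values it gives (6.8 and 5.7). The sketch below is only illustrative arithmetic; the function name `relative_reduction` is a hypothetical helper and is not part of the SpeechLM code release.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the reduction from `baseline` to `improved` as a fraction of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# WER values (in percent) as quoted in the abstract.
baseline_wer = 6.8   # best base model performance
speechlm_wer = 5.7   # SpeechLM

reduction = relative_reduction(baseline_wer, speechlm_wer)
print(f"Relative WER reduction: {reduction:.1%}")  # prints "Relative WER reduction: 16.2%"
```

The result, roughly 16.2%, is consistent with the 16% figure reported above. The absolute reduction is 1.1 WER points.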

* 14 pages

View paper on OpenReview
