Shang-Bao Luo

Speech Recognition by Simply Fine-tuning BERT

Jan 30, 2021

Wen-Chin Huang, Chia-Hua Wu, Shang-Bao Luo, Kuan-Yu Chen, Hsin-Min Wang, Tomoki Toda

Abstract: We propose a simple method for automatic speech recognition (ASR) by fine-tuning BERT, a language model (LM) that is trained on large-scale unlabeled text data and can generate rich contextual representations. Our assumption is that, given a history context sequence, a powerful LM can narrow the range of possible choices, and the speech signal can be used as a simple clue. Hence, in contrast to conventional ASR systems that train a powerful acoustic model (AM) from scratch, we believe that speech recognition is possible by simply fine-tuning a BERT model. As an initial study, we demonstrate the effectiveness of the proposed idea on the AISHELL dataset and show that stacking a very simple AM on top of BERT can yield reasonable performance.

* Accepted to ICASSP 2021
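
The abstract's core idea, that a language model's context representation narrows the candidate tokens while the speech signal acts as a simple clue, can be illustrated with a toy sketch. This is not the paper's architecture: the random vector standing in for a BERT hidden state, the single linear projection standing in for the "very simple AM", the additive fusion, and every name and dimension below are assumptions made for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fuse_next_token(lm_hidden, acoustic_feat, w_am, w_out):
    """Return a next-token distribution from LM context plus an acoustic clue.

    Hypothetical fusion: project the acoustic feature into the LM hidden
    space with one linear layer, add it to the LM hidden state, then map
    the sum to vocabulary scores.
    """
    am_hidden = matvec(w_am, acoustic_feat)              # assumed linear "AM"
    fused = [h + a for h, a in zip(lm_hidden, am_hidden)]  # assumed additive fusion
    return softmax(matvec(w_out, fused))


# Toy dimensions and random stand-ins (all hypothetical).
rng = random.Random(0)
hidden_dim, feat_dim, vocab_size = 8, 5, 10
lm_hidden = [rng.gauss(0, 1) for _ in range(hidden_dim)]   # stand-in for a BERT state
acoustic_feat = [rng.gauss(0, 1) for _ in range(feat_dim)]  # stand-in for a speech frame
w_am = [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(hidden_dim)]
w_out = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(vocab_size)]

probs = fuse_next_token(lm_hidden, acoustic_feat, w_am, w_out)
predicted_token_id = max(range(vocab_size), key=lambda i: probs[i])
```

In a real system the hidden state would come from a fine-tuned BERT model conditioned on the previously recognized tokens, and the weights would be learned rather than sampled at random.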

An Audio-enriched BERT-based Framework for Spoken Multiple-choice Question Answering

May 25, 2020

Chia-Chih Kuo, Shang-Bao Luo, Kuan-Yu Chen

Abstract: In a spoken multiple-choice question answering (SMCQA) task, given a passage, a question, and multiple choices, all in the form of speech, the machine needs to pick the correct choice to answer the question. While the audio could contain useful cues for SMCQA, usually only the auto-transcribed text is utilized in system development. Thanks to large-scale pre-trained language representation models, such as the bidirectional encoder representations from transformers (BERT), systems with only auto-transcribed text can still achieve a certain level of performance. However, previous studies have shown that acoustic-level statistics can offset text inaccuracies caused by automatic speech recognition systems or representation inadequacy lurking in word embedding generators, thereby making the SMCQA system robust. Along this line of research, this study concentrates on designing a BERT-based SMCQA framework that not only inherits the advantages of the contextualized language representations learned by BERT, but also integrates the complementary acoustic-level information distilled from audio with the text-level information. Consequently, an audio-enriched BERT-based SMCQA framework is proposed. A series of experiments demonstrates remarkable improvements in accuracy over selected baselines and SOTA systems on a published Chinese SMCQA dataset.
