Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

F. de Wet

Semi-supervised acoustic modelling for five-lingual code-switched ASR using automatically-segmented soap opera speech

Apr 08, 2020

N. Wilkinson, A. Biswas, E. Yılmaz, F. de Wet, E. van der Westhuizen, T. R. Niesler

Figure 1 for Semi-supervised acoustic modelling for five-lingual code-switched ASR using automatically-segmented soap opera speech

Figure 2 for Semi-supervised acoustic modelling for five-lingual code-switched ASR using automatically-segmented soap opera speech

Figure 3 for Semi-supervised acoustic modelling for five-lingual code-switched ASR using automatically-segmented soap opera speech

Figure 4 for Semi-supervised acoustic modelling for five-lingual code-switched ASR using automatically-segmented soap opera speech

Abstract:This paper considers the impact of automatic segmentation on the fully-automatic, semi-supervised training of automatic speech recognition (ASR) systems for five-lingual code-switched (CS) speech. Four automatic segmentation techniques were evaluated in terms of the recognition performance of an ASR system trained on the resulting segments in a semi-supervised manner. The system's output was compared with the recognition rates achieved by a semi-supervised system trained on manually assigned segments. Three of the automatic techniques use a newly proposed convolutional neural network (CNN) model for framewise classification, and include a novel form of HMM smoothing of the CNN outputs. Automatic segmentation was applied in combination with automatic speaker diarization. The best-performing segmentation technique was also tested without speaker diarization. An evaluation based on 248 unsegmented soap opera episodes indicated that voice activity detection (VAD) based on a CNN followed by Gaussian mixture modelhidden Markov model smoothing (CNN-GMM-HMM) yields the best ASR performance. The semi-supervised system trained with the resulting segments achieved an overall WER improvement of 1.1% absolute over the system trained with manually created segments. Furthermore, we found that system performance improved even further when the automatic segmentation was used in conjunction with speaker diarization.

* SLTU 2020. arXiv admin note: text overlap with arXiv:2003.03135

Via

Access Paper or Ask Questions

Semi-supervised acoustic and language model training for English-isiZulu code-switched speech recognition

Apr 05, 2020

A. Biswas, F. de Wet, E. van der Westhuizen, T. R. Niesler

Figure 1 for Semi-supervised acoustic and language model training for English-isiZulu code-switched speech recognition

Figure 2 for Semi-supervised acoustic and language model training for English-isiZulu code-switched speech recognition

Figure 3 for Semi-supervised acoustic and language model training for English-isiZulu code-switched speech recognition

Figure 4 for Semi-supervised acoustic and language model training for English-isiZulu code-switched speech recognition

Abstract:We present an analysis of semi-supervised acoustic and language model training for English-isiZulu code-switched ASR using soap opera speech. Approximately 11 hours of untranscribed multilingual speech was transcribed automatically using four bilingual code-switching transcription systems operating in English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho. These transcriptions were incorporated into the acoustic and language model training sets. Results showed that the TDNN-F acoustic models benefit from the additional semi-supervised data and that even better performance could be achieved by including additional CNN layers. Using these CNN-TDNN-F acoustic models, a first iteration of semi-supervised training achieved an absolute mixed-language WER reduction of 3.4%, and a further 2.2% after a second iteration. Although the languages in the untranscribed data were unknown, the best results were obtained when all automatically transcribed data was used for training and not just the utterances classified as English-isiZulu. Despite reducing perplexity, the semi-supervised language model was not able to improve the ASR performance.

* 4th Code-Switch workshop, France

Via

Access Paper or Ask Questions