Astik Biswas

Multilingual Bottleneck Features for Improving ASR Performance of Code-Switched Speech in Under-Resourced Languages

Oct 31, 2020

Trideba Padhi, Astik Biswas, Febe De Wet, Ewald van der Westhuizen, Thomas Niesler

Abstract: In this work, we explore the benefits of using multilingual bottleneck features (mBNF) in acoustic modelling for the automatic speech recognition of code-switched (CS) speech in African languages. The unavailability of annotated corpora in the languages of interest has always been a primary challenge when developing speech recognition systems for this severely under-resourced type of speech. Hence, it is worthwhile to investigate the potential of using speech corpora available for other better-resourced languages to improve speech recognition performance. To achieve this, we train an mBNF extractor using nine Southern Bantu languages that form part of the freely available multilingual NCHLT corpus. We append these mBNFs to the existing MFCCs, pitch features and i-vectors to train acoustic models for automatic speech recognition (ASR) in the target code-switched languages. Our results show that the inclusion of the mBNFs leads to clear performance improvements over a baseline trained without the mBNFs for code-switched English-isiZulu, English-isiXhosa, English-Sesotho and English-Setswana speech.

* http://festvox.org/cedar/WSTCSMC2020.pdf
* In Proceedings of The First Workshop on Speech Technologies for Code-Switching in Multilingual Communities
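
The feature-appending step described in the abstract above can be illustrated with a minimal sketch. It assumes that MFCCs, pitch features and mBNFs are frame-level streams of equal length and that a single i-vector per utterance is repeated on every frame. The function name `append_features` and all dimensions are hypothetical; the abstract does not state them.

```python
from typing import List, Sequence

Frame = List[float]


def append_features(
    mfcc: Sequence[Frame],
    pitch: Sequence[Frame],
    mbnf: Sequence[Frame],
    ivector: Sequence[float],
) -> List[Frame]:
    """Concatenate per-frame feature streams into one input vector per frame."""
    if not (len(mfcc) == len(pitch) == len(mbnf)):
        raise ValueError("frame-level streams must cover the same number of frames")
    # Assumption: one i-vector per utterance, repeated on every frame.
    return [
        list(m) + list(p) + list(b) + list(ivector)
        for m, p, b in zip(mfcc, pitch, mbnf)
    ]


# Toy utterance with 2 frames; all dimensions are illustrative only.
mfcc = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
pitch = [[1.0], [1.1]]
mbnf = [[0.7, 0.8], [0.9, 1.0]]
ivector = [0.01, 0.02]

features = append_features(mfcc, pitch, mbnf, ivector)
print(len(features), len(features[0]))
```

Each output frame is simply the concatenation of the four inputs, so the acoustic model's input dimension is the sum of the individual feature dimensions.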

Semi-supervised Development of ASR Systems for Multilingual Code-switched Speech in Under-resourced Languages

Mar 06, 2020

Astik Biswas, Emre Yılmaz, Febe de Wet, Ewald van der Westhuizen, Thomas Niesler

Abstract: This paper reports on the semi-supervised development of acoustic and language models for under-resourced, code-switched speech in five South African languages. Two approaches are considered. The first constructs four separate bilingual automatic speech recognisers (ASRs) corresponding to four different language pairs between which speakers switch frequently. The second uses a single, unified, five-lingual ASR system that represents all the languages (English, isiZulu, isiXhosa, Setswana and Sesotho). We evaluate the effectiveness of these two approaches when used to add data to our extremely sparse training sets. Results indicate that batch-wise semi-supervised training yields better results than a non-batch-wise approach. Furthermore, while the separate bilingual systems achieved better recognition performance than the unified system, they benefited more from pseudo-labels generated by the five-lingual system than from those generated by the bilingual systems.

* Conference

Improved low-resource Somali speech recognition by semi-supervised acoustic and language model training

Jul 06, 2019

Astik Biswas, Raghav Menon, Ewald van der Westhuizen, Thomas Niesler

Abstract: We present improvements in automatic speech recognition (ASR) for Somali, a currently extremely under-resourced language. This forms part of a continuing United Nations (UN) effort to employ ASR-based keyword spotting systems to support humanitarian relief programmes in rural Africa. Using just 1.57 hours of annotated speech data as a seed corpus, we increase the pool of training data by applying semi-supervised training to 17.55 hours of untranscribed speech. We make use of factorised time-delay neural networks (TDNN-F) for acoustic modelling, since these have recently been shown to be effective in resource-scarce situations. Three semi-supervised training passes were performed, where the decoded output from each pass was used for acoustic model training in the subsequent pass. The automatic transcriptions from the best performing pass were used for language model augmentation. To ensure the quality of automatic transcriptions, decoder confidence is used as a threshold. The acoustic and language models obtained from the semi-supervised approach show significant improvement in terms of WER and perplexity compared to the baseline. Incorporating the automatically generated transcriptions yields a 6.55% improvement in language model perplexity. The use of 17.55 hours of Somali acoustic data in semi-supervised training yields a relative improvement of 7.74% over the baseline.

* 5 pages, 6 Tables, 3 figures, 22 references (Accepted at Interspeech 2019)
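
The confidence-thresholding step and the relative-improvement figures mentioned in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the `Hypothesis` record, the function names, the threshold value and all scores are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    utt_id: str        # utterance identifier (hypothetical)
    text: str          # automatic transcription from the decoder
    confidence: float  # decoder confidence in [0, 1] (assumed scale)


def select_pseudo_labels(hyps: List[Hypothesis], threshold: float) -> List[Hypothesis]:
    """Keep only automatic transcriptions whose confidence meets the threshold."""
    return [h for h in hyps if h.confidence >= threshold]


def relative_improvement(baseline: float, new: float) -> float:
    """Relative reduction, in percent, of a lower-is-better metric (e.g. WER)."""
    return 100.0 * (baseline - new) / baseline


# Toy decoder output; the values are made up for illustration.
decoded = [
    Hypothesis("utt1", "word_a word_b", 0.92),
    Hypothesis("utt2", "word_c", 0.41),
    Hypothesis("utt3", "word_d word_e", 0.78),
]

kept = select_pseudo_labels(decoded, threshold=0.7)  # hypothetical threshold
print([h.utt_id for h in kept])
print(relative_improvement(50.0, 45.0))
```

In a multi-pass setup like the one the abstract describes, the retained utterances from one pass would be added to the training pool for the next pass.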

Semi-supervised acoustic model training for five-lingual code-switched ASR

Jun 20, 2019

Astik Biswas, Emre Yılmaz, Febe de Wet, Ewald van der Westhuizen, Thomas Niesler

Abstract: This paper presents recent progress in the acoustic modelling of under-resourced code-switched (CS) speech in multiple South African languages. We consider two approaches. The first constructs separate bilingual acoustic models corresponding to language pairs (English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho). The second constructs a single unified five-lingual acoustic model representing all the languages (English, isiZulu, isiXhosa, Setswana and Sesotho). For these two approaches we consider the effectiveness of semi-supervised training to increase the size of the very sparse acoustic training sets. Using approximately 11 hours of untranscribed speech, we show that both approaches benefit from semi-supervised training. The bilingual TDNN-F acoustic models also benefit from the addition of CNN layers (CNN-TDNN-F), while the five-lingual system does not show any significant improvement. Furthermore, because English is common to all language pairs in our data, it dominates when training a unified language model, leading to improved English ASR performance at the expense of the other languages. Nevertheless, the five-lingual model offers flexibility because it can process more than two languages simultaneously, and is therefore an attractive option as an automatic transcription system in a semi-supervised training pipeline.

* Accepted for publication at Interspeech 2019

Building a Unified Code-Switching ASR System for South African Languages

Jul 28, 2018

Emre Yılmaz, Astik Biswas, Ewald van der Westhuizen, Febe de Wet, Thomas Niesler

Abstract: We present our first efforts towards building a single multilingual automatic speech recognition (ASR) system that can process code-switching (CS) speech in five languages spoken within the same population. This contrasts with related prior work which focuses on the recognition of CS speech in bilingual scenarios. Recently, we have compiled a small five-language corpus of South African soap opera speech which contains examples of CS between 5 languages occurring in various contexts such as using English as the matrix language and switching to other indigenous languages. The ASR system presented in this work is trained on 4 corpora containing English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho CS speech. The interpolation of multiple language models trained on these language pairs enables the ASR system to hypothesize mixed word sequences from these 5 languages. We evaluate various state-of-the-art acoustic models trained on this 5-lingual training data and report ASR accuracy and language recognition performance on the development and test sets of the South African multilingual soap opera corpus.

* Accepted for publication at Interspeech 2018
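
The language model interpolation mentioned in the abstract above can be illustrated with a toy sketch. It assumes simple linear interpolation of unigram distributions; the abstract does not specify the interpolation scheme, model order or weights, so the function `interpolate`, the vocabulary tokens and the weights are all hypothetical.

```python
from typing import Dict, List

Model = Dict[str, float]  # toy unigram model: word -> probability


def interpolate(models: List[Model], weights: List[float]) -> Model:
    """Linearly interpolate word probabilities: P(w) = sum_i lambda_i * P_i(w)."""
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    vocab = set().union(*models)  # union of all vocabularies
    # A word missing from a model contributes probability 0 for that model.
    return {
        w: sum(lam * m.get(w, 0.0) for lam, m in zip(weights, models))
        for w in vocab
    }


# Two toy bilingual models sharing an English token (placeholder words).
lm_pair_a = {"english_word": 0.6, "lang_a_word": 0.4}
lm_pair_b = {"english_word": 0.5, "lang_b_word": 0.5}

mixed = interpolate([lm_pair_a, lm_pair_b], [0.5, 0.5])
print(sorted(mixed.items()))
```

The interpolated model assigns non-zero probability to words from every component model, which is what allows mixed word sequences across language pairs to be hypothesised.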

Automatic Speech Recognition for Humanitarian Applications in Somali

Jul 23, 2018

Raghav Menon, Astik Biswas, Armin Saeb, John Quinn, Thomas Niesler

Abstract: We present our first efforts in building an automatic speech recognition system for Somali, an under-resourced language, using 1.57 hours of annotated speech for acoustic model training. The system is part of an ongoing effort by the United Nations (UN) to implement keyword spotting systems supporting humanitarian relief programmes in parts of Africa where languages are severely under-resourced. We evaluate several types of acoustic model, including recent neural architectures. Language model data augmentation using a combination of recurrent neural networks (RNNs) and long short-term memory neural networks (LSTMs), as well as the perturbation of acoustic data, are also considered. We find that both types of data augmentation are beneficial to performance, with our best system using a combination of convolutional neural networks (CNNs), time-delay neural networks (TDNNs) and bi-directional long short-term memory networks (BLSTMs) to achieve a word error rate of 53.75%.

* 5 pages, 3 figures, 5 tables; accepted at SLTU 2018

Spoken Language Identification Using Hybrid Feature Extraction Methods

Mar 29, 2010

Pawan Kumar, Astik Biswas, A. N. Mishra, Mahesh Chandra

Abstract: This paper introduces and motivates the use of a hybrid robust feature extraction technique for spoken language identification (LID) systems. Speech recognizers use a parametric form of the signal to capture the most important distinguishing features of the speech signal for the recognition task. In this paper, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction coefficients (PLP), along with two hybrid features, are used for language identification. The two hybrid features, Bark frequency cepstral coefficients (BFCC) and revised perceptual linear prediction coefficients (RPLP), were obtained from a combination of MFCC and PLP. Two different classifiers, vector quantization (VQ) with dynamic time warping (DTW) and a Gaussian mixture model (GMM), were used for classification. The experiments show a better identification rate using hybrid feature extraction techniques than with conventional feature extraction methods. BFCC performed better than MFCC with both classifiers. RPLP with GMM showed the best identification performance among all feature extraction techniques.

* Journal of Telecommunications, Volume 1, Issue 2, pp11-15, March 2010
