Arindrima Datta

LSTM Acoustic Models Learn to Align and Pronounce with Graphemes

Aug 13, 2020

Arindrima Datta, Guanlong Zhao, Bhuvana Ramabhadran, Eugene Weinstein

Abstract: Automated speech recognition coverage of the world's languages continues to expand. However, standard phoneme-based systems require handcrafted lexicons that are difficult and expensive to obtain. To address this problem, we propose a training methodology for a grapheme-based speech recognizer that can be trained in a purely data-driven fashion. Built with LSTM networks and trained with the cross-entropy loss, the grapheme-output acoustic models we study are also extremely practical for real-world applications, as they can be decoded with conventional ASR stack components such as language models and FST decoders, and they produce good-quality audio-to-grapheme alignments that are useful in many speech applications. We show that the grapheme models are competitive in WER with their phoneme-output counterparts when trained on large datasets, with the advantage that grapheme models do not require explicit linguistic knowledge as an input. We further compare the alignments generated by the phoneme and grapheme models to demonstrate the quality of the pronunciations they learn, using four Indian languages that vary linguistically in spoken and written forms.

* 5 pages, 4 figures. This work was done between summer 2018 and spring 2019.

Language-agnostic Multilingual Modeling

Apr 20, 2020

Arindrima Datta, Bhuvana Ramabhadran, Jesse Emond, Anjuli Kannan, Brian Roark

Abstract: Multilingual Automated Speech Recognition (ASR) systems allow for the joint training of data-rich and data-scarce languages in a single model. This enables data and parameter sharing across languages, which is especially beneficial for the data-scarce languages. However, most state-of-the-art multilingual models require the encoding of language information and therefore are not as flexible or scalable when expanding to newer languages. Language-independent multilingual models help to address this issue, and are also better suited for multicultural societies where several languages are frequently used together (but often rendered with different writing systems). In this paper, we propose a new approach to building a language-agnostic multilingual ASR system which transforms all languages to one writing system through a many-to-one transliteration transducer. Thus, similar sounding acoustics are mapped to a single, canonical target sequence of graphemes, effectively separating the modeling and rendering problems. We show with four Indic languages, namely, Hindi, Bengali, Tamil and Kannada, that the language-agnostic multilingual model achieves up to 10% relative reduction in Word Error Rate (WER) over a language-dependent multilingual model.
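
For readers unfamiliar with the metric, the "relative reduction in WER" quoted in the abstract can be illustrated with a minimal sketch. This is not the authors' scoring code: the function names `word_error_rate` and `relative_wer_reduction` are hypothetical, the example sentences and WER values are made up, and the sketch assumes whitespace tokenization with the standard word-level edit distance.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # One substitution and one deletion against a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Made-up numbers: a drop from 20.0% to 18.0% WER is a 10% relative reduction.
    print(relative_wer_reduction(20.0, 18.0))
```

Under these assumptions, a relative reduction compares the two error rates as a ratio, so a 10% relative reduction from a 20.0% baseline corresponds to 2.0 absolute points.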

Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model

Sep 11, 2019

Anjuli Kannan, Arindrima Datta, Tara N. Sainath, Eugene Weinstein, Bhuvana Ramabhadran, Yonghui Wu, Ankur Bapna, Zhifeng Chen, Seungji Lee

Abstract: Multilingual end-to-end (E2E) models have shown great promise in expanding automatic speech recognition (ASR) coverage of the world's languages. They have shown improvement over monolingual systems, and have simplified training and serving by eliminating language-specific acoustic, pronunciation, and language models. This work presents an E2E multilingual system which is equipped to operate in low-latency interactive applications, as well as handle a key challenge of real-world data: the imbalance in training data across languages. Using nine Indic languages, we compare a variety of techniques, and find that a combination of conditioning on a language vector and training language-specific adapter layers produces the best model. The resulting E2E multilingual model achieves a lower word error rate (WER) than both monolingual E2E models (eight of nine languages) and monolingual conventional systems (all nine languages).

* Accepted at Interspeech 2019
