Sebastian Stueker

Efficient Weight factorization for Multilingual Speech Recognition

May 07, 2021

Ngoc-Quan Pham, Tuan-Nam Nguyen, Sebastian Stueker, Alexander Waibel

Abstract: End-to-end multilingual speech recognition trains a single model on a composite speech corpus covering many languages, yielding a single neural network that transcribes all of them. Because each language in the training data has different characteristics, the shared network may struggle to optimize for all languages simultaneously. In this paper we propose a novel multilingual architecture that targets the core operation in neural networks: linear transformation functions. The key idea of the method is to assign fast weight matrices to each language by decomposing each weight matrix into a shared component and a language-dependent component. The latter is then factorized into vectors using rank-1 assumptions to reduce the number of parameters per language. This efficient factorization scheme proves effective in two multilingual settings with 7 and 27 languages, reducing the word error rates by 26% and 27% relative for two popular architectures, LSTM and Transformer, respectively.

* Submitted to Interspeech 2021
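
The decomposition described in the abstract can be illustrated with a minimal sketch. It assumes the simplest additive form, W_lang = W_shared + u v^T, with one pair of vectors (u, v) per language; the abstract does not state the exact formulation, so the function names and the additive form are illustrative assumptions rather than the authors' method.

```python
def factorized_linear(x, w_shared, u, v):
    """Compute y = (W_shared + u v^T) x without building the matrix u v^T.

    ASSUMPTION: additive rank-1 language component; names are hypothetical.
    x:        input vector of length d_in
    w_shared: shared weight matrix, d_out rows of length d_in
    u, v:     per-language vectors of length d_out and d_in
    """
    # Shared part: an ordinary matrix-vector product.
    shared = [sum(w * xi for w, xi in zip(row, x)) for row in w_shared]
    # Rank-1 part: (u v^T) x equals u scaled by the dot product v . x.
    vx = sum(vi * xi for vi, xi in zip(v, x))
    return [s + ui * vx for s, ui in zip(shared, u)]


def extra_params_per_language(d_out, d_in):
    """A rank-1 component stores d_out + d_in numbers instead of d_out * d_in."""
    return d_out + d_in


# Hypothetical per-language vectors for a 2x2 layer.
language_vectors = {"de": ([1, 2], [3, 4]), "fr": ([0, 0], [0, 0])}
w_shared = [[1, 0], [0, 1]]
u, v = language_vectors["de"]
print(factorized_linear([1, 1], w_shared, u, v))  # [8, 15]
```

For a 1024 x 1024 layer, this sketch adds 2048 parameters per language rather than about one million, which is the saving the rank-1 assumption is meant to provide.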

Super-Human Performance in Online Low-latency Recognition of Conversational Speech

Oct 22, 2020

Thai-Son Nguyen, Sebastian Stueker, Alex Waibel

Abstract: Achieving super-human performance in recognizing human speech has been a goal for several decades, as researchers have worked on increasingly challenging tasks. In the 1990s it was discovered that conversational speech between two humans is considerably more difficult than read speech, as hesitations, disfluencies, false starts and sloppy articulation complicate acoustic processing and require robust joint handling of acoustic, lexical and language context. Early attempts with statistical models could only reach error rates over 50%, far from human performance (WER of around 5.5%). Neural hybrid models and recent attention-based encoder-decoder models have considerably improved performance, as such contexts can now be learned in an integral fashion. However, processing such contexts requires the entire utterance to be presented and thus introduces unwanted delays before a recognition result can be output. In this paper, we address performance as well as latency. We present results for a system that achieves super-human performance (a WER of 5.0% on the Switchboard conversational benchmark) at a word-based latency of only 1 second behind a speaker's speech. The system uses multiple attention-based encoder-decoder networks integrated within a novel low-latency incremental inference approach.
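
The word error rate (WER) figures quoted above follow the standard definition: the minimum number of word substitutions, deletions and insertions needed to turn the reference into the hypothesis, divided by the number of reference words. A minimal sketch follows; it assumes plain whitespace tokenization, whereas benchmark scoring tools typically apply additional text normalization that is not shown here.

```python
def word_error_rate(reference, hypothesis):
    """Return WER = (substitutions + deletions + insertions) / reference length.

    ASSUMPTION: words are split on whitespace with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words (Levenshtein distance, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```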

Relative Positional Encoding for Speech Recognition and Direct Translation

May 20, 2020

Ngoc-Quan Pham, Thanh-Le Ha, Tuan-Nam Nguyen, Thai-Son Nguyen, Elizabeth Salesky, Sebastian Stueker, Jan Niehues, Alexander Waibel

Abstract: Transformer models are powerful sequence-to-sequence architectures that are capable of directly mapping speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text modeling, and is thus less suited to acoustic inputs. In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition is the relative distance between input states in the self-attention network. As a result, the network can better adapt to the variable distributions present in speech data. Our experiments show that our resulting model achieves the best recognition result on the Switchboard benchmark in the non-augmentation condition, and the best published result on the MuST-C speech translation benchmark. We also show that this model is able to better utilize synthetic data than the Transformer, and adapts better to variable sentence segmentation quality for speech translation.

* Submitted to Interspeech 2020
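
The idea of letting self-attention depend on the relative distance between input states can be sketched as below. This is a deliberately simplified variant that adds one scalar bias per clipped distance to the attention scores; it is an assumption chosen for brevity, not the encoding scheme used in the paper, and the names `rel_bias` and `max_dist` are hypothetical.

```python
import math


def relative_attention_weights(queries, keys, rel_bias, max_dist):
    """Softmax attention weights with a bias indexed by relative distance.

    ASSUMPTION: a scalar-bias simplification, not the paper's formulation.
    queries, keys: lists of equal-length vectors
    rel_bias:      dict mapping a clipped distance (j - i) to a scalar
    max_dist:      distances are clipped to [-max_dist, max_dist]
    """
    dim = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            # Content term: scaled dot product, as in standard attention.
            content = sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            # Position term: depends only on how far apart i and j are.
            dist = max(-max_dist, min(max_dist, j - i))
            scores.append(content + rel_bias[dist])
        # Numerically stable softmax over the keys.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# With identical (zero) content, the weights depend on distance alone.
states = [[0.0, 0.0]] * 3
bias = {d: -abs(d) for d in range(-2, 3)}
for row in relative_attention_weights(states, states, bias, max_dist=2):
    print([round(w, 3) for w in row])
```

Because the position term uses only j - i, shifting the whole input in time leaves the attention pattern unchanged, which is one reason relative schemes are attractive for variable-length speech input.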

High Performance Sequence-to-Sequence Model for Streaming Speech Recognition

Mar 22, 2020

Thai-Son Nguyen, Ngoc-Quan Pham, Sebastian Stueker, Alex Waibel

Abstract: Recently, sequence-to-sequence models have started to achieve state-of-the-art performance on standard speech recognition tasks when processing audio data in batch mode, i.e., when the complete audio data is available at the start of processing. However, when it comes to performing run-on recognition on an input stream of audio data while producing recognition results in real time and with a low word-based latency, these models face several challenges. For many techniques, the whole audio sequence to be decoded needs to be available at the start of processing, e.g., for the attention mechanism or for the bidirectional LSTM (BLSTM). In this paper we propose several techniques to mitigate these problems. We introduce an additional loss function controlling the uncertainty of the attention mechanism, a modified beam search identifying partial, stable hypotheses, ways of working with BLSTM in the encoder, and the use of chunked BLSTM. Our experiments show that with the right combination of these techniques it is possible to perform run-on speech recognition with a low word-based latency without sacrificing performance in terms of word error rate.

Low Latency ASR for Simultaneous Speech Translation

Mar 22, 2020

Thai Son Nguyen, Jan Niehues, Eunah Cho, Thanh-Le Ha, Kevin Kilgour, Markus Muller, Matthias Sperber, Sebastian Stueker, Alex Waibel

Abstract: User studies have shown that reducing the latency of our simultaneous lecture translation system should be the most important goal. We have therefore worked on several techniques for reducing the latency of both components, the automatic speech recognition module and the speech translation module. Since the commonly used commitment latency is not appropriate in our case of continuous stream decoding, we focused on word latency. We used it to analyze the performance of our current system and to identify opportunities for improvement. To minimize the latency, we combined run-on decoding with a technique for identifying stable partial hypotheses during stream decoding and a protocol for dynamic output update that allows the most recent parts of the transcription to be revised. This combination reduces the latency at word level, where the words are final and will never be updated again, from 18.1 s to 1.1 s without sacrificing performance in terms of word error rate.

Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation

Oct 29, 2019

Thai-Son Nguyen, Sebastian Stueker, Jan Niehues, Alex Waibel

Abstract: Sequence-to-sequence (S2S) models have recently started to show state-of-the-art performance for automatic speech recognition (ASR). With these large and deep models, overfitting remains the largest problem, outweighing performance improvements that can be obtained from better architectures. One solution to the overfitting problem is to increase the amount of available training data and the variety it exhibits with the help of data augmentation. In this paper we examine the influence of three data augmentation methods on the performance of two S2S model architectures. One of the data augmentation methods comes from the literature, while the other two are our own development: a time perturbation in the frequency domain and sub-sequence sampling. Our experiments on Switchboard and Fisher data show state-of-the-art performance for S2S models that are trained solely on the speech training data and do not use additional text data.

Learning Shared Encoding Representation for End-to-End Speech Recognition Models

Mar 31, 2019

Thai-Son Nguyen, Sebastian Stueker, Alex Waibel

Abstract: In this work, we learn a shared encoding representation for a multi-task neural network model optimized with connectionist temporal classification (CTC) and conventional framewise cross-entropy training criteria. Our experiments show that the multi-task training not only tackles the complexity of optimizing CTC models such as acoustic-to-word but also results in significant improvement compared to the plain-task training with an optimal setup. Furthermore, we propose to use the encoding representation learned by the multi-task network to initialize the encoder of attention-based models. In this way, we train a deep attention-based end-to-end model with a 10-layer long short-term memory (LSTM) encoder, which achieves word error rates of 12.2% and 22.6% on the Switchboard and CallHome subsets of the Hub5 2000 evaluation.

* arXiv admin note: substantial text overlap with arXiv:1902.01951

Using multi-task learning to improve the performance of acoustic-to-word and conventional hybrid models

Feb 02, 2019

Thai-Son Nguyen, Sebastian Stueker, Alex Waibel

Abstract: Acoustic-to-word (A2W) models that allow direct mapping from acoustic signals to word sequences are an appealing approach to end-to-end automatic speech recognition due to their simplicity. However, prior work has shown that modelling A2W typically encounters data sparsity issues that prevent training such a model directly. So far, pre-training initialization is the only approach proposed to deal with this issue. In this work, we propose to build a shared neural network and optimize A2W and conventional hybrid models in a multi-task manner. Our results show that training an A2W model is much more stable with our multi-task model without pre-training initialization, and results in a significant improvement compared to a baseline model. Experiments also reveal that the performance of a hybrid acoustic model can be further improved when it is jointly trained with a sequence-level optimization criterion such as acoustic-to-word.

Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop

Feb 14, 2018

Odette Scharenborg, Laurent Besacier, Alan Black, Mark Hasegawa-Johnson, Florian Metze, Graham Neubig, Sebastian Stueker, Pierre Godard, Markus Mueller, Lucas Ondel (+9 more)

Abstract: We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding the discovery of linguistic units (subwords and words) in a language without orthography. We study the replacement of orthographic transcriptions by images and/or translated text in a well-resourced language to help unsupervised discovery from raw speech.

* Accepted to ICASSP 2018
