Albert Zeyer

Dynamic Acoustic Model Architecture Optimization in Training for ASR

Jun 16, 2025

Jingjing Xu, Zijian Yang, Albert Zeyer, Eugen Beck, Ralf Schlueter, Hermann Ney

Abstract: Architecture design is inherently complex. Existing approaches rely on either handcrafted rules, which demand extensive empirical expertise, or automated methods like neural architecture search, which are computationally intensive. In this paper, we introduce DMAO, an architecture optimization framework that employs a grow-and-drop strategy to automatically reallocate parameters during training. This reallocation shifts resources from less-utilized areas to those parts of the model where they are most beneficial. Notably, DMAO introduces only negligible training overhead at a given model complexity. We evaluate DMAO through experiments with CTC on the LibriSpeech, TED-LIUM-v2 and Switchboard datasets. The results show that, using the same amount of training resources, our proposed DMAO consistently improves WER by up to 6% relative across various architectures, model sizes, and datasets. Furthermore, we analyze the pattern of parameter redistribution and uncover insightful findings.

The Conformer Encoder May Reverse the Time Dimension

Oct 01, 2024

Robin Schmitt, Albert Zeyer, Mohammad Zeineldeen, Ralf Schlüter, Hermann Ney

Abstract: We sometimes observe monotonically decreasing cross-attention weights in our Conformer-based global attention-based encoder-decoder (AED) models. Further investigation shows that the Conformer encoder internally reverses the sequence in the time dimension. We analyze the initial behavior of the decoder cross-attention mechanism and find that it encourages the Conformer encoder self-attention to build a connection between the initial frames and all other informative frames. Furthermore, we show that, at some point in training, the self-attention module of the Conformer starts dominating the output over the preceding feed-forward module, which then only allows the reversed information to pass through. We propose several methods and ideas for avoiding this flipping. Additionally, we investigate a novel method to obtain label-frame-position alignments by using the gradients of the label log probabilities w.r.t. the encoder input frames.

* Submitted to ICASSP 2025

Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition

Sep 15, 2023

Mohammad Zeineldeen, Albert Zeyer, Ralf Schlüter, Hermann Ney

Abstract: We study a streamable attention-based encoder-decoder model in which either the decoder, or both the encoder and decoder, operate on pre-defined, fixed-size windows called chunks. A special end-of-chunk (EOC) symbol advances from one chunk to the next chunk, effectively replacing the conventional end-of-sequence symbol. This modification, while minor, situates our model as equivalent to a transducer model that operates on chunks instead of frames, where EOC corresponds to the blank symbol. We further explore the remaining differences between a standard transducer and our model. Additionally, we examine relevant aspects such as long-form speech generalization, beam size, and length normalization. Through experiments on Librispeech and TED-LIUM-v2, and by concatenating consecutive sequences for long-form trials, we find that our streamable model maintains competitive performance compared to the non-streamable variant and generalizes very well to long-form speech.
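The chunking idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the symbol name `<eoc>`, the helper names, and the plain-list representation of frames and labels are all assumptions made for illustration. The first helper cuts a frame sequence into fixed-size chunks, and the second groups a decoder output into the labels emitted within each chunk, using EOC as the marker that advances to the next chunk.

```python
from typing import List, Sequence

EOC = "<eoc>"  # hypothetical name for the end-of-chunk symbol


def split_into_chunks(frames: Sequence, chunk_size: int) -> List[list]:
    """Split frames into consecutive fixed-size chunks (the last may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(frames[i:i + chunk_size]) for i in range(0, len(frames), chunk_size)]


def labels_per_chunk(output: Sequence[str]) -> List[List[str]]:
    """Group an output sequence into per-chunk label lists.

    Each EOC closes the current chunk, so a chunk without labels
    shows up as an empty list.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    for symbol in output:
        if symbol == EOC:
            chunks.append(current)
            current = []
        else:
            current.append(symbol)
    if current:
        raise ValueError("output must end with an EOC symbol")
    return chunks


if __name__ == "__main__":
    print(split_into_chunks(list(range(7)), chunk_size=3))
    print(labels_per_chunk(["a", EOC, EOC, "b", "c", EOC]))
```

In this toy view, the number of EOC symbols in the output equals the number of chunks, which mirrors the role the blank symbol plays per frame in a transducer.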

* Submitted to ICASSP 2024

Monotonic segmental attention for automatic speech recognition

Oct 26, 2022

Albert Zeyer, Robin Schmitt, Wei Zhou, Ralf Schlüter, Hermann Ney

Abstract: We introduce a novel segmental-attention model for automatic speech recognition. We restrict the decoder attention to segments to avoid quadratic runtime of global attention, better generalize to long sequences, and eventually enable streaming. We directly compare global-attention and different segmental-attention modeling variants. We develop and compare two separate time-synchronous decoders, one specifically taking the segmental nature into account, yielding further improvements. Using time-synchronous decoding for segmental models is novel and a step towards streaming applications. Our experiments show the importance of a length model to predict the segment boundaries. The final best segmental-attention model using segmental decoding performs better than global-attention, in contrast to other monotonic attention approaches in the literature. Further, we observe that the segmental model generalizes much better to long sequences of up to several minutes.

* accepted at SLT: https://slt2022.org/

Why does CTC result in peaky behavior?

Jun 03, 2021

Albert Zeyer, Ralf Schlüter, Hermann Ney

Abstract: The peaky behavior of CTC models is well known experimentally. However, an understanding of why peaky behavior occurs, and of whether it is a good property, is missing. We provide a formal analysis of the peaky behavior and gradient descent convergence properties of the CTC loss and related training criteria. Our analysis provides a deep understanding of why peaky behavior occurs and when it is suboptimal. On a simple example that should be trivial to learn for any model, we prove that a feed-forward neural network trained with CTC from uniform initialization converges towards peaky behavior with a 100% error rate. Our analysis further explains why CTC only works well together with the blank label. We further demonstrate that peaky behavior does not occur on other related losses including a label prior model, and that this improves convergence.
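For readers unfamiliar with the criterion being analyzed, the following is a brute-force sketch of the CTC sequence probability on a toy example. It is only meant to show what the loss sums over and does not reproduce the paper's analysis. The blank character `_`, the function names, and the dictionary-per-frame input format are assumptions for illustration. The probability of a target is the sum, over all frame-level alignments that collapse to it, of the product of per-frame label probabilities.

```python
from itertools import product

BLANK = "_"  # assumed character for the CTC blank label


def collapse(alignment):
    """Apply the CTC collapse: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for symbol in alignment:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return "".join(out)


def ctc_prob(frame_probs, target):
    """Sum the probability of every alignment that collapses to `target`.

    `frame_probs` is a list with one dict per frame, mapping label -> probability.
    The enumeration is exponential in the number of frames, so this is
    suitable for toy inputs only.
    """
    labels = sorted(frame_probs[0])
    total = 0.0
    for alignment in product(labels, repeat=len(frame_probs)):
        if collapse(alignment) == target:
            p = 1.0
            for t, symbol in enumerate(alignment):
                p *= frame_probs[t][symbol]
            total += p
    return total


if __name__ == "__main__":
    # Uniform output distribution over {blank, "a"} on three frames.
    uniform = [{BLANK: 0.5, "a": 0.5} for _ in range(3)]
    print(ctc_prob(uniform, "a"))
```

With three frames and two labels there are eight alignments, six of which collapse to "a" (all except `___` and `a_a`), so many different alignments, including ones that are mostly blank, contribute to the same target.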

Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept

Apr 13, 2021

Wei Zhou, Albert Zeyer, André Merboldt, Ralf Schlüter, Hermann Ney

Abstract: With the advent of direct models in automatic speech recognition (ASR), the formerly prevalent frame-wise acoustic modeling based on hidden Markov models (HMM) diversified into a number of modeling architectures like encoder-decoder attention models, transducer models and segmental models (direct HMM). While transducer models stay with a frame-level model definition, segmental models are defined directly on the level of label segments. While (soft-)attention-based models avoid explicit alignment, transducer and segmental approaches do model alignment internally, either by segment hypotheses or, more implicitly, by emitting so-called blank symbols. In this work, we prove that the widely used class of RNN-Transducer models and segmental models (direct HMM) are equivalent and therefore show equal modeling power. It is shown that blank probabilities translate into segment length probabilities and vice versa. In addition, we provide initial experiments investigating decoding and beam-pruning, comparing time-synchronous and label-/segment-synchronous search strategies and their properties using the same underlying model.
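The statement that blank probabilities translate into segment length probabilities can be made concrete with a small sketch. It assumes a simplified setting that is not taken from the paper: each frame emits either blank or exactly one label, and the blank probabilities for the frames following the previous label are given as a plain list. Under that assumption, a segment has length n when the model emits blank on the first n-1 frames and a label on frame n. The function name and input format are hypothetical.

```python
def segment_length_probs(blank_probs):
    """Turn per-frame blank probabilities into segment length probabilities.

    Returns (length_probs, remaining), where length_probs[n - 1] is the
    probability that the segment is exactly n frames long, and `remaining`
    is the probability that no label was emitted within the given frames.
    """
    length_probs = []
    all_blank_so_far = 1.0
    for p_blank in blank_probs:
        # Blank on every earlier frame, then a label on this frame.
        length_probs.append(all_blank_so_far * (1.0 - p_blank))
        all_blank_so_far *= p_blank
    return length_probs, all_blank_so_far


if __name__ == "__main__":
    probs, remaining = segment_length_probs([0.5, 0.5, 0.0])
    print(probs, remaining)
```

The length probabilities and the remaining mass always sum to one, which is the sense in which the two parameterizations carry the same information in this simplified setting.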

* submitted to Interspeech 2021

Investigating Methods to Improve Language Model Integration for Attention-based Encoder-Decoder ASR Models

Apr 12, 2021

Mohammad Zeineldeen, Aleksandr Glushko, Wilfried Michel, Albert Zeyer, Ralf Schlüter, Hermann Ney

Abstract: Attention-based encoder-decoder (AED) models learn an implicit internal language model (ILM) from the training transcriptions. The integration with an external LM trained on much more unpaired text usually leads to better performance. A Bayesian interpretation as in the hybrid autoregressive transducer (HAT) suggests dividing by the prior of the discriminative acoustic model, which corresponds to this implicit LM, similar to the hybrid hidden Markov model approach. The implicit LM cannot be calculated efficiently in general, and it is not yet clear which methods estimate it best. In this work, we compare different approaches from the literature and propose several novel methods to estimate the ILM directly from the AED model. Our proposed methods outperform all previous approaches. We also investigate other methods to suppress the ILM, mainly by decreasing the capacity of the AED model, limiting the label context, and also by training the AED model together with a pre-existing LM.

* submitted to Interspeech 2021

Librispeech Transducer Model with Internal Language Model Prior Correction

Apr 07, 2021

Albert Zeyer, André Merboldt, Wilfried Michel, Ralf Schlüter, Hermann Ney

Abstract: We present our transducer model on Librispeech. We study variants to include an external language model (LM) with shallow fusion and subtract an estimated internal LM. This is justified by a Bayesian interpretation where the transducer model prior is given by the estimated internal LM. The subtraction of the internal LM gives us over 14% relative improvement over normal shallow fusion. Our transducer has a separate probability distribution for the non-blank labels which allows for easier combination with the external LM, and easier estimation of the internal LM. We additionally take care of including the end-of-sentence (EOS) probability of the external LM in the last blank probability which further improves the performance. All our code and setups are published.
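The score combination described here, shallow fusion with an external LM plus subtraction of an estimated internal LM, is commonly written as a log-linear combination. The sketch below shows that general form only. The function names, the two scale parameters, and their example values are assumptions for illustration and are not taken from the paper, and the handling of blank and EOS probabilities mentioned in the abstract is not modeled.

```python
def fused_log_score(log_p_model, log_p_ext_lm, log_p_int_lm,
                    lm_scale=0.5, ilm_scale=0.25):
    """Combine log scores for one hypothesis (hypothetical scales).

    With ilm_scale=0 this reduces to plain shallow fusion;
    a positive ilm_scale subtracts the estimated internal LM score.
    """
    return log_p_model + lm_scale * log_p_ext_lm - ilm_scale * log_p_int_lm


def best_hypothesis(hypotheses, lm_scale=0.5, ilm_scale=0.25):
    """Pick the hypothesis with the highest fused score.

    `hypotheses` maps a hypothesis string to a tuple
    (log_p_model, log_p_ext_lm, log_p_int_lm).
    """
    return max(
        hypotheses,
        key=lambda hyp: fused_log_score(*hypotheses[hyp],
                                        lm_scale=lm_scale,
                                        ilm_scale=ilm_scale),
    )


if __name__ == "__main__":
    # Made-up log scores, for illustration only.
    candidates = {
        "hyp one": (-1.0, -2.0, -4.0),
        "hyp two": (-1.0, -2.0, -1.0),
    }
    print(best_hypothesis(candidates))
    print(best_hypothesis(candidates, ilm_scale=0.0))
```

In practice, both scales would be tuned on held-out data; the example values above carry no empirical meaning.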

* submitted to Interspeech 2021

A study of latent monotonic attention variants

Mar 30, 2021

Albert Zeyer, Ralf Schlüter, Hermann Ney

Abstract: End-to-end models reach state-of-the-art performance for speech recognition, but global soft attention is not monotonic, which might lead to convergence problems, instability, and poor generalisation; it also cannot be used for online streaming and is computationally inefficient. Monotonicity can potentially fix all of this. There are several ad-hoc solutions or heuristics to introduce monotonicity, but a principled introduction is rarely found in the literature so far. In this paper, we present a mathematically clean solution to introduce monotonicity, by introducing a new latent variable which represents the audio position or segment boundaries. We compare several monotonic latent models, such as a hard attention model, a local windowed soft attention model, and a segmental soft attention model, to our global soft attention baseline. We can show that our monotonic models perform as well as the global soft attention model. We perform our experiments on Switchboard 300h. We carefully outline the details of our training and release our code and configs.

Investigations on Phoneme-Based End-To-End Speech Recognition

May 19, 2020

Albert Zeyer, Wei Zhou, Thomas Ng, Ralf Schlüter, Hermann Ney

Abstract: Common end-to-end models like CTC or encoder-decoder-attention models use characters or subword units like BPE as the output labels. We systematically compare grapheme-based and phoneme-based output labels. These can be single phonemes without context (~40 labels), or multiple phonemes together in one output label, such that we get phoneme-based subwords. For this purpose, we introduce phoneme-based BPE labels. In further experiments, we extend the phoneme set by auxiliary units to be able to discriminate homophones (different words with the same pronunciation). This enables a very simple and efficient decoding algorithm. We perform the experiments on Switchboard 300h and show that our phoneme-based models are competitive with the grapheme-based models.

* submission to Interspeech 2020
