Abstract: This paper describes the submissions of team TalTech-IRIT-LIS to the DISPLACE 2024 challenge. Our team participated in the speaker diarization and language diarization tracks of the challenge. In the speaker diarization track, our best submission was an ensemble of systems based on the pyannote.audio speaker diarization pipeline utilizing powerset training and our recently proposed PixIT method that performs joint diarization and speech separation. We improve upon PixIT by using the separation outputs for speaker embedding extraction. Our ensemble achieved a diarization error rate of 27.1% on the evaluation dataset. In the language diarization track, we fine-tuned a pre-trained Wav2Vec2-BERT language embedding model on in-domain data, and clustered short segments using AHC and VBx, based on similarity scores from LDA/PLDA. This led to a language diarization error rate of 27.6% on the evaluation data. Both results were ranked first in their respective challenge tracks.
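For context, a minimal sketch of running an off-the-shelf pyannote.audio diarization pipeline is shown below; the checkpoint name and token are illustrative stand-ins, not the custom ensemble members trained for the challenge.

```python
# Minimal sketch: running an off-the-shelf pyannote.audio diarization pipeline.
# The public checkpoint and token below are placeholders; the challenge systems
# were custom-trained ensemble members (powerset and PixIT based).
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",   # public default, not the challenge model
    use_auth_token="HF_TOKEN",            # hypothetical HuggingFace access token
)

diarization = pipeline("meeting.wav")     # hypothetical input recording

for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:.1f}s - {segment.end:.1f}s: {speaker}")
```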
Abstract: This paper investigates the fine-tuning of end-to-end models for bidirectional Estonian-English and Estonian-Russian conversational speech-to-text translation. Due to the limited availability of speech translation data for Estonian, we created additional training data by web scraping and by synthesizing data from speech recognition datasets using machine translation. We evaluated three publicly available end-to-end models: Whisper, OWSM 3.1, and SeamlessM4T. Our results indicate that fine-tuning with synthetic data enhances translation accuracy by a large margin, with SeamlessM4T matching or surpassing cascaded speech translation systems that use state-of-the-art speech recognition and machine translation models.
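As a rough illustration of the data synthesis step, the sketch below machine-translates ASR transcripts to form synthetic speech translation pairs; the MT model name and the data layout are assumptions for illustration, not the pipeline used in the paper.

```python
# Sketch: building synthetic speech translation data by machine-translating
# the transcripts of an ASR dataset. The MT model and data layout are assumed
# for illustration; they are not necessarily the ones used in the paper.
from transformers import pipeline

mt = pipeline("translation", model="Helsinki-NLP/opus-mt-et-en")  # assumed public Estonian->English MT model

def synthesize_st_examples(asr_examples):
    """asr_examples: iterable of dicts with 'audio' and 'transcript' keys (assumed layout)."""
    for ex in asr_examples:
        translation = mt(ex["transcript"], max_length=256)[0]["translation_text"]
        yield {
            "audio": ex["audio"],             # original speech
            "source_text": ex["transcript"],  # original transcript
            "target_text": translation,       # synthetic translation target
        }
```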
Abstract: A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization. Mixture invariant training (MixIT) was proposed as an unsupervised alternative that uses real recordings, yet it struggles with overseparation and with adapting to long-form audio. We introduce PixIT, a joint approach that combines permutation invariant training (PIT) for speaker diarization (SD) and MixIT for SSep. At the small extra cost of requiring SD labels, it solves the problem of overseparation and allows stitching together locally separated sources by leveraging existing work on clustering-based neural SD. We measure the quality of the separated sources by applying automatic speech recognition (ASR) systems to them. PixIT boosts the performance of various ASR systems across two meeting corpora in terms of both speaker-attributed and utterance-based word error rates, while not requiring any fine-tuning.
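The sketch below illustrates the MixIT objective used for the separation branch, under simplifying assumptions: a mean-squared-error reconstruction loss stands in for the SNR-style loss typically used, and the binary assignment is found by exhaustive enumeration.

```python
# Simplified MixIT objective: the model separates a mixture of two real
# mixtures, and the loss searches over all binary assignments of estimated
# sources back to the two references. A mean-squared error stands in for the
# SNR-style loss normally used, so treat this purely as an illustration.
import itertools
import torch
import torch.nn.functional as F

def mixit_loss(est_sources, mix1, mix2):
    """est_sources: (num_sources, samples); mix1, mix2: (samples,) reference mixtures."""
    num_sources = est_sources.shape[0]
    best = None
    # enumerate all 2^num_sources ways to assign each source to mixture 1 or 2
    for assignment in itertools.product([0, 1], repeat=num_sources):
        mask = torch.tensor(assignment, dtype=est_sources.dtype).unsqueeze(1)
        rec1 = (est_sources * (1 - mask)).sum(dim=0)   # sources assigned to mix1
        rec2 = (est_sources * mask).sum(dim=0)         # sources assigned to mix2
        loss = F.mse_loss(rec1, mix1) + F.mse_loss(rec2, mix2)
        best = loss if best is None or loss < best else best
    return best
```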
Abstract: This paper describes the Tallinn University of Technology (TalTech) systems developed for the ASRU MADASR 2023 Challenge. The challenge focuses on automatic speech recognition of dialect-rich Indian languages with limited training audio and text data. TalTech participated in two tracks of the challenge: Track 1, which allowed using only the provided training data, and Track 3, which allowed using additional audio data. In both tracks, we relied on wav2vec2.0 models. Our methodology diverges from the traditional procedure of fine-tuning pretrained wav2vec2.0 models in two key ways: first, through aligned data augmentation to enhance the linguistic diversity of the training data, and second, through deep prefix tuning for dialect adaptation of the wav2vec2.0 models. In both tracks, our approach yielded significant improvements over the provided baselines, achieving the lowest word error rates across all participating teams.
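A minimal sketch of the conventional wav2vec2.0 CTC fine-tuning starting point is given below; the checkpoint name and vocabulary size are placeholders, and the paper's aligned data augmentation and deep prefix tuning are not reproduced here.

```python
# Minimal sketch of the standard wav2vec2.0 CTC fine-tuning starting point.
# The checkpoint and vocabulary size are placeholders; the paper's aligned data
# augmentation and deep prefix tuning are not shown here.
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",   # assumed multilingual pretrained checkpoint
    vocab_size=64,                    # placeholder size of the target-language character set
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()        # common practice with limited fine-tuning data
```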
Abstract: In this paper, we present a novel training method for speaker change detection models. Speaker change detection is often viewed as a binary sequence labelling problem. The main challenges with this approach are the vagueness of annotated change points, caused by the silences between speaker turns, and the imbalanced data resulting from the majority of frames not containing a speaker change. Conventional training methods tackle these issues by artificially increasing the proportion of positive labels in the training data. Instead, the proposed method uses an objective function which encourages the model to predict a single positive label within a specified collar. This is done by marginalizing over all possible label subsequences that have exactly one positive label within the collar. Experiments on English and Estonian datasets show large improvements over the conventional training method. Additionally, the model outputs have peaks concentrated on a single frame, removing the need for post-processing to find the exact predicted change point, which is particularly useful for streaming applications.
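A simplified sketch of the marginalized objective, restricted to a single annotated change point, is shown below; the full sequence-level formulation in the paper is more general, and the function and variable names here are illustrative.

```python
# Simplified sketch of the marginalized objective for one annotated change
# point: the loss rewards exactly one positive frame inside a collar around
# the change point and negatives everywhere else. Names and the single-point
# restriction are illustrative simplifications.
import torch

def collar_loss(frame_probs, change_frame, collar=5):
    """frame_probs: (T,) per-frame sigmoid outputs; change_frame: annotated index."""
    lo = max(0, change_frame - collar)
    hi = min(frame_probs.shape[0], change_frame + collar + 1)
    window = frame_probs[lo:hi]
    # P(exactly one positive in collar) = sum_i p_i * prod_{j != i} (1 - p_j)
    log_all_negative = torch.log1p(-window).sum()
    exactly_one = torch.sum(window / (1.0 - window)) * torch.exp(log_all_negative)
    # frames outside the collar should be labelled negative
    outside = torch.cat([frame_probs[:lo], frame_probs[hi:]])
    return -torch.log(exactly_one + 1e-8) - torch.log1p(-outside).sum()
```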
Abstract: This paper investigates different pretraining approaches to spoken language identification. The paper is based on our submission to the Oriental Language Recognition 2021 Challenge. We participated in two tracks of the challenge: constrained and unconstrained language recognition. For the constrained track, we first trained a Conformer-based encoder-decoder model for multilingual automatic speech recognition (ASR), using the provided training data that had transcripts available. The shared encoder of the multilingual ASR model was then finetuned for the language identification task. For the unconstrained task, we relied both on externally available pretrained models and on external data: the multilingual XLSR-53 wav2vec2.0 model was finetuned on the VoxLingua107 corpus for the language recognition task, and finally finetuned on the provided target-language training data, augmented with CommonVoice data. Our primary metric $C_{\rm avg}$ values on the Test set are 0.0079 for the constrained task and 0.0119 for the unconstrained task, which earned second place in both rankings. In post-evaluation experiments, we study the amount of target-language data needed for training an accurate backend model and the importance of multilingual pretraining data, and we compare different models as finetuning starting points.
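For the unconstrained-track setup, a minimal sketch of casting language identification as sequence classification on top of XLSR-53 is shown below; the label count and the omitted data pipeline and training loop are assumptions.

```python
# Minimal sketch: language identification as sequence classification on top of
# the XLSR-53 encoder. The label count matches VoxLingua107 for illustration;
# the data pipeline and training loop are omitted.
from transformers import Wav2Vec2ForSequenceClassification

model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",   # XLSR-53 starting point named in the abstract
    num_labels=107,                      # e.g. the VoxLingua107 language inventory
)
model.freeze_feature_encoder()
```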
Abstract: Language identification from speech is a common preprocessing step in many spoken language processing systems. In recent years, this field has seen rapid progress, mostly due to the use of self-supervised models pretrained on multilingual data and the use of large training corpora. This paper shows that for speech with a non-native or regional accent, the accuracy of spoken language identification systems drops dramatically, and that the accuracy of identifying the language is inversely correlated with the strength of the accent. We also show that using the output of a lexicon-free speech recognition system of the particular language helps to improve language identification performance on accented speech by a large margin, without sacrificing accuracy on native speech. We obtain relative error rate reductions ranging from 35% to 63% over the state-of-the-art model across several non-native speech datasets.
Abstract: In this paper, we demonstrate methods for reliable and efficient training of discrete representations using Vector-Quantized Variational Auto-Encoder models (VQ-VAEs). Discrete latent variable models have been shown to learn nontrivial representations of speech, applicable to unsupervised voice conversion and reaching state-of-the-art performance on unit discovery tasks. For unsupervised representation learning, they have become viable alternatives to continuous latent variable models such as the Variational Auto-Encoder (VAE). However, training deep discrete variable models is challenging, due to the inherent non-differentiability of the discretization operation. In this paper, we focus on VQ-VAE, a state-of-the-art discrete bottleneck model shown to perform on par with its continuous counterparts. It quantizes encoder outputs with on-line $k$-means clustering. We show that the codebook learning can suffer from poor initialization and non-stationarity of the clustered encoder outputs. We demonstrate that these issues can be successfully overcome by increasing the learning rate for the codebook and by periodic data-dependent codeword re-initialization. As a result, we achieve more robust training across different tasks and significantly increase the usage of latent codewords, even for large codebooks. This has practical benefits, for instance, in unsupervised representation learning, where large codebooks may lead to disentanglement of latent representations.
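A hedged sketch of the two training tricks, a larger codebook learning rate and periodic data-dependent re-initialization of rarely used codewords, is given below; the module sizes, learning rates and usage threshold are illustrative, not the paper's exact settings.

```python
# Hedged sketch of the two tricks: a higher learning rate for the codebook than
# for the rest of the model, and periodic data-dependent re-initialization of
# rarely used codewords from recent encoder outputs. Sizes, rates and the
# usage threshold are illustrative.
import torch

codebook = torch.nn.Embedding(512, 64)     # 512 codewords of dimension 64 (placeholder sizes)
encoder = torch.nn.Linear(80, 64)          # stand-in for the real encoder

optimizer = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 1e-4},
    {"params": codebook.parameters(), "lr": 1e-3},   # ~10x larger codebook learning rate
])

def reinit_dead_codewords(codebook, encoder_outputs, usage_counts, min_usage=1):
    """Re-seed codewords used fewer than min_usage times with random recent encoder outputs."""
    dead = (usage_counts < min_usage).nonzero(as_tuple=True)[0]
    if len(dead) > 0:
        idx = torch.randint(0, encoder_outputs.shape[0], (len(dead),))
        with torch.no_grad():
            codebook.weight[dead] = encoder_outputs[idx]
```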
Abstract: This paper describes the current TTÜ speech transcription system for Estonian speech. The system is designed to handle semi-spontaneous speech, such as broadcast conversations, lecture recordings and interviews recorded in diverse acoustic conditions. The system is based on the Kaldi toolkit. Multi-condition training using background noise profiles extracted automatically from untranscribed data is used to improve the robustness of the system. Out-of-vocabulary words are recovered using a phoneme n-gram based decoding subgraph and an FST-based phoneme-to-grapheme model. The system achieves a word error rate of 8.1% on a test set of broadcast conversations. The system also performs punctuation recovery and speaker identification. Speaker identification models are trained using a recently proposed weakly supervised training method.
Abstract: We propose an approach for training speaker identification models in a weakly supervised manner. We concentrate on the setting where the training data consists of a set of audio recordings and the speaker annotation is provided only at the recording level. The method uses speaker diarization to find unique speakers in each recording, and i-vectors to project the speech of each speaker to a fixed-dimensional vector. A neural network is then trained to map i-vectors to speakers, using a special objective function that allows the model to be optimized using recording-level speaker labels. We report experiments on two different real-world datasets. On the VoxCeleb dataset, the method achieves 94.6% accuracy on a closed-set speaker identification task, surpassing the baseline performance by a large margin. On an Estonian broadcast news dataset, the method provides 66% time-weighted speaker identification recall at 93% precision.
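One possible, hypothetical realization of a recording-level objective is sketched below: per-cluster speaker posteriors are max-pooled so that each annotated speaker is credited to its best-matching diarized cluster. This illustrates the weak-supervision idea only; it is not the paper's exact objective function.

```python
# Hypothetical sketch of a recording-level objective: the network produces a
# speaker posterior for each diarized cluster's i-vector, and each annotated
# speaker of the recording is credited to its best-matching cluster via
# max-pooling. Not the paper's exact objective function.
import torch

def recording_level_loss(posteriors, recording_speakers):
    """posteriors: (num_clusters, num_speakers) softmax outputs for one recording;
    recording_speakers: indices of speakers annotated for that recording."""
    loss = torch.zeros(())
    for spk in recording_speakers:
        best = posteriors[:, spk].max()          # best-matching diarized cluster
        loss = loss - torch.log(best + 1e-8)
    return loss
```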