Jakob Poncelet

Leveraging Broadcast Media Subtitle Transcripts for Automatic Speech Recognition and Subtitling

Feb 05, 2025

Jakob Poncelet, Hugo Van hamme

Abstract: Recent advances in speech recognition have been driven by large-scale datasets and attention-based architectures, but many challenges remain, especially for low-resource languages and dialects. This paper explores the integration of weakly supervised transcripts from TV subtitles into automatic speech recognition (ASR) systems, aiming to improve both verbatim transcriptions and automatically generated subtitles. To this end, verbatim data and subtitles are regarded as different domains or languages, due to their distinct characteristics. We propose and compare several end-to-end architectures that are designed to jointly model both modalities with separate or shared encoders and decoders. The proposed methods are able to jointly generate a verbatim transcription and a subtitle. Evaluation on Flemish (Belgian Dutch) demonstrates that a model with cascaded encoders and separate decoders represents the differences between the two data types most efficiently while improving on both domains. Despite differences in domain and linguistic variations, combining verbatim transcripts with subtitle data leads to notable ASR improvements without the need for extensive preprocessing. Additionally, experiments with a large-scale subtitle dataset show the scalability of the proposed approach. The methods not only improve ASR accuracy but also generate subtitles that closely match standard written text, offering several potential applications.

* Preprint

Efficient Extraction of Noise-Robust Discrete Units from Self-Supervised Speech Models

Sep 04, 2024

Jakob Poncelet, Yujun Wang, Hugo Van hamme

Abstract: Continuous speech can be converted into a discrete sequence by deriving discrete units from the hidden features of self-supervised learned (SSL) speech models. Although SSL models are becoming larger and trained on more data, they are often sensitive to real-life distortions like additive noise or reverberation, which translates to a shift in discrete units. We propose a parameter-efficient approach to generate noise-robust discrete units from pre-trained SSL models by training a small encoder-decoder model, with or without adapters, to simultaneously denoise and discretise the hidden features of the SSL model. The model learns to generate a clean discrete sequence for a noisy utterance, conditioned on the SSL features. The proposed denoiser outperforms several pre-training methods on the tasks of noisy discretisation and noisy speech recognition, and can be finetuned to the target environment with a few recordings of unlabeled target data.

* Accepted at SLT2024
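
The "shift in discrete units" mentioned in the abstract above can be illustrated with a small sketch. It assumes units are obtained by nearest-centroid assignment (as with k-means clustering), which is a common choice but is not stated in the abstract. The centroids, feature frames, noise offsets, and function names are all hypothetical toy values. The sketch shows only the problem the paper addresses, not the proposed denoiser.

```python
import math

def nearest_unit(frame, centroids):
    """Return the index of the centroid closest to one feature frame."""
    distances = [math.dist(frame, c) for c in centroids]
    return distances.index(min(distances))

def discretise(frames, centroids):
    """Map a sequence of feature frames to a sequence of discrete units."""
    return [nearest_unit(f, centroids) for f in frames]

def unit_mismatch_rate(clean_units, noisy_units):
    """Fraction of positions where the noisy unit differs from the clean one."""
    if len(clean_units) != len(noisy_units):
        raise ValueError("sequences must have the same length")
    diffs = sum(a != b for a, b in zip(clean_units, noisy_units))
    return diffs / len(clean_units)

# Hypothetical 2-D "SSL features"; real features have hundreds of dimensions.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
clean_frames = [(0.1, 0.0), (0.9, 0.1), (0.1, 0.9), (0.2, 0.1)]
# Hypothetical distortion added to each frame.
noise = [(0.0, 0.0), (-0.5, 0.0), (0.0, 0.0), (0.5, 0.0)]
noisy_frames = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(clean_frames, noise)]

clean_units = discretise(clean_frames, centroids)
noisy_units = discretise(noisy_frames, centroids)
mismatch = unit_mismatch_rate(clean_units, noisy_units)
```

In this toy setup, two of the four frames cross a cluster boundary once the distortion is added, so half of the units change even though the underlying content is the same.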

Unsupervised Accent Adaptation Through Masked Language Model Correction Of Discrete Self-Supervised Speech Units

Sep 25, 2023

Jakob Poncelet, Hugo Van hamme

Abstract: Self-supervised pre-trained speech models have strongly improved speech recognition, yet they are still sensitive to domain shifts and accented or atypical speech. Many of these models rely on quantisation or clustering to learn discrete acoustic units. We propose to correct the discovered discrete units for accented speech back to a standard pronunciation in an unsupervised manner. A masked language model is trained on discrete units from a standard accent and iteratively corrects an accented token sequence by masking unexpected cluster sequences and predicting their common variant. Small accent adapter blocks are inserted in the pre-trained model and fine-tuned by predicting the corrected clusters, which increases the robustness of the pre-trained model towards a target accent, all without supervision. We are able to improve a state-of-the-art HuBERT Large model on a downstream accented speech recognition task by altering the training regime with the proposed method.

* Submitted to ICASSP2024
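
The iterative correction loop described in the abstract above can be sketched in a heavily simplified form. This is not the paper's method: a count-based model over (left, right) neighbour contexts stands in for the masked language model, and the threshold, iteration limit, unit sequences, and function names are all hypothetical. The sketch only shows the idea of flagging an unexpected unit and replacing it with the variant most common in standard-accent data.

```python
from collections import Counter, defaultdict

def train_context_model(sequences):
    """Count which unit appears between each (left, right) neighbour pair."""
    model = defaultdict(Counter)
    for seq in sequences:
        for left, centre, right in zip(seq, seq[1:], seq[2:]):
            model[(left, right)][centre] += 1
    return model

def correct_units(units, model, threshold=0.2, max_iters=5):
    """Replace units that are unlikely in their context, until nothing changes."""
    units = list(units)
    for _ in range(max_iters):
        changed = False
        for i in range(1, len(units) - 1):
            counts = model.get((units[i - 1], units[i + 1]))
            if not counts:
                continue  # context never seen in standard data: leave as is
            total = sum(counts.values())
            if counts[units[i]] / total < threshold:
                # "Mask" the unexpected unit and predict the common variant.
                units[i] = counts.most_common(1)[0][0]
                changed = True
        if not changed:
            break
    return units

# Hypothetical unit sequences from a standard accent.
standard = [[1, 2, 3, 4], [1, 2, 3, 4], [5, 2, 3, 4]]
model = train_context_model(standard)

# Hypothetical accented sequence in which unit 7 replaces the usual unit 2.
accented = [1, 7, 3, 4]
corrected = correct_units(accented, model)
```

In the paper, the corrected sequence then serves as the prediction target for fine-tuning the adapter blocks; that step is outside the scope of this sketch.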

Learning to Jointly Transcribe and Subtitle for End-to-End Spontaneous Speech Recognition

Oct 14, 2022

Jakob Poncelet, Hugo Van hamme

Abstract: TV subtitles are a rich source of transcriptions of many types of speech, ranging from read speech in news reports to conversational and spontaneous speech in talk shows and soaps. However, subtitles are not verbatim (i.e. exact) transcriptions of speech, so they cannot be used directly to improve an Automatic Speech Recognition (ASR) model. We propose a multitask dual-decoder Transformer model that jointly performs ASR and automatic subtitling. The ASR decoder (possibly pre-trained) predicts the verbatim output and the subtitle decoder generates a subtitle, while sharing the encoder. The two decoders can be independent or connected. The model is trained to perform both tasks jointly, and is able to effectively use subtitle data. We show improvements on regular ASR and on spontaneous and conversational ASR by incorporating the additional subtitle decoder. The method does not require preprocessing (aligning, filtering, pseudo-labeling, ...) of the subtitles.

* Accepted at SLT 2022

Comparison of Self-Supervised Speech Pre-Training Methods on Flemish Dutch

Sep 29, 2021

Jakob Poncelet, Hugo Van hamme

Abstract: Recent research in speech processing shows a growing interest in unsupervised and self-supervised representation learning from unlabelled data to alleviate the need for large amounts of annotated data. We investigate several popular pre-training methods and apply them to Flemish Dutch. We compare off-the-shelf English pre-trained models to models trained on an increasing amount of Flemish data. We find that the most important factors for positive transfer to downstream speech recognition tasks include a substantial amount of data and a matching pre-training domain. Ideally, we also finetune on an annotated subset in the target language. All pre-trained models improve linear phone separability in Flemish, but not all methods improve Automatic Speech Recognition. We observe the best performance with wav2vec 2.0, and we obtain a 30% WER improvement by finetuning the multilingually pre-trained XLSR-53 model on Flemish Dutch, after integration into an HMM-DNN acoustic model.

* To be published in the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2021)
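
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a minimal implementation of that standard definition. The abstract does not say whether the 30% figure is relative or absolute, so the second function shows only one possible reading (a relative reduction), and the sentences and WER values used are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

def relative_wer_reduction(baseline_wer, new_wer):
    """Relative improvement of new_wer over baseline_wer (assumed reading)."""
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical example: one deleted word out of six reference words.
wer = word_error_rate("de kat zit op de mat", "de kat zit de mat")

# Hypothetical numbers: going from 20% to 14% WER is a 30% relative reduction.
reduction = relative_wer_reduction(0.20, 0.14)
```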
