Bolaji Yusuf

DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition

Dec 30, 2024

Alexander Polok, Dominik Klement, Martin Kocour, Jiangyu Han, Federico Landini, Bolaji Yusuf, Matthew Wiesner, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget

Abstract:Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. DiCoW extends the pre-trained Whisper model by integrating diarization labels directly, eliminating reliance on speaker embeddings and reducing the need for extensive speaker-specific training data. Our method introduces frame-level diarization-dependent transformations (FDDT) and query-key biasing (QKb) techniques to refine the model's focus on target speakers while effectively handling overlapping speech. By leveraging diarization outputs as conditioning signals, DiCoW simplifies the workflow for multi-speaker ASR, improves generalization to unseen speakers and enables more reliable transcription in real-world multi-speaker recordings. Additionally, we explore the integration of a connectionist temporal classification (CTC) head into Whisper and demonstrate its ability to improve transcription efficiency through hybrid decoding. Notably, we show that our approach is not limited to Whisper; it also provides similar benefits when applied to the Branchformer model. We validate DiCoW on real-world datasets, including AMI and NOTSOFAR-1 from the CHiME-8 challenge, as well as synthetic benchmarks such as Libri2Mix and LibriCSS, enabling direct comparisons with previous methods. Results demonstrate that DiCoW enhances the model's target-speaker ASR capabilities while maintaining Whisper's accuracy and robustness on single-speaker data.

Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models

Jul 05, 2024

Bolaji Yusuf, Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran

Abstract:This paper explores speculative speech recognition (SSR), where we empower conventional automatic speech recognition (ASR) with speculation capabilities, allowing the recognizer to run ahead of audio. We introduce a metric for measuring SSR performance and we propose a model which does SSR by combining an RNN-Transducer-based ASR system with an audio-prefixed language model (LM). The ASR system transcribes ongoing audio and feeds the resulting transcripts, along with an audio-dependent prefix, to the LM, which speculates likely completions for the transcriptions. We experiment with a variety of ASR datasets, on which we show the efficacy of our method and the feasibility of SSR as a method of reducing ASR latency.

* Interspeech 2024

Written Term Detection Improves Spoken Term Detection

Jul 05, 2024

Bolaji Yusuf, Murat Saraçlar

Abstract:End-to-end (E2E) approaches to keyword search (KWS) are considerably simpler in terms of training and indexing complexity when compared to approaches which use the output of automatic speech recognition (ASR) systems. This simplification, however, has drawbacks due to the loss of modularity. In particular, where ASR-based KWS systems can benefit from external unpaired text via a language model, current formulations of E2E KWS systems have no such mechanism. Therefore, in this paper, we propose a multitask training objective which allows unpaired text to be integrated into E2E KWS without complicating indexing and search. In addition to training an E2E KWS model to retrieve text queries from spoken documents, we jointly train it to retrieve text queries from masked written documents. We show empirically that this approach can effectively leverage unpaired text for KWS, with significant improvements in search performance across a wide variety of languages. Our analysis indicates that these improvements are achieved because the proposed method improves document representations for words in the unpaired text. Finally, we show that the proposed method can be used for domain adaptation in settings where in-domain paired data is scarce or nonexistent.

* IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 32, pp. 3213-3223, 2024. Code at https://github.com/bolajiy/golden-retriever

Pretraining End-to-End Keyword Search with Automatically Discovered Acoustic Units

Jul 05, 2024

Bolaji Yusuf, Jan "Honza" Černocký, Murat Saraçlar

Abstract:End-to-end (E2E) keyword search (KWS) has emerged as an alternative and complementary approach to conventional keyword search which depends on the output of automatic speech recognition (ASR) systems. While E2E methods greatly simplify the KWS pipeline, they generally have worse performance than their ASR-based counterparts, which can benefit from pretraining with untranscribed data. In this work, we propose a method for pretraining E2E KWS systems with untranscribed data, which involves using acoustic unit discovery (AUD) to obtain discrete units for untranscribed data and then learning to locate sequences of such units in the speech. We conduct experiments across languages and AUD systems: we show that finetuning such a model significantly outperforms a model trained from scratch, and the performance improvements are generally correlated with the quality of the AUD system used for pretraining.

* Interspeech 2024. KWS code at: https://github.com/bolajiy/golden-retriever; AUD code at https://github.com/beer-asr/beer/tree/master/recipes/hshmm

End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations

Aug 15, 2023

Bolaji Yusuf, Jan Cernocky, Murat Saraclar

Abstract:Conventional keyword search systems operate on automatic speech recognition (ASR) outputs, which causes them to have a complex indexing and search pipeline. This has led to interest in ASR-free approaches to simplify the search procedure. We recently proposed a neural ASR-free keyword search model which achieves competitive performance while maintaining an efficient and simplified pipeline, where queries and documents are encoded with a pair of recurrent neural network encoders and the encodings are combined with a dot-product. In this article, we extend this work with multilingual pretraining and detailed analysis of the model. Our experiments show that the proposed multilingual training significantly improves the model performance and that despite not matching a strong ASR-based conventional keyword search system for short queries and queries comprising in-vocabulary words, the proposed model outperforms the ASR-based system for long queries and queries that do not appear in the training data.

* IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 31, pp. 3070-3080, 2023

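The dot-product combination of query and document encodings described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the vectors are hypothetical stand-ins for encoder outputs, the function name `frame_scores` is made up for illustration, and squashing the dot product with a sigmoid is an assumption.

```python
import math


def frame_scores(query_vec, doc_frames):
    """Score each document frame against one query encoding.

    Each score is the dot product of the query vector with a frame vector,
    passed through a sigmoid (the sigmoid is an assumption of this sketch).
    """
    scores = []
    for frame in doc_frames:
        logit = sum(q * d for q, d in zip(query_vec, frame))
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


# Hypothetical encodings standing in for the outputs of the two encoders.
query = [1.0, -1.0]
document = [[2.0, -2.0], [0.0, 0.0], [-2.0, 2.0]]
scores = frame_scores(query, document)
```

Frames whose encodings align with the query get scores near 1, orthogonal frames get 0.5, and opposed frames get scores near 0.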

On-the-fly Text Retrieval for End-to-End ASR Adaptation

Mar 20, 2023

Bolaji Yusuf, Aditya Gourav, Ankur Gandhe, Ivan Bulyko

Abstract:End-to-end speech recognition models are improved by incorporating external text sources, typically by fusion with an external language model. Such language models have to be retrained whenever the corpus of interest changes. Furthermore, since they store the entire corpus in their parameters, rare words can be challenging to recall. In this work, we propose augmenting a transducer-based ASR model with a retrieval language model, which directly retrieves from an external text corpus plausible completions for a partial ASR hypothesis. These completions are then integrated into subsequent predictions by an adapter, which is trained once, so that the corpus of interest can be switched without incurring the computational overhead of retraining. Our experiments show that the proposed model significantly improves the performance of a transducer baseline on a pair of question-answering datasets. Further, it outperforms shallow fusion on recognition of named entities by about 7% relative; when the two are combined, the relative improvement increases to 13%.

* Accepted to ICASSP 2023; Appendix added to include ablations that could not fit into the conference 4-page limit

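The idea of retrieving completions for a partial hypothesis from a swappable text corpus can be sketched with a toy example. This is only an illustration under strong assumptions: it uses exact matching on the last few words, whereas the abstract does not specify the retrieval mechanism, and the function name `retrieve_completions` and the sample corpus are hypothetical.

```python
def retrieve_completions(partial_hyp, corpus, context_words=2, max_results=3):
    """Return continuations of sentences that contain the last few words
    of the partial hypothesis (toy exact-match retrieval, an assumption)."""
    context = partial_hyp.lower().split()[-context_words:]
    completions = []
    for sentence in corpus:
        words = sentence.lower().split()
        # Only consider positions that leave at least one word to complete.
        for i in range(len(words) - len(context)):
            if words[i:i + len(context)] == context:
                completions.append(" ".join(words[i + len(context):]))
                break
    return completions[:max_results]


# Hypothetical corpus; swapping it requires no retraining in this sketch.
corpus = [
    "who wrote the hobbit",
    "who wrote pride and prejudice",
    "where is the louvre",
]
completions = retrieve_completions("Who wrote", corpus)
```

Because the corpus is passed in as an argument, it can be replaced at query time, which mirrors the abstract's point that the corpus of interest can be switched without retraining.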

USTED: Improving ASR with a Unified Speech and Text Encoder-Decoder

Feb 12, 2022

Bolaji Yusuf, Ankur Gandhe, Alex Sokolov

Abstract:Improving end-to-end speech recognition by incorporating external text data has been a longstanding research topic. There has been a recent focus on training E2E ASR models that get the performance benefits of external text data without incurring the extra cost of evaluating an external language model at inference time. In this work, we propose training an ASR model jointly with a set of text-to-text auxiliary tasks with which it shares a decoder and parts of the encoder. When we jointly train ASR and a masked language model with the 960-hour Librispeech and Opensubtitles data respectively, we observe WER reductions of 16% and 20% on test-other and test-clean respectively over an ASR-only baseline without any extra cost at inference time, and reductions of 6% and 8% compared to a stronger MUTE-L baseline which trains the decoder with the same text data as our model. We achieve further improvements when we train the masked language model on Librispeech data or when we use machine translation as the auxiliary task, without significantly sacrificing performance on the task itself.

* 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022)

End-to-End Open Vocabulary Keyword Search

Aug 23, 2021

Bolaji Yusuf, Alican Gok, Batuhan Gundogdu, Murat Saraclar

Abstract:Recently, neural approaches to spoken content retrieval have become popular. However, they tend to be restricted in their vocabulary or in their ability to deal with imbalanced test settings. These restrictions limit their applicability in keyword search, where the set of queries is not known beforehand, and where the system should return not just whether an utterance contains a query but the exact location of any such occurrences. In this work, we propose a model directly optimized for keyword search. The model takes a query and an utterance as input and returns a sequence of probabilities for each frame of the utterance of the query having occurred in that frame. Experiments show that the proposed model not only outperforms similar end-to-end models on a task where the ratio of positive and negative trials is artificially balanced, but it is also able to deal with the far more challenging task of keyword search with its inherent imbalance. Furthermore, using our system to rescore the outputs of an LVCSR-based keyword search system leads to significant improvements on the latter.

* Interspeech 2021

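One simple way to turn per-frame query probabilities, as described in the abstract above, into located occurrences is to threshold them and group consecutive frames. The sketch below assumes this post-processing step; the abstract does not say how locations are extracted, and the function name `detect_occurrences`, the threshold of 0.5, and the sample probabilities are all hypothetical.

```python
def detect_occurrences(frame_probs, threshold=0.5):
    """Return (start, end) frame index pairs, end exclusive, for each run of
    consecutive frames whose probability is at or above the threshold."""
    hits = []
    start = None
    for i, p in enumerate(frame_probs):
        if p >= threshold and start is None:
            start = i  # a new run begins
        elif p < threshold and start is not None:
            hits.append((start, i))  # the current run ends
            start = None
    if start is not None:
        # The last run extends to the end of the utterance.
        hits.append((start, len(frame_probs)))
    return hits


# Hypothetical per-frame probabilities for one query and one utterance.
probs = [0.1, 0.7, 0.9, 0.2, 0.1, 0.8]
hits = detect_occurrences(probs)
```

Each returned pair marks one candidate occurrence, so an utterance can yield zero, one, or several hits for the same query.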

Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings

Jun 08, 2021

Marcely Zanon Boito, Bolaji Yusuf, Lucas Ondel, Aline Villavicencio, Laurent Besacier

Abstract:When documenting oral languages, Unsupervised Word Segmentation (UWS) from speech is a useful, yet challenging, task. It can be performed from phonetic transcriptions, or in the absence of these, from the output of unsupervised speech discretization models. These discretization models are trained using raw speech only, producing discrete speech units which can be applied for downstream (text-based) tasks. In this paper we compare five of these models: three Bayesian and two neural approaches, with regard to the exploitability of the produced units for UWS. We experiment with two UWS models and report results for Finnish, Hungarian, Mboshi, Romanian and Russian in a low-resource setting (using only 5k sentences). Our results suggest that neural models for speech discretization are difficult to exploit in our setting, and that it might be necessary to adapt them to limit sequence length. We obtain our best UWS results by using the SHMM and H-SHMM Bayesian models, which produce high quality, yet compressed, discrete representations of the input speech signal.

A Hierarchical Subspace Model for Language-Attuned Acoustic Unit Discovery

Nov 09, 2020

Bolaji Yusuf, Lucas Ondel, Lukas Burget, Jan Cernocky, Murat Saraclar

Abstract:In this work, we propose a hierarchical subspace model for acoustic unit discovery. In this approach, we frame the task as one of learning embeddings on a low-dimensional phonetic subspace, and simultaneously specify the subspace itself as an embedding on a hyper-subspace. We train the hyper-subspace on a set of transcribed languages and transfer it to the target language. In the target language, we infer both the language and unit embeddings in an unsupervised manner, and in so doing, we simultaneously learn a subspace of units specific to that language and the units that dwell on it. We conduct our experiments on TIMIT and two low-resource languages: Mboshi and Yoruba. Results show that our model outperforms major acoustic unit discovery techniques, both in terms of clustering quality and segmentation accuracy.

* Submitted to ICASSP 2021
