Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Herman Kamper

Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis

Jul 02, 2025

Marc-André Carbonneau, Benjamin van Niekerk, Hugo Seuté, Jean-Philippe Letendre, Herman Kamper, Julian Zaïdi

Abstract:Modeling voice identity is challenging due to its multifaceted nature. In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings, designed for discrimination rather than characterizing identity. This paper investigates which aspects of a voice are captured in such representations. We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm. We also identify confounding factors that compromise speaker similarity measurements and suggest mitigation strategies. To address these gaps, we propose U3D, a metric that evaluates speakers' dynamic rhythm patterns. This work contributes to the ongoing challenge of assessing speaker identity consistency in the context of ever-better voice cloning systems. We publicly release our code.

* Accepted at SSW13 - Interspeech 2025 Speech Synthesis Workshop

Via

Access Paper or Ask Questions

The mutual exclusivity bias of bilingual visually grounded speech models

Jun 04, 2025

Dan Oneata, Leanne Nortje, Yevgen Matusevych, Herman Kamper

Abstract:Mutual exclusivity (ME) is a strategy where a novel word is associated with a novel object rather than a familiar one, facilitating language learning in children. Recent work has found an ME bias in a visually grounded speech (VGS) model trained on English speech with paired images. But ME has also been studied in bilingual children, who may employ it less due to cross-lingual ambiguity. We explore this pattern computationally using bilingual VGS models trained on combinations of English, French, and Dutch. We find that bilingual models generally exhibit a weaker ME bias than monolingual models, though exceptions exist. Analyses show that the combined visual embeddings of bilingual models have a smaller variance for familiar data, partly explaining the increase in confusion between novel and familiar concepts. We also provide new insights into why the ME bias exists in VGS models in the first place. Code and data: https://github.com/danoneata/me-vgs

* Interspeech 2025

Via

Access Paper or Ask Questions

Spoken Language Modeling with Duration-Penalized Self-Supervised Units

May 29, 2025

Nicol Visser, Herman Kamper

Abstract:Spoken language models (SLMs) operate on acoustic units obtained by discretizing self-supervised speech representations. Although the characteristics of these units directly affect performance, the interaction between codebook size and unit coarseness (i.e., duration) remains unexplored. We investigate SLM performance as we vary codebook size and unit coarseness using the simple duration-penalized dynamic programming (DPDP) method. New analyses are performed across different linguistic levels. At the phone and word levels, coarseness provides little benefit, as long as the codebook size is chosen appropriately. However, when producing whole sentences in a resynthesis task, SLMs perform better with coarser units. In lexical and syntactic language modeling tasks, coarser units also give higher accuracies at lower bitrates. We therefore show that coarser units aren't always better, but that DPDP is a simple and efficient way to obtain coarser units for the tasks where they are beneficial.

* Accepted to Interspeech 2025

Via

Access Paper or Ask Questions

Speech Recognition for Automatically Assessing Afrikaans and isiXhosa Preschool Oral Narratives

Jan 11, 2025

Christiaan Jacobs, Annelien Smith, Daleen Klop, Ondřej Klejch, Febe de Wet, Herman Kamper

Abstract:We develop automatic speech recognition (ASR) systems for stories told by Afrikaans and isiXhosa preschool children. Oral narratives provide a way to assess children's language development before they learn to read. We consider a range of prior child-speech ASR strategies to determine which is best suited to this unique setting. Using Whisper and only 5 minutes of transcribed in-domain child speech, we find that additional in-domain adult data (adult speech matching the story domain) provides the biggest improvement, especially when coupled with voice conversion. Semi-supervised learning also helps for both languages, while parameter-efficient fine-tuning helps on Afrikaans but not on isiXhosa (which is under-represented in the Whisper model). Few child-speech studies look at non-English data, and even fewer at the preschool ages of 4 and 5. Our work therefore represents a unique validation of a wide range of previous child-speech ASR strategies in an under-explored setting.

* Accepted to ICASSP 2025

Via

Access Paper or Ask Questions

MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model

Jan 10, 2025

Matthew Baas, Pieter Scholtz, Arnav Mehta, Elliott Dyson, Akshat Prakash, Herman Kamper

Abstract:Codec-based text-to-speech (TTS) models have shown impressive quality with zero-shot voice cloning abilities. However, they often struggle with more expressive references or complex text inputs. We present MARS6, a robust encoder-decoder transformer for rapid, expressive TTS. MARS6 is built on recent improvements in spoken language modelling. Utilizing a hierarchical setup for its decoder, new speech tokens are processed at a rate of only 12 Hz, enabling efficient modelling of long-form text while retaining reconstruction quality. We combine several recent training and inference techniques to reduce repetitive generation and improve output stability and quality. This enables the 70M-parameter MARS6 to achieve similar performance to models many times larger. We show this in objective and subjective evaluations, comparing TTS output quality and reference speaker cloning ability. Project page: https://camb-ai.github.io/mars6-turbo/

* 5 pages, 2 figures, 1 table. Accepted at ICASSP 2025

Via

Access Paper or Ask Questions

Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming

Sep 22, 2024

Simon Malan, Benjamin van Niekerk, Herman Kamper

Figure 1 for Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming

Figure 2 for Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming

Figure 3 for Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming

Figure 4 for Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming

Abstract:We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon. Several previous methods use a scoring model coupled with dynamic programming to find an optimal segmentation. Here we propose a much simpler strategy: we predict word boundaries using the dissimilarity between adjacent self-supervised features, then we cluster the predicted segments to construct a lexicon. For a fair comparison, we update the older ES-KMeans dynamic programming method with better features and boundary constraints. On the five-language ZeroSpeech benchmarks, our simple approach gives similar state-of-the-art results compared to the new ES-KMeans+ method, while being almost five times faster.

* 3 figures, 3 tables

Via

Access Paper or Ask Questions

Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings

Sep 09, 2024

Leanne Nortje, Dan Oneata, Herman Kamper

Figure 1 for Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings

Figure 2 for Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings

Figure 3 for Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings

Figure 4 for Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings

Abstract:Given an image query, visually prompted keyword localisation (VPKL) aims to find occurrences of the depicted word in a speech collection. This can be useful when transcriptions are not available for a low-resource language (e.g. if it is unwritten). Previous work showed that VPKL can be performed with a visually grounded speech model trained on paired images and unlabelled speech. But all experiments were done on English. Moreover, transcriptions were used to get positive and negative pairs for the contrastive loss. This paper introduces a few-shot learning scheme to mine pairs automatically without transcriptions. On English, this results in only a small drop in performance. We also - for the first time - consider VPKL on a real low-resource language, Yoruba. While scores are reasonable, here we see a bigger drop in performance compared to using ground truth pairs because the mining is less accurate in Yoruba.

Via

Access Paper or Ask Questions

Spoken-Term Discovery using Discrete Speech Units

Aug 26, 2024

Benjamin van Niekerk, Julian Zaïdi, Marc-André Carbonneau, Herman Kamper

Abstract:Discovering a lexicon from unlabeled audio is a longstanding challenge for zero-resource speech processing. One approach is to search for frequently occurring patterns in speech. We revisit this idea with DUSTED: Discrete Unit Spoken-TErm Discovery. Leveraging self-supervised models, we encode input audio into sequences of discrete units. Next, we find repeated patterns by searching for similar unit sub-sequences, inspired by alignment algorithms from bioinformatics. Since discretization discards speaker information, DUSTED finds better matches across speakers, improving the coverage and consistency of the discovered patterns. We demonstrate these improvements on the ZeroSpeech Challenge, achieving state-of-the-art results on the spoken-term discovery track. Finally, we analyze the duration distribution of the patterns, showing that our method finds longer word- or phrase-like terms.

* Accepted to Interspeech 2024

Via

Access Paper or Ask Questions

Translating speech with just images

Jun 11, 2024

Dan Oneata, Herman Kamper

Abstract:Visually grounded speech models link speech to images. We extend this connection by linking images to text via an existing image captioning system, and as a result gain the ability to map speech audio directly to text. This approach can be used for speech translation with just images by having the audio in a different language from the generated captions. We investigate such a system on a real low-resource language, Yor\`ub\'a, and propose a Yor\`ub\'a-to-English speech translation model that leverages pretrained components in order to be able to learn in the low-resource regime. To limit overfitting, we find that it is essential to use a decoding scheme that produces diverse image captions for training. Results show that the predicted translations capture the main semantics of the spoken audio, albeit in a simpler and shorter form.

* Accepted at Interspeech 2024

Via

Access Paper or Ask Questions

Visually Grounded Speech Models have a Mutual Exclusivity Bias

Mar 20, 2024

Leanne Nortje, Dan Oneaţă, Yevgen Matusevych, Herman Kamper

Abstract:When children learn new words, they employ constraints such as the mutual exclusivity (ME) bias: a novel word is mapped to a novel object rather than a familiar one. This bias has been studied computationally, but only in models that use discrete word representations as input, ignoring the high variability of spoken words. We investigate the ME bias in the context of visually grounded speech models that learn from natural images and continuous speech audio. Concretely, we train a model on familiar words and test its ME bias by asking it to select between a novel and a familiar object when queried with a novel word. To simulate prior acoustic and visual knowledge, we experiment with several initialisation strategies using pretrained speech and vision networks. Our findings reveal the ME bias across the different initialisation approaches, with a stronger bias in models with more prior (in particular, visual) knowledge. Additional tests confirm the robustness of our results, even when different loss functions are considered.

* Accepted to TACL, pre-MIT Press publication version

Via

Access Paper or Ask Questions