Themos Stafylakis

Analysis of ABC Frontend Audio Systems for the NIST-SRE24

May 21, 2025

Sara Barahona, Anna Silnova, Ladislav Mošner, Junyi Peng, Oldřich Plchot, Johan Rohdin, Lin Zhang, Jiangyu Han, Petr Palka, Federico Landini (+7 more)

Abstract: We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for training (fixed condition) or adding publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the predominant conversational telephone speech (CTS) domain. We explored architectures based on ResNet with different pooling mechanisms, the recently introduced ReDimNet architecture, as well as a system based on the XLS-R model, which represents the family of large pre-trained self-supervised models. In the open condition, we train on the VoxBlink2 dataset, containing 110 thousand speakers across multiple languages. We observed good performance and robustness of VoxBlink-trained models, and our experiments show practical recipes for developing state-of-the-art frontends for speaker recognition.

* Accepted at Interspeech 2025

State-of-the-art Embeddings with Video-free Segmentation of the Source VoxCeleb Data

Oct 03, 2024

Sara Barahona, Ladislav Mošner, Themos Stafylakis, Oldřich Plchot, Junyi Peng, Lukáš Burget, Jan Černocký

Abstract: In this paper, we refine and validate our method for training speaker embedding extractors using weak annotations. More specifically, we use only the audio stream of the source VoxCeleb videos and the names of the celebrities, without knowing the time intervals in which they appear in the recording. We experiment with hyperparameters and embedding extractors based on ResNet and WavLM. We show that the method achieves state-of-the-art results in speaker verification, comparable with training the extractors in a standard supervised way on the VoxCeleb dataset. We also extend it by considering segments belonging to unknown speakers appearing alongside the celebrities, which are typically discarded. Overall, our approach can be used for directly training state-of-the-art embedding extractors or as an alternative to the VoxCeleb-like pipeline for dataset creation without needing the image modality.

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

BUT Systems and Analyses for the ASVspoof 5 Challenge

Aug 20, 2024

Johan Rohdin, Lin Zhang, Oldřich Plchot, Vojtěch Staněk, David Mihola, Junyi Peng, Themos Stafylakis, Dmitriy Beveraki, Anna Silnova, Jan Brukner (+1 more)

Abstract: This paper describes the systems submitted by BUT to the ASVspoof 5 challenge, along with analyses. For the conventional deepfake detection task, we use ResNet18 and self-supervised models for the closed and open conditions, respectively. In addition, we analyze and visualize different combinations of speaker information and spoofing information as label schemes for training. For spoofing-robust automatic speaker verification (SASV), we introduce effective priors and propose using logistic regression to jointly train affine transformations of the countermeasure scores and the automatic speaker verification scores in such a way that the SASV LLR is optimized.

* 8 pages, ASVspoof 5 Workshop (Interspeech2024 Satellite)

Challenging margin-based speaker embedding extractors by using the variational information bottleneck

Jun 18, 2024

Themos Stafylakis, Anna Silnova, Johan Rohdin, Oldřich Plchot, Lukáš Burget

Abstract: Speaker embedding extractors are typically trained using a classification loss over the training speakers. During the last few years, the standard softmax/cross-entropy loss has been replaced by margin-based losses, yielding significant improvements in speaker recognition accuracy. Motivated by the fact that the margin merely reduces the logit of the target speaker during training, we consider a probabilistic framework that has a similar effect. The variational information bottleneck provides a principled mechanism for making deterministic nodes stochastic, resulting in an implicit reduction of the posterior of the target speaker. We experiment with a wide range of speaker recognition benchmarks and scoring methods and report results competitive with those obtained with the state-of-the-art Additive Angular Margin loss.

* Accepted at Interspeech 2024
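
The abstract's remark that the margin "merely reduces the logit of the target speaker" can be illustrated with a minimal sketch of an additive angular margin applied to cosine logits. This is not the authors' code: the function names, the margin of 0.2, the scale of 30 and the toy cosine values are all assumptions chosen for illustration, and the sketch ignores the case where the angle plus the margin exceeds pi.

```python
import math

def aam_logits(cosines, target, margin=0.2, scale=30.0):
    """Scaled cosine logits with an additive angular margin on the target class.

    `margin` and `scale` are illustrative values, not taken from the paper.
    """
    logits = []
    for k, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding
        if k == target:
            # Adding the margin to the angle lowers the target cosine
            # (valid while theta + margin <= pi; not handled otherwise).
            c = math.cos(math.acos(c) + margin)
        logits.append(scale * c)
    return logits

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example (log-sum-exp form)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

# Toy cosine similarities between one embedding and three speaker prototypes.
cosines = [0.8, 0.1, -0.3]
plain = [30.0 * c for c in cosines]
with_margin = aam_logits(cosines, target=0)

print(plain[0], with_margin[0])  # the target logit shrinks
print(cross_entropy(plain, 0), cross_entropy(with_margin, 0))  # the loss grows
```

Only the target entry changes, so the training loss on a correctly classified example becomes larger, which is the effect the abstract seeks to reproduce with a probabilistic mechanism instead.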

Comparing Data Augmentation Methods for End-to-End Task-Oriented Dialog Systems

Jun 10, 2024

Christos Vlachos, Themos Stafylakis, Ion Androutsopoulos

Abstract: Creating effective and reliable task-oriented dialog systems (ToDSs) is challenging, not only because of the complex structure of these systems, but also due to the scarcity of training data, especially when several modules need to be trained separately, each one with its own input/output training examples. Data augmentation (DA), whereby synthetic training examples are added to the training data, has been successful in other NLP systems, but has not been explored as extensively in ToDSs. We empirically evaluate the effectiveness of DA methods in an end-to-end ToDS setting, where a single system is trained to handle all processing stages, from user inputs to system outputs. We experiment with two ToDSs (UBAR, GALAXY) on two datasets (MultiWOZ, KVRET). We consider three types of DA methods (word-level, sentence-level, dialog-level), comparing eight DA methods that have shown promising results in ToDSs and other NLP systems. We show that all DA methods considered are beneficial, and we highlight the best ones, also providing advice to practitioners. We also introduce a more challenging few-shot cross-domain ToDS setting, reaching similar conclusions.

* There are 25 pages in total, 23 tables, 18 figures. Accepted in ACL 2024

Do End-to-End Neural Diarization Attractors Need to Encode Speaker Characteristic Information?

Feb 29, 2024

Lin Zhang, Themos Stafylakis, Federico Landini, Mireia Diez, Anna Silnova, Lukáš Burget

Abstract: In this paper, we apply the variational information bottleneck approach to end-to-end neural diarization with encoder-decoder attractors (EEND-EDA). This allows us to investigate what information is essential for the model. EEND-EDA utilizes vector representations of the speakers in a conversation, called attractors. Our analysis shows that attractors do not necessarily have to contain speaker characteristic information. On the other hand, giving the attractors more freedom, allowing them to encode some extra (possibly speaker-specific) information, leads to small but consistent diarization performance improvements. Despite architectural differences in EEND systems, the notion of attractors and frame embeddings is common to most of them and not specific to EEND-EDA. We believe that the main conclusions of this work can apply to other variants of EEND. Thus, we hope this paper will be a valuable contribution to guide the community to make more informed decisions when designing new systems.

* Submitted to Odyssey 2024
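
As a rough illustration of the variational information bottleneck mentioned in the abstract, the sketch below turns a deterministic vector into a stochastic one and computes the penalty that limits how much information it can carry. It assumes a diagonal Gaussian posterior and a standard normal prior, which is the common textbook setup and not necessarily the configuration used in the paper; the function names and toy values are hypothetical.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu, log_var = [1.0, -0.5], [0.0, -1.0]  # toy encoder outputs (assumed)

z = sample_latent(mu, log_var, rng)         # stochastic vector used downstream
penalty = kl_to_standard_normal(mu, log_var)
print(z, penalty)
```

In a VIB objective the penalty is added to the task loss with a weight, so a larger weight pushes the representation toward the prior and leaves less room for encoding extra information.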

DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors

Dec 22, 2023

Federico Landini, Mireia Diez, Themos Stafylakis, Lukáš Burget

Abstract: Until recently, the field of speaker diarization was dominated by cascaded systems. Due to their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to-end models have gained great popularity lately. One of the most successful models is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-based one and show its advantages over EEND-EDA: namely, obtaining better performance on the widely studied Callhome dataset, estimating the number of speakers in a conversation more accurately, and running inference in almost half the time on long recordings. Furthermore, when exhaustively compared with other methods, our model, DiaPer, reaches remarkable performance with a very lightweight design. In addition, we perform comparisons with other works and a cascaded baseline across more than ten public wide-band datasets. Together with this publication, we release the code of DiaPer as well as models trained on public and free data.

A Simple Baseline for Knowledge-Based Visual Question Answering

Oct 24, 2023

Alexandros Xenos, Themos Stafylakis, Ioannis Patras, Georgios Tzimiropoulos

Abstract: This paper addresses the problem of Knowledge-Based Visual Question Answering (KB-VQA). Recent works have emphasized the significance of incorporating both explicit (through external databases) and implicit (through LLMs) knowledge to answer questions requiring external knowledge effectively. A common limitation of such approaches is that they consist of relatively complicated pipelines and often rely heavily on access to the GPT-3 API. Our main contribution in this paper is to propose a much simpler and readily reproducible pipeline which, in a nutshell, is based on efficient in-context learning by prompting LLaMA (1 and 2) using question-informative captions as contextual information. Contrary to recent approaches, our method is training-free, does not require access to external databases or APIs, and yet achieves state-of-the-art accuracy on the OK-VQA and A-OK-VQA datasets. Finally, we perform several ablation studies to understand important aspects of our method. Our code is publicly available at https://github.com/alexandrosXe/ASimple-Baseline-For-Knowledge-Based-VQA

* Accepted at EMNLP 2023 (camera-ready version)

Improving Speaker Verification with Self-Pretrained Transformer Models

May 17, 2023

Junyi Peng, Oldřich Plchot, Themos Stafylakis, Ladislav Mošner, Lukáš Burget, Jan Černocký

Abstract: Recently, fine-tuning large pre-trained Transformer models using downstream datasets has attracted rising interest. Despite their success, it is still challenging to disentangle the benefits of large-scale datasets and Transformer structures from the limitations of the pre-training. In this paper, we introduce a hierarchical training approach, named self-pretraining, in which Transformer models are pretrained and finetuned on the same dataset. Three pre-trained models including HuBERT, Conformer and WavLM are evaluated on four different speaker verification datasets with varying sizes. Our experiments show that these self-pretrained models achieve competitive performance on downstream speaker verification tasks, such as VoxCeleb1 and CNCeleb1, with only one-third of the data compared to Librispeech pretraining. Furthermore, when pre-training only on the VoxCeleb2-dev, the Conformer model outperforms the one pre-trained on 94k hours of data using the same fine-tuning settings.

* Accepted to Interspeech 2023

Speech-based emotion recognition with self-supervised models using attentive channel-wise correlations and label smoothing

Nov 03, 2022

Sofoklis Kakouros, Themos Stafylakis, Ladislav Mošner, Lukáš Burget

Abstract: When recognizing emotions from speech, we encounter two common problems: how to optimally capture emotion-relevant information from the speech signal and how to best quantify or categorize the noisy subjective emotion labels. Self-supervised pre-trained representations can robustly capture information from speech, enabling state-of-the-art results in many downstream tasks including emotion recognition. However, better ways of aggregating the information across time need to be considered, as the relevant emotion information is likely to appear piecewise and not uniformly across the signal. For the labels, we need to take into account that there is a substantial degree of noise that comes from the subjective human annotations. In this paper, we propose a novel approach to attentive pooling based on correlations between the representations' coefficients combined with label smoothing, a method aiming to reduce the confidence of the classifier on the training labels. We evaluate our proposed approach on the benchmark dataset IEMOCAP, and demonstrate high performance surpassing that in the literature. The code to reproduce the results is available at github.com/skakouros/s3prl_attentive_correlation.

* Submitted to IEEE-ICASSP 2023
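
Label smoothing, described in the abstract as a way to reduce the classifier's confidence on the training labels, can be sketched as mixing the one-hot target with a uniform distribution. The snippet below is a generic illustration and not code from the linked repository: the function names, the smoothing factor of 0.1, the four classes and the toy logits are assumptions.

```python
import math

def smooth_labels(target, num_classes, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution over all classes.

    `epsilon` is an illustrative smoothing factor, not a value from the paper.
    """
    uniform = epsilon / num_classes
    return [(1.0 - epsilon) * (1.0 if k == target else 0.0) + uniform
            for k in range(num_classes)]

def cross_entropy(logits, target_dist):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(p * (l - log_z) for p, l in zip(target_dist, logits))

logits = [8.0, 0.0, 0.0, 0.0]      # a very confident toy prediction
hard = [1.0, 0.0, 0.0, 0.0]        # one-hot label
soft = smooth_labels(0, 4)         # smoothed label

print(soft)
print(cross_entropy(logits, hard), cross_entropy(logits, soft))
```

With the smoothed target, an extremely confident prediction is penalized even when it is correct, which discourages the model from fitting noisy annotations too closely.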
