Kevin El Haddad

Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice

Aug 24, 2025

Hugo Bohy, Minh Tran, Kevin El Haddad, Thierry Dutoit, Mohammad Soleymani

Abstract: Human social behaviors are inherently multimodal, necessitating the development of powerful audiovisual models for their perception. In this paper, we present Social-MAE, our pre-trained audiovisual Masked Autoencoder based on an extended version of Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which is pre-trained on audiovisual social data. Specifically, we modify CAV-MAE to receive a larger number of frames as input and pre-train it on a large dataset of human social interaction (VoxCeleb2) in a self-supervised manner. We demonstrate the effectiveness of this model by fine-tuning and evaluating it on different social and affective downstream tasks, namely emotion recognition, laughter detection, and apparent personality estimation. The model achieves state-of-the-art results on multimodal emotion recognition and laughter recognition and competitive results for apparent personality estimation, demonstrating the effectiveness of in-domain self-supervised pre-training. Code and model weights are available at https://github.com/HuBohy/SocialMAE.

* 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG)
* 5 pages, 3 figures, IEEE FG 2024 conference

ASR Benchmarking: Need for a More Representative Conversational Dataset

Sep 18, 2024

Gaurav Maheshwari, Dmitry Ivanov, Théo Johannet, Kevin El Haddad

Abstract: Automatic Speech Recognition (ASR) systems have achieved remarkable performance on widely used benchmarks such as LibriSpeech and Fleurs. However, these benchmarks do not adequately reflect the complexities of real-world conversational environments, where speech is often unstructured and contains disfluencies such as pauses, interruptions, and diverse accents. In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. Our results show a significant performance drop across various state-of-the-art ASR models when tested in conversational settings. Furthermore, we observe a correlation between Word Error Rate and the presence of speech disfluencies, highlighting the critical need for more realistic, conversational ASR benchmarks.
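
For readers unfamiliar with the metric named in this abstract, Word Error Rate is conventionally the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The sketch below is a minimal illustration under that standard definition; the function name `word_error_rate` and the whitespace tokenization are assumptions for this example and are not taken from the paper, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, which is an assumption of this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # A disfluent hypothesis with a filler and a repetition adds two insertions.
    print(word_error_rate("i think so", "um i i think so"))
```

Because insertions count as errors, a transcript that keeps fillers or repetitions absent from the reference can push the rate above what substitutions alone would give, and the value can exceed 1.0.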

Efficacy of Synthetic Data as a Benchmark

Sep 18, 2024

Gaurav Maheshwari, Dmitry Ivanov, Kevin El Haddad

Abstract: Large language models (LLMs) have enabled a range of applications in zero-shot and few-shot learning settings, including the generation of synthetic datasets for training and testing. However, to reliably use these synthetic datasets, it is essential to understand how representative they are of real-world data. We investigate this by assessing the effectiveness of generating synthetic data with LLMs and using it as a benchmark for various NLP tasks. Our experiments across six datasets and three different tasks show that while synthetic data can effectively capture the performance of various methods for simpler tasks, such as intent classification, it falls short for more complex tasks like named entity recognition. Additionally, we propose a new metric called the bias factor, which evaluates the biases introduced when the same LLM is used both to generate benchmarking data and to perform the tasks. We find that smaller LLMs exhibit biases towards their own generated data, whereas larger models do not. Overall, our findings suggest that the effectiveness of synthetic data as a benchmark varies depending on the task, and that practitioners should rely on data generated from multiple larger models whenever possible.

A New Perspective on Smiling and Laughter Detection: Intensity Levels Matter

Mar 04, 2024

Hugo Bohy, Kevin El Haddad, Thierry Dutoit

Abstract: Smile and laugh detection systems have attracted a lot of attention in the past decade, contributing to the improvement of human-agent interaction systems. However, very few have considered these expressions as distinct, although no prior work clearly proves whether or not they belong to the same category. In this work, we present a deep learning-based multimodal smile and laugh classification system, considering them as two different entities. We compare the use of audio and vision-based models as well as a fusion approach. We show that, as expected, the fusion leads to better generalization on unseen data. We also present an in-depth analysis of the behavior of these models on smile and laugh intensity levels. The analyses of the intensity levels show that the relationship between smiles and laughs might not be as simple as a binary one, or even as grouping them in a single category, and so a more complex approach should be taken when dealing with them. We also tackle the problem of limited resources by showing that transfer learning allows the models to improve the detection of confusing intensity levels.

* In 2022 10th International Conference on Affective Computing and Intelligent Interaction (ACII) (pp. 1-8). IEEE

Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs

Feb 20, 2024

Nicolas Boizard, Kevin El Haddad, Céline Hudelot, Pierre Colombo

Abstract: Deploying large language models (LLMs) of several billion parameters can be impractical in most industrial use cases due to constraints such as cost, latency limitations, and hardware accessibility. Knowledge distillation (KD) offers a solution by compressing knowledge from resource-intensive large models to smaller ones. Various strategies exist, some relying on the text generated by the teacher model and optionally utilizing its logits to enhance learning. However, these logit-based methods often require both teacher and student models to share the same tokenizer, limiting their applicability across different LLM families. In this paper, we introduce Universal Logit Distillation (ULD) loss, grounded in optimal transport, to address this limitation. Our experimental results demonstrate the effectiveness of ULD loss in enabling distillation across models with different architectures and tokenizers, paving the way to a more widespread use of distillation techniques.

* 9 pages, 5 figures
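
The abstract does not spell out the loss, so the following is only a rough sketch of how an optimal-transport-style comparison can sidestep the shared-tokenizer requirement. It assumes one common simplification: sort each next-token distribution in descending order, zero-pad the shorter one, and sum the absolute differences, so that no token-to-token alignment between vocabularies is needed. The function name `sorted_l1_distance` and this exact formulation are assumptions for illustration, not the authors' confirmed implementation.

```python
from typing import Sequence


def sorted_l1_distance(student_probs: Sequence[float],
                       teacher_probs: Sequence[float]) -> float:
    """Compare two probability vectors defined over different vocabularies.

    Hypothetical illustration: both vectors are sorted in descending order
    and the shorter one is zero-padded, so only the shape of the
    distributions is compared, never the token identities.
    """
    size = max(len(student_probs), len(teacher_probs))
    student = sorted(student_probs, reverse=True)
    teacher = sorted(teacher_probs, reverse=True)
    # Zero-padding gives the smaller vocabulary extra tokens of probability 0.
    student += [0.0] * (size - len(student))
    teacher += [0.0] * (size - len(teacher))
    return sum(abs(s - t) for s, t in zip(student, teacher))


if __name__ == "__main__":
    # Student has a 3-token vocabulary, teacher a 4-token one.
    student = [0.7, 0.2, 0.1]
    teacher = [0.1, 0.6, 0.2, 0.1]
    print(sorted_l1_distance(student, teacher))
```

In a training loop, a term like this would typically be computed per generated position and added to the usual cross-entropy objective with a weighting coefficient; that combination is likewise an assumption here.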

Deep learning-based stereo camera multi-video synchronization

Mar 22, 2023

Nicolas Boizard, Kevin El Haddad, Thierry Ravet, François Cresson, Thierry Dutoit

Abstract: Stereo vision is essential for many applications. Currently, the synchronization of the streams coming from two cameras is done mostly in hardware. A software-based synchronization method would reduce the cost, weight, and size of the entire system and allow for more flexibility when building such systems. With this goal in mind, we present here a comparison of different deep learning-based systems and show that some are efficient and generalizable enough for such a task. This study paves the way to a production-ready software-based video synchronization system.

* 5 pages, 4 figures, Accepted at ICASSP 2023

Analysis and Assessment of Controllability of an Expressive Deep Learning-based TTS system

Mar 06, 2021

Noé Tits, Kevin El Haddad, Thierry Dutoit

Abstract: In this paper, we study the controllability of an Expressive TTS system trained on a dataset for continuous control. The dataset is the Blizzard 2013 dataset, based on audiobooks read by a female speaker and containing great variability in styles and expressiveness. Controllability is evaluated with both an objective and a subjective experiment. The objective assessment is based on a measure of correlation between acoustic features and the dimensions of the latent space representing expressiveness. The subjective assessment is based on a perceptual experiment in which users are shown an interface for Controllable Expressive TTS and asked to retrieve a synthetic utterance whose expressiveness subjectively corresponds to that of a reference utterance.

ICE-Talk: an Interface for a Controllable Expressive Talking Machine

Aug 25, 2020

Noé Tits, Kevin El Haddad, Thierry Dutoit

Abstract: ICE-Talk is an open-source web-based GUI that allows the use of a TTS system with controllable parameters via a text field and a clickable 2D plot. It enables the study of latent spaces for controllable TTS. Moreover, it is implemented as a module that can be used as part of a Human-Agent interaction.

Laughter Synthesis: Combining Seq2seq modeling with Transfer Learning

Aug 20, 2020

Noé Tits, Kevin El Haddad, Thierry Dutoit

Abstract: Despite the growing interest in expressive speech synthesis, synthesis of nonverbal expressions is an under-explored area. In this paper, we propose an audio laughter synthesis system based on a sequence-to-sequence TTS synthesis system. We leverage transfer learning by training a deep learning model to generate both speech and laughs from annotations. We evaluate our model with a listening test, comparing its performance to that of an HMM-based laughter synthesis system, and find that it reaches higher perceived naturalness. Our solution is a first step towards a TTS system that would be able to synthesize speech with control over amusement level and with laughter integration.

The Theory behind Controllable Expressive Speech Synthesis: a Cross-disciplinary Approach

Oct 14, 2019

Noé Tits, Kevin El Haddad, Thierry Dutoit

Abstract: As part of the Human-Computer Interaction field, expressive speech synthesis is a very rich domain, as it requires knowledge in areas such as machine learning, signal processing, sociology, and psychology. In this chapter, we focus mostly on the technical side. From the recording of expressive speech to its modeling, the reader will get an overview of the main paradigms used in this field, through some of the most prominent systems and methods. We explain how speech can be represented and encoded with audio features. We present a history of the main methods of Text-to-Speech synthesis: concatenative, parametric, and statistical parametric speech synthesis. Finally, we focus on the last of these, with the latest techniques modeling Text-to-Speech synthesis as a sequence-to-sequence problem. This enables the use of Deep Learning blocks such as Convolutional and Recurrent Neural Networks as well as Attention Mechanisms. The last part of the chapter assembles the different aspects of the theory and summarizes the concepts.

* 19 pages, 6 figures. To be published in the book "Human Computer Interaction" edited by Prof. Yves Rybarczyk, published by IntechOpen
