Gautier Marti

Residual Speech Embeddings for Tone Classification: Removing Linguistic Content to Enhance Paralinguistic Analysis

Feb 26, 2025

Hamdan Al Ahbabi, Gautier Marti, Saeed AlMarri, Ibrahim Elfadel

Abstract: Self-supervised learning models for speech processing, such as wav2vec2, HuBERT, WavLM, and Whisper, generate embeddings that capture both linguistic and paralinguistic information, making it challenging to analyze tone independently of spoken content. In this work, we introduce a method for disentangling paralinguistic features from linguistic content by regressing speech embeddings onto their corresponding text embeddings and using the residuals as a representation of vocal tone. We evaluate this approach across multiple self-supervised speech embeddings, demonstrating that residual embeddings significantly improve tone classification performance compared to raw speech embeddings. Our results show that this method enhances linear separability, enabling improved classification even with simple models such as logistic regression. Visualization of the residual embeddings further confirms the successful removal of linguistic information while preserving tone-related features. These findings highlight the potential of residual embeddings for applications in sentiment analysis, speaker characterization, and paralinguistic speech processing.
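
The residual idea described in this abstract can be illustrated with a minimal sketch. It assumes an ordinary least-squares regression with an intercept and uses synthetic arrays in place of real speech and text embeddings; the function name `residual_embeddings` and all dimensions are hypothetical, and the paper's actual regression setup may differ.

```python
import numpy as np

def residual_embeddings(speech, text):
    """Return the residuals of a least-squares regression of speech on text.

    speech: (n_samples, d_speech) array of speech embeddings
    text:   (n_samples, d_text) array of matching text embeddings
    """
    # Append a column of ones so the regression includes an intercept.
    design = np.hstack([text, np.ones((text.shape[0], 1))])
    # Solve min ||design @ coef - speech||^2 for all output dimensions at once.
    coef, *_ = np.linalg.lstsq(design, speech, rcond=None)
    # What the text embeddings cannot explain linearly is kept as the residual.
    return speech - design @ coef

# Synthetic stand-ins (assumption): speech = linear function of text + "tone".
rng = np.random.default_rng(0)
text = rng.normal(size=(200, 8))
tone = rng.normal(size=(200, 16))
speech = text @ rng.normal(size=(8, 16)) + tone

resid = residual_embeddings(speech, text)
```

By construction the residuals are linearly uncorrelated with the text embeddings, which is the sense in which linguistic content is removed in this sketch; a downstream classifier such as logistic regression would then be trained on `resid` instead of `speech`.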

Enriching Datasets with Demographics through Large Language Models: What's in a Name?

Sep 17, 2024

Khaled AlNuaimi, Gautier Marti, Mathieu Ravaut, Abdulla AlKetbi, Andreas Henschel, Raed Jaradat

Abstract: Enriching datasets with demographic information, such as gender, race, and age from names, is a critical task in fields like healthcare, public policy, and social sciences. Such demographic insights allow for more precise and effective engagement with target populations. Despite previous efforts employing hidden Markov models and recurrent neural networks to predict demographics from names, significant limitations persist: the lack of large-scale, well-curated, unbiased, publicly available datasets, and the lack of an approach robust across datasets. This scarcity has hindered the development of traditional supervised learning approaches. In this paper, we demonstrate that the zero-shot capabilities of Large Language Models (LLMs) can perform as well as, if not better than, bespoke models trained on specialized data. We apply these LLMs to a variety of datasets, including a real-life, unlabelled dataset of licensed financial professionals in Hong Kong, and critically assess the inherent demographic biases in these models. Our work not only advances the state-of-the-art in demographic enrichment but also opens avenues for future research in mitigating biases in LLMs.

* 8 pages, 7 Tables, 5 Figures

cCorrGAN: Conditional Correlation GAN for Learning Empirical Conditional Distributions in the Elliptope

Jul 22, 2021

Gautier Marti, Victor Goubet, Frank Nielsen

Abstract: We propose a methodology to approximate conditional distributions in the elliptope of correlation matrices based on conditional generative adversarial networks. We illustrate the methodology with an application from quantitative finance: Monte Carlo simulations of correlated returns to compare risk-based portfolio construction methods. Finally, we discuss current limitations and advocate for further exploration of the elliptope geometry to improve results.

* GSI 2021: Geometric Science of Information pp 613-620
* International Conference on Geometric Science of Information

CorrGAN: Sampling Realistic Financial Correlation Matrices Using Generative Adversarial Networks

Oct 21, 2019

Gautier Marti

Abstract: We propose a novel approach for sampling realistic financial correlation matrices. This approach is based on generative adversarial networks. Experiments demonstrate that generative adversarial networks are able to recover most of the known stylized facts about empirical correlation matrices estimated on asset returns. This is the first time such results are documented in the literature. Practical financial applications range from trading strategies enhancement to risk and portfolio stress testing. Such generative models can also help ground empirical finance deeper into science by allowing for falsifiability of statements and more objective comparison of empirical methods.

Autoregressive Convolutional Neural Networks for Asynchronous Time Series

Jun 12, 2018

Mikołaj Bińkowski, Gautier Marti, Philippe Donnat

Abstract: We propose Significance-Offset Convolutional Neural Network, a deep convolutional network architecture for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive (AR) models and gating mechanisms used in recurrent neural networks. It involves an AR-like weighting system, where the final predictor is obtained as a weighted sum of adjusted regressors, while the weights are data-dependent functions learnt through a convolutional network. The architecture was designed for applications on asynchronous time series and is evaluated on such datasets: a hedge fund proprietary dataset of over 2 million quotes for a credit derivative index, an artificially generated noisy autoregressive series, and the UCI household electricity consumption dataset. The proposed architecture achieves promising results as compared to convolutional and recurrent neural networks.

* Proceedings of The 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 2018, to appear

Putting Self-Supervised Token Embedding on the Tables

Oct 25, 2017

Marc Szafraniec, Gautier Marti, Philippe Donnat

Abstract: Information distribution by electronic messages is a privileged means of transmission for many businesses and individuals, often in the form of plain-text tables. As their number grows, it becomes necessary to use an algorithm rather than a human to extract text and numbers. Usual methods rely on regular expressions or on a strict structure in the data, but are not effective when there are many variations, fuzzy structure, or implicit labels. In this paper we introduce SC2T, a totally self-supervised model that addresses these issues by constructing vector representations of tokens in semi-structured messages using both character and context levels. It can then be used for unsupervised labeling of tokens, or as the basis for a semi-supervised information extraction system.

Optimal Transport vs. Fisher-Rao distance between Copulas for Clustering Multivariate Time Series

Nov 14, 2016

Gautier Marti, Sébastien Andler, Frank Nielsen, Philippe Donnat

Abstract: We present a methodology for clustering N objects which are described by multivariate time series, i.e. several sequences of real-valued random variables. This clustering methodology leverages copulas, which are distributions encoding the dependence structure between several random variables. To fully take into account the dependence information while clustering, we need a distance between copulas. In this work, we compare renowned distances between distributions: the Fisher-Rao geodesic distance, related divergences and optimal transport, and discuss their advantages and disadvantages. Applications of such methodology can be found in the clustering of financial assets. A tutorial, experiments and implementation for reproducible research can be found at www.datagrapple.com/Tech.

* Accepted at IEEE Workshop on Statistical Signal Processing (SSP 2016)

Exploring and measuring non-linear correlations: Copulas, Lightspeed Transportation and Clustering

Oct 30, 2016

Gautier Marti, Sébastien Andler, Frank Nielsen, Philippe Donnat

Abstract: We propose a methodology to explore and measure the pairwise correlations that exist between variables in a dataset. The methodology leverages copulas for encoding dependence between two variables, state-of-the-art optimal transport for providing a relevant geometry to the copulas, and clustering for summarizing the main dependence patterns found between the variables. Some of the cluster centers can be used to parameterize a novel dependence coefficient which can target or forget specific dependence patterns. Finally, we illustrate and benchmark the methodology on several datasets. Code and numerical experiments are available online for reproducible research.
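
As a rough illustration of how a copula encodes the dependence between two variables, the sketch below builds an empirical copula sample by replacing each observation with its normalized rank. This is a standard rank transform, not the authors' implementation; the function name `empirical_copula_sample` and the synthetic data are assumptions made for the example, and the optimal transport and clustering steps are not shown.

```python
import numpy as np

def empirical_copula_sample(x, y):
    """Map paired observations to normalized ranks in (0, 1].

    The resulting points depend only on the ordering of x and y, so they
    describe the dependence pattern independently of the marginals.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    n = x.shape[0]
    # argsort of argsort gives the 0-based rank of each observation.
    u = (np.argsort(np.argsort(x)) + 1) / n
    v = (np.argsort(np.argsort(y)) + 1) / n
    return np.column_stack([u, v])

# Synthetic non-linear dependence (assumption): y is a noisy function of x squared.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = x ** 2 + 0.1 * rng.normal(size=500)

uv = empirical_copula_sample(x, y)
# Strictly increasing transforms of the marginals leave the ranks unchanged.
uv_transformed = empirical_copula_sample(np.exp(x), y ** 3)
```

Because only ranks are used, any strictly increasing rescaling of either variable yields the same sample, which is why copula-based comparisons focus on the dependence pattern itself.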

Clustering Financial Time Series: How Long is Enough?

Apr 14, 2016

Gautier Marti, Sébastien Andler, Frank Nielsen, Philippe Donnat

Abstract: Researchers have used from 30 days to several years of daily returns as source data for clustering financial time series based on their correlations. This paper sets up a statistical framework to study the validity of such practices. We first show that clustering correlated random variables from their observed values is statistically consistent. Then, we also give a first empirical answer to the much debated question: How long should the time series be? If too short, the clusters found can be spurious; if too long, dynamics can be smoothed out.

* Accepted at IJCAI 2016

Optimal Copula Transport for Clustering Multivariate Time Series

Jan 11, 2016

Gautier Marti, Frank Nielsen, Philippe Donnat

Abstract: This paper presents a new methodology for clustering multivariate time series leveraging optimal transport between copulas. Copulas are used to encode both (i) intra-dependence of a multivariate time series, and (ii) inter-dependence between two time series. Then, optimal copula transport allows us to define two distances between multivariate time series: (i) one for measuring intra-dependence dissimilarity, (ii) another one for measuring inter-dependence dissimilarity based on a new multivariate dependence coefficient which is robust to noise, deterministic, and which can target specified dependencies.

* Accepted at ICASSP 2016
