Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Slim Essid

IP Paris, IDS, S2A

Perceptual Noise-Masking with Music through Deep Spectral Envelope Shaping

Feb 24, 2025

Clémentine Berger, Roland Badeau, Slim Essid

Abstract:People often listen to music in noisy environments, seeking to isolate themselves from ambient sounds. Indeed, a music signal can mask some of the noise's frequency components due to the effect of simultaneous masking. In this article, we propose a neural network based on a psychoacoustic masking model, designed to enhance the music's ability to mask ambient noise by reshaping its spectral envelope with predicted filter frequency responses. The model is trained with a perceptual loss function that balances two constraints: effectively masking the noise while preserving the original music mix and the user's chosen listening level. We evaluate our approach on simulated data replicating a user's experience of listening to music with headphones in a noisy environment. The results, based on defined objective metrics, demonstrate that our system improves the state of the art.

* IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Apr 2025, Hyderabad, India

Via

Access Paper or Ask Questions

Masked Latent Prediction and Classification for Self-Supervised Audio Representation Learning

Feb 17, 2025

Aurian Quelennec, Pierre Chouteau, Geoffroy Peeters, Slim Essid

Abstract:Recently, self-supervised learning methods based on masked latent prediction have proven to encode input data into powerful representations. However, during training, the learned latent space can be further transformed to extract higher-level information that could be more suited for downstream classification tasks. Therefore, we propose a new method: MAsked latenT Prediction And Classification (MATPAC), which is trained with two pretext tasks solved jointly. As in previous work, the first pretext task is a masked latent prediction task, ensuring a robust input representation in the latent space. The second one is unsupervised classification, which utilises the latent representations of the first pretext task to match probability distributions between a teacher and a student. We validate the MATPAC method by comparing it to other state-of-the-art proposals and conducting ablations studies. MATPAC reaches state-of-the-art self-supervised learning results on reference audio classification datasets such as OpenMIC, GTZAN, ESC-50 and US8K and outperforms comparable supervised methods results for musical auto-tagging on Magna-tag-a-tune.

* Copyright 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Via

Access Paper or Ask Questions

TACO: Training-free Sound Prompted Segmentation via Deep Audio-visual CO-factorization

Dec 02, 2024

Hugo Malard, Michel Olvera, Stephane Lathuiliere, Slim Essid

Abstract:Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches tackle this problem by fine-tuning pre-trained models or by training additional modules specifically for the task. We adopt a different strategy: we introduce a training-free approach that leverages Non-negative Matrix Factorization (NMF) to co-factorize audio and visual features from pre-trained models to reveal shared interpretable concepts. These concepts are passed to an open-vocabulary segmentation model for precise segmentation maps. By using frozen pre-trained models, our method achieves high generalization and establishes state-of-the-art performance in unsupervised sound-prompted segmentation, significantly surpassing previous unsupervised methods.

Via

Access Paper or Ask Questions

Multiple Choice Learning for Efficient Speech Separation with Many Speakers

Nov 27, 2024

David Perera, François Derrida, Théo Mariotte, Gaël Richard, Slim Essid

Abstract:Training speech separation models in the supervised setting raises a permutation problem: finding the best assignation between the model predictions and the ground truth separated signals. This inherently ambiguous task is customarily solved using Permutation Invariant Training (PIT). In this article, we instead consider using the Multiple Choice Learning (MCL) framework, which was originally introduced to tackle ambiguous tasks. We demonstrate experimentally on the popular WSJ0-mix and LibriMix benchmarks that MCL matches the performances of PIT, while being computationally advantageous. This opens the door to a promising research direction, as MCL can be naturally extended to handle a variable number of speakers, or to tackle speech separation in the unsupervised setting.

Via

Access Paper or Ask Questions

A Contrastive Self-Supervised Learning scheme for beat tracking amenable to few-shot learning

Nov 06, 2024

Antonin Gagnere, Geoffroy Peeters, Slim Essid

Abstract:In this paper, we propose a novel Self-Supervised-Learning scheme to train rhythm analysis systems and instantiate it for few-shot beat tracking. Taking inspiration from the Contrastive Predictive Coding paradigm, we propose to train a Log-Mel-Spectrogram Transformer encoder to contrast observations at times separated by hypothesized beat intervals from those that are not. We do this without the knowledge of ground-truth tempo or beat positions, as we rely on the local maxima of a Predominant Local Pulse function, considered as a proxy for Tatum positions, to define candidate anchors, candidate positives (located at a distance of a power of two from the anchor) and negatives (remaining time positions). We show that a model pre-trained using this approach on the unlabeled FMA, MTT and MTG-Jamendo datasets can successfully be fine-tuned in the few-shot regime, i.e. with just a few annotated examples to get a competitive beat-tracking performance.

* ISMIR 2024, Nov 2024, San Francisco, Californ, United States

Via

Access Paper or Ask Questions

An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment

Oct 08, 2024

Hugo Malard, Michel Olvera, Stéphane Lathuiliere, Slim Essid

Figure 1 for An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment

Figure 2 for An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment

Figure 3 for An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment

Figure 4 for An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment

Abstract:Multimodal large language models have fueled progress in image captioning. These models, fine-tuned on vast image datasets, exhibit a deep understanding of semantic concepts. In this work, we show that this ability can be re-purposed for audio captioning, where the joint image-language decoder can be leveraged to describe auditory content associated with image sequences within videos featuring audiovisual content. This can be achieved via multimodal alignment. Yet, this multimodal alignment task is non-trivial due to the inherent disparity between audible and visible elements in real-world videos. Moreover, multimodal representation learning often relies on contrastive learning, facing the challenge of the so-called modality gap which hinders smooth integration between modalities. In this work, we introduce a novel methodology for bridging the audiovisual modality gap by matching the distributions of tokens produced by an audio backbone and those of an image captioner. Our approach aligns the audio token distribution with that of the image tokens, enabling the model to perform zero-shot audio captioning in an unsupervised fashion while keeping the initial image captioning component unaltered. This alignment allows for the use of either audio or audiovisual input by combining or substituting the image encoder with the aligned audio encoder. Our method achieves significantly improved performances in zero-shot audio captioning, compared to existing approaches.

Via

Access Paper or Ask Questions

SALT: Standardized Audio event Label Taxonomy

Sep 18, 2024

Paraskevas Stamatiadis, Michel Olvera, Slim Essid

Figure 1 for SALT: Standardized Audio event Label Taxonomy

Figure 2 for SALT: Standardized Audio event Label Taxonomy

Figure 3 for SALT: Standardized Audio event Label Taxonomy

Figure 4 for SALT: Standardized Audio event Label Taxonomy

Abstract:Machine listening systems often rely on fixed taxonomies to organize and label audio data, key for training and evaluating deep neural networks (DNNs) and other supervised algorithms. However, such taxonomies face significant constraints: they are composed of application-dependent predefined categories, which hinders the integration of new or varied sounds, and exhibits limited cross-dataset compatibility due to inconsistent labeling standards. To overcome these limitations, we introduce SALT: Standardized Audio event Label Taxonomy. Building upon the hierarchical structure of AudioSet's ontology, our taxonomy extends and standardizes labels across 24 publicly available environmental sound datasets, allowing the mapping of class labels from diverse datasets to a unified system. Our proposal comes with a new Python package designed for navigating and utilizing this taxonomy, easing cross-dataset label searching and hierarchical exploration. Notably, our package allows effortless data aggregation from diverse sources, hence easy experimentation with combined datasets.

* DCASE, Oct 2024, Tokyo, Japan

Via

Access Paper or Ask Questions

Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing

Jul 22, 2024

David Perera, Victor Letzelter, Théo Mariotte, Adrien Cortés, Mickael Chen, Slim Essid, Gaël Richard

Figure 1 for Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing

Figure 2 for Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing

Figure 3 for Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing

Figure 4 for Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing

Abstract:We introduce Annealed Multiple Choice Learning (aMCL) which combines simulated annealing with MCL. MCL is a learning framework handling ambiguous tasks by predicting a small set of plausible hypotheses. These hypotheses are trained using the Winner-takes-all (WTA) scheme, which promotes the diversity of the predictions. However, this scheme may converge toward an arbitrarily suboptimal local minimum, due to the greedy nature of WTA. We overcome this limitation using annealing, which enhances the exploration of the hypothesis space during training. We leverage insights from statistical physics and information theory to provide a detailed description of the model training trajectory. Additionally, we validate our algorithm by extensive experiments on synthetic datasets, on the standard UCI benchmark, and on speech separation.

Via

Access Paper or Ask Questions

Less Forgetting for Better Generalization: Exploring Continual-learning Fine-tuning Methods for Speech Self-supervised Representations

Jun 30, 2024

Salah Zaiem, Titouan Parcollet, Slim Essid

Abstract:Despite being trained on massive and diverse datasets, speech self-supervised encoders are generally used for downstream purposes as mere frozen feature extractors or model initializers before fine-tuning. The former severely limits the exploitation of large encoders, while the latter hurts the robustness acquired during pretraining, especially in low-resource scenarios. This work explores middle-ground solutions, conjecturing that reducing the forgetting of the self-supervised task during the downstream fine-tuning leads to better generalization. To prove this, focusing on speech recognition, we benchmark different continual-learning approaches during fine-tuning and show that they improve both in-domain and out-of-domain generalization abilities. Relative performance gains reach 15.7% and 22.5% with XLSR used as the encoder on two English and Danish speech recognition tasks. Further probing experiments show that these gains are indeed linked to less forgetting.

* 5 Pages

Via

Access Paper or Ask Questions

Winner-takes-all learners are geometry-aware conditional density estimators

Jun 07, 2024

Victor Letzelter, David Perera, Cédric Rommel, Mathieu Fontaine, Slim Essid, Gael Richard, Patrick Pérez

Figure 1 for Winner-takes-all learners are geometry-aware conditional density estimators

Figure 2 for Winner-takes-all learners are geometry-aware conditional density estimators

Figure 3 for Winner-takes-all learners are geometry-aware conditional density estimators

Figure 4 for Winner-takes-all learners are geometry-aware conditional density estimators

Abstract:Winner-takes-all training is a simple learning paradigm, which handles ambiguous tasks by predicting a set of plausible hypotheses. Recently, a connection was established between Winner-takes-all training and centroidal Voronoi tessellations, showing that, once trained, hypotheses should quantize optimally the shape of the conditional distribution to predict. However, the best use of these hypotheses for uncertainty quantification is still an open question.In this work, we show how to leverage the appealing geometric properties of the Winner-takes-all learners for conditional density estimation, without modifying its original training scheme. We theoretically establish the advantages of our novel estimator both in terms of quantization and density estimation, and we demonstrate its competitiveness on synthetic and real-world datasets, including audio data.

* International Conference on Machine Learning, Jul 2024, Vienne (Autriche), Austria

Via

Access Paper or Ask Questions