Gaël Richard

S2A, IDS, LTCI, IP Paris

Diff-TONE: Timestep Optimization for iNstrument Editing in Text-to-Music Diffusion Models

Jun 18, 2025

Teysir Baoueb, Xiaoyu Bie, Xi Wang, Gaël Richard

Abstract:Breakthroughs in text-to-music generation models are transforming the creative landscape, equipping musicians with innovative tools for composition and experimentation like never before. However, controlling the generation process to achieve a specific desired outcome remains a significant challenge. Even a minor change in the text prompt, combined with the same random seed, can drastically alter the generated piece. In this paper, we explore the application of existing text-to-music diffusion models for instrument editing. Specifically, for an existing audio track, we aim to leverage a pretrained text-to-music diffusion model to edit the instrument while preserving the underlying content. Based on the insight that the model first focuses on the overall structure or content of the audio, then adds instrument information, and finally refines the quality, we show that selecting a well-chosen intermediate timestep, identified through an instrument classifier, yields a balance between preserving the original piece's content and achieving the desired timbre. Our method does not require additional training of the text-to-music diffusion model, nor does it compromise the generation process's speed.

AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder

Jan 09, 2025

Samir Sadok, Simon Leglaive, Laurent Girin, Gaël Richard, Xavier Alameda-Pineda

Abstract:This article introduces AnCoGen, a novel method that leverages a masked autoencoder to unify the analysis, control, and generation of speech signals within a single model. AnCoGen can analyze speech by estimating key attributes, such as speaker identity, pitch, content, loudness, signal-to-noise ratio, and clarity index. In addition, it can generate speech from these attributes and allow precise control of the synthesized speech by modifying them. Extensive experiments demonstrated the effectiveness of AnCoGen across speech analysis-resynthesis, pitch estimation, pitch modification, and speech enhancement.

* 5 pages, https://samsad35.github.io/site-ancogen

Multiple Choice Learning for Efficient Speech Separation with Many Speakers

Nov 27, 2024

David Perera, François Derrida, Théo Mariotte, Gaël Richard, Slim Essid

Abstract:Training speech separation models in the supervised setting raises a permutation problem: finding the best assignment between the model predictions and the ground-truth separated signals. This inherently ambiguous task is customarily solved using Permutation Invariant Training (PIT). In this article, we instead consider using the Multiple Choice Learning (MCL) framework, which was originally introduced to tackle ambiguous tasks. We demonstrate experimentally on the popular WSJ0-mix and LibriMix benchmarks that MCL matches the performance of PIT, while being computationally advantageous. This opens the door to a promising research direction, as MCL can be naturally extended to handle a variable number of speakers, or to tackle speech separation in the unsupervised setting.
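
The permutation problem mentioned in this abstract can be illustrated with a minimal brute-force sketch of a PIT-style loss. This is not the authors' code: the function names `mse` and `pit_loss`, the use of mean squared error, and the toy signals are assumptions made for illustration only.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(predictions, targets):
    """Brute-force permutation-invariant loss (illustrative only).

    Tries every assignment of targets to predictions and keeps the
    cheapest one. best_perm[i] is the index of the target matched
    with predictions[i].
    """
    n = len(predictions)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        loss = sum(mse(predictions[i], targets[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the model outputs the two sources in swapped order.
predictions = [[1.0, 1.0], [0.0, 0.0]]
targets = [[0.0, 0.0], [1.0, 1.0]]
loss, perm = pit_loss(predictions, targets)
```

The exhaustive search above visits every permutation, so its cost grows factorially with the number of speakers; this is one way to see why cheaper alternatives are attractive when many speakers are present.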

Episodic fine-tuning prototypical networks for optimization-based few-shot learning: Application to audio classification

Oct 04, 2024

Xuanyu Zhuang, Geoffroy Peeters, Gaël Richard

Abstract:The Prototypical Network (ProtoNet) has emerged as a popular choice in Few-shot Learning (FSL) scenarios due to its remarkable performance and straightforward implementation. Building upon such success, we first propose a simple (yet novel) method to fine-tune a ProtoNet on the (labeled) support set of a C-way-K-shot test episode (without using the query set, which is only used for evaluation). We then propose an algorithmic framework that combines ProtoNet with optimization-based FSL algorithms (MAML and Meta-Curvature) to work with such a fine-tuning method. Since optimization-based algorithms endow the target learner model with the ability to adapt quickly to only a few samples, we utilize ProtoNet as the target model to enhance its fine-tuning performance with the help of a specifically designed episodic fine-tuning strategy. The experimental results confirm that our proposed models, MAML-Proto and MC-Proto, combined with our unique fine-tuning method, outperform regular ProtoNet by a large margin in few-shot audio classification tasks on the ESC-50 and Speech Commands v2 datasets. We note that although we have only applied our model to the audio domain, it is a general method and can be easily extended to other domains.

* 2024 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2024), Sep 2024, London, United Kingdom
* Accepted at MLSP 2024
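
For readers unfamiliar with the baseline this paper builds on, here is a minimal sketch of how a standard prototypical network classifies a query in a C-way-K-shot episode: each class prototype is the mean of its support embeddings, and the query takes the label of the nearest prototype. The embeddings are assumed to be precomputed, the labels and vectors are made up, and the episodic fine-tuning proposed in the paper is not shown.

```python
def compute_prototypes(support):
    """Average the support embeddings of each class.

    support maps a class label to a list of embedding vectors
    (assumed to come from some already-trained encoder).
    """
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return protos


def classify(query, protos):
    """Return the label whose prototype is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(protos, key=lambda label: sq_dist(query, protos[label]))


# Hypothetical 2-way-2-shot episode with 2-dimensional embeddings.
support = {
    "dog": [[0.0, 0.0], [2.0, 0.0]],
    "rain": [[10.0, 10.0], [12.0, 10.0]],
}
protos = compute_prototypes(support)
predicted = classify([1.5, 0.5], protos)
```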

Using Random Codebooks for Audio Neural AutoEncoders

Sep 25, 2024

Benoît Giniès, Xiaoyu Bie, Olivier Fercoq, Gaël Richard

Abstract:Latent representation learning has been an active field of study for decades in numerous applications. Inspired, among others, by tokenization in Natural Language Processing and motivated by the search for a simple data representation, recent works have introduced a quantization step into feature extraction. In this work, we propose a novel strategy to build the neural discrete representation by means of random codebooks. These codebooks are obtained by randomly sampling a large, predefined fixed codebook. We experimentally show the merits and potential of our approach in a task of audio compression and reconstruction.

* European Signal Processing Conference 2024 (EUSIPCO), Aug 2024, Lyon, France
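
The idea of drawing a working codebook from a large fixed one can be sketched as follows. This is only an illustration of the general mechanism described in the abstract, not the paper's implementation: the Gaussian master codebook, the sizes (1024 and 64), the dimension (4), and the function names are all assumptions.

```python
import random


def sample_codebook(master, size, rng):
    """Draw `size` distinct entries from the fixed master codebook."""
    return rng.sample(master, size)


def quantize(vector, codebook):
    """Nearest-neighbour quantization: return (index, code) of the closest entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


rng = random.Random(0)
# Assumed master codebook: 1024 fixed vectors of dimension 4.
master = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(1024)]
codebook = sample_codebook(master, 64, rng)
index, code = quantize([0.1, -0.2, 0.3, 0.0], codebook)
```

In this sketch only the integer `index` would need to be stored or transmitted, provided the decoder can rebuild the same sampled codebook.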

Learning Source Disentanglement in Neural Audio Codec

Sep 17, 2024

Xiaoyu Bie, Xubo Liu, Gaël Richard

Abstract:Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential discrepancies between sound domains like speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks, sets of discrete representations. Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space, thereby enhancing interpretability in audio codec and providing potential finer control over the audio generation process.

* project page: https://xiaoyubie1994.github.io/sdcodec/

Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing

Jul 22, 2024

David Perera, Victor Letzelter, Théo Mariotte, Adrien Cortés, Mickael Chen, Slim Essid, Gaël Richard

Abstract:We introduce Annealed Multiple Choice Learning (aMCL) which combines simulated annealing with MCL. MCL is a learning framework handling ambiguous tasks by predicting a small set of plausible hypotheses. These hypotheses are trained using the Winner-takes-all (WTA) scheme, which promotes the diversity of the predictions. However, this scheme may converge toward an arbitrarily suboptimal local minimum, due to the greedy nature of WTA. We overcome this limitation using annealing, which enhances the exploration of the hypothesis space during training. We leverage insights from statistical physics and information theory to provide a detailed description of the model training trajectory. Additionally, we validate our algorithm by extensive experiments on synthetic datasets, on the standard UCI benchmark, and on speech separation.
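
The Winner-takes-all scheme described in this abstract can be sketched in a few lines. The example below uses scalar hypotheses, a squared-error loss, and a plain gradient step, all of which are simplifying assumptions; the annealed variant proposed in the paper is not shown.

```python
def wta_loss(hypotheses, target):
    """Return (loss, winner): only the closest hypothesis contributes to the loss."""
    errors = [(h - target) ** 2 for h in hypotheses]
    winner = min(range(len(errors)), key=errors.__getitem__)
    return errors[winner], winner


def wta_step(hypotheses, target, lr=0.25):
    """One gradient step on the WTA loss: only the winning hypothesis moves."""
    _, winner = wta_loss(hypotheses, target)
    updated = list(hypotheses)
    # Gradient of (h - target)^2 with respect to h is 2 * (h - target).
    updated[winner] -= lr * 2 * (updated[winner] - target)
    return updated


hypotheses = [0.0, 10.0]
loss, winner = wta_loss(hypotheses, 8.0)
after = wta_step(hypotheses, 8.0)
```

Because the losing hypothesis receives no update, a poorly initialized hypothesis may never win and never move, which is the kind of greedy behavior the abstract refers to.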

Speech dereverberation constrained on room impulse response characteristics

Jul 10, 2024

Louis Bahrman, Mathieu Fontaine, Jonathan Le Roux, Gaël Richard

Abstract:Single-channel speech dereverberation aims at extracting a dry speech signal from a recording affected by the acoustic reflections in a room. However, most current deep learning-based approaches for speech dereverberation are not interpretable for room acoustics, and can be considered as black-box systems in that regard. In this work, we address this problem by regularizing the training loss using a novel physical coherence loss which encourages the room impulse response (RIR) induced by the dereverberated output of the model to match the acoustic properties of the room in which the signal was recorded. Our investigation demonstrates the preservation of the original dereverberated signal alongside the provision of a more physically coherent RIR.

* INTERSPEECH, Sep 2024, Kos Island, Greece

Structure-informed Positional Encoding for Music Generation

Feb 28, 2024

Manvi Agarwal, Changhong Wang, Gaël Richard

Abstract:Music generated by deep learning methods often suffers from a lack of coherence and long-term organization. Yet, multi-scale hierarchical structure is a distinctive feature of music signals. To leverage this information, we propose a structure-informed positional encoding framework for music generation with Transformers. We design three variants in terms of absolute, relative and non-stationary positional information. We comprehensively test them on two symbolic music generation tasks: next-timestep prediction and accompaniment generation. As a comparison, we choose multiple baselines from the literature and demonstrate the merits of our methods using several musically-motivated evaluation metrics. In particular, our methods improve the melodic and structural consistency of the generated pieces.

* IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2024, Seoul, South Korea

Unsupervised Harmonic Parameter Estimation Using Differentiable DSP and Spectral Optimal Transport

Jan 15, 2024

Bernardo Torres, Geoffroy Peeters, Gaël Richard

Abstract:In neural audio signal processing, pitch conditioning has been used to enhance the performance of synthesizers. However, jointly training pitch estimators and synthesizers is a challenge when using standard audio-to-audio reconstruction loss, leading to reliance on external pitch trackers. To address this issue, we propose using a spectral loss function inspired by optimal transportation theory that minimizes the displacement of spectral energy. We validate this approach through an unsupervised autoencoding task that fits a harmonic template to harmonic signals. We jointly estimate the fundamental frequency and amplitudes of harmonics using a lightweight encoder and reconstruct the signals using a differentiable harmonic synthesizer. The proposed approach offers a promising direction for improving unsupervised parameter estimation in neural audio applications.

* IEEE International Conference on Acoustics, Speech and Signal Processing, Apr 2024, Seoul, South Korea
* Accepted in ICASSP 2024
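
To give an intuition for a loss that measures the displacement of spectral energy, the sketch below computes the 1-D Wasserstein-1 distance between two magnitude spectra defined on the same uniform frequency grid, using the standard identity that in one dimension this distance equals the area between the two cumulative distributions. It is not the loss used in the paper; the function name, the normalization, and the toy spectra are assumptions.

```python
def spectral_w1(spec_a, spec_b, bin_width=1.0):
    """Wasserstein-1 distance between two spectra on a shared uniform grid.

    Each spectrum is normalized to sum to 1, then the absolute difference
    of the cumulative sums is accumulated and scaled by the bin width.
    """
    total_a, total_b = sum(spec_a), sum(spec_b)
    cdf_a = cdf_b = 0.0
    distance = 0.0
    for a, b in zip(spec_a, spec_b):
        cdf_a += a / total_a
        cdf_b += b / total_b
        distance += abs(cdf_a - cdf_b) * bin_width
    return distance


# A single spectral peak at bin 2 versus the same peak at bin 5.
peak_low = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
peak_high = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
shift_cost = spectral_w1(peak_low, peak_high)
```

Unlike a bin-wise spectral difference, which is the same for any two non-overlapping peaks, this quantity grows with how far the peak has moved in frequency.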
