Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Javier Nistal

LiLAC: A Lightweight Latent ControlNet for Musical Audio Generation

Jun 13, 2025

Tom Baker, Javier Nistal

Abstract:Text-to-audio diffusion models produce high-quality and diverse music but many, if not most, of the SOTA models lack the fine-grained, time-varying controls essential for music production. ControlNet enables attaching external controls to a pre-trained generative model by cloning and fine-tuning its encoder on new conditionings. However, this approach incurs a large memory footprint and restricts users to a fixed set of controls. We propose a lightweight, modular architecture that considerably reduces parameter count while matching ControlNet in audio quality and condition adherence. Our method offers greater flexibility and significantly lower memory usage, enabling more efficient training and deployment of independent controls. We conduct extensive objective and subjective evaluations and provide numerous audio examples on the accompanying website at https://lightlatentcontrol.github.io

* Accepted at ISMIR 2025

Via

Access Paper or Ask Questions

Accompaniment Prompt Adherence: A Measure for Evaluating Music Accompaniment Systems

Mar 08, 2025

Maarten Grachten, Javier Nistal

Abstract:Generative systems of musical accompaniments are rapidly growing, yet there are no standardized metrics to evaluate how well generations align with the conditional audio prompt. We introduce a distribution-based measure called "Accompaniment Prompt Adherence" (APA), and validate it through objective experiments on synthetic data perturbations, and human listening tests. Results show that APA aligns well with human judgments of adherence and is discriminative to transformations that degrade adherence. We release a Python implementation of the metric using the widely adopted pre-trained CLAP embedding model, offering a valuable tool for evaluating and comparing accompaniment generation systems.

* Accepted for publication at the 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

Via

Access Paper or Ask Questions

Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation

Nov 27, 2024

Marco Pasini, Javier Nistal, Stefan Lattner, George Fazekas

Abstract:Autoregressive models are typically applied to sequences of discrete tokens, but recent research indicates that generating sequences of continuous embeddings in an autoregressive manner is also feasible. However, such Continuous Autoregressive Models (CAMs) can suffer from a decline in generation quality over extended sequences due to error accumulation during inference. We introduce a novel method to address this issue by injecting random noise into the input embeddings during training. This procedure makes the model robust against varying error levels at inference. We further reduce error accumulation through an inference procedure that introduces low-level noise. Experiments on musical audio generation show that CAM substantially outperforms existing autoregressive and non-autoregressive approaches while preserving audio quality over extended sequences. This work paves the way for generating continuous embeddings in a purely autoregressive setting, opening new possibilities for real-time and interactive generative applications.

* Accepted to NeurIPS 2024 - Audio Imagination Workshop

Via

Access Paper or Ask Questions

Improving Musical Accompaniment Co-creation via Diffusion Transformers

Oct 30, 2024

Javier Nistal, Marco Pasini, Stefan Lattner

Abstract:Building upon Diff-A-Riff, a latent diffusion model for musical instrument accompaniment generation, we present a series of improvements targeting quality, diversity, inference speed, and text-driven control. First, we upgrade the underlying autoencoder to a stereo-capable model with superior fidelity and replace the latent U-Net with a Diffusion Transformer. Additionally, we refine text prompting by training a cross-modality predictive network to translate text-derived CLAP embeddings to audio-derived CLAP embeddings. Finally, we improve inference speed by training the latent model using a consistency framework, achieving competitive quality with fewer denoising steps. Our model is evaluated against the original Diff-A-Riff variant using objective metrics in ablation experiments, demonstrating promising advancements in all targeted areas. Sound examples are available at: https://sonycslparis.github.io/improved_dar/.

* 5 pages; 1 table

Via

Access Paper or Ask Questions

Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models

Jun 12, 2024

Javier Nistal, Marco Pasini, Cyran Aouameur, Maarten Grachten, Stefan Lattner

Figure 1 for Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models

Figure 2 for Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models

Figure 3 for Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models

Figure 4 for Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models

Abstract:Recent advancements in deep generative models present new opportunities for music production but also pose challenges, such as high computational demands and limited audio quality. Moreover, current systems frequently rely solely on text input and typically focus on producing complete musical pieces, which is incompatible with existing workflows in music production. To address these issues, we introduce "Diff-A-Riff," a Latent Diffusion Model designed to generate high-quality instrumental accompaniments adaptable to any musical context. This model offers control through either audio references, text prompts, or both, and produces 48kHz pseudo-stereo audio while significantly reducing inference time and memory usage. We demonstrate the model's capabilities through objective metrics and subjective listening tests, with extensive examples available on the accompanying website: sonycslparis.github.io/diffariff-companion/

* 8 pages, 2 figures, 3 tables

Via

Access Paper or Ask Questions

Stochastic Restoration of Heavily Compressed Musical Audio using Generative Adversarial Networks

Jul 04, 2022

Stefan Lattner, Javier Nistal

Figure 1 for Stochastic Restoration of Heavily Compressed Musical Audio using Generative Adversarial Networks

Figure 2 for Stochastic Restoration of Heavily Compressed Musical Audio using Generative Adversarial Networks

Figure 3 for Stochastic Restoration of Heavily Compressed Musical Audio using Generative Adversarial Networks

Figure 4 for Stochastic Restoration of Heavily Compressed Musical Audio using Generative Adversarial Networks

Abstract:Lossy audio codecs compress (and decompress) digital audio streams by removing information that tends to be inaudible in human perception. Under high compression rates, such codecs may introduce a variety of impairments in the audio signal. Many works have tackled the problem of audio enhancement and compression artifact removal using deep learning techniques. However, only a few works tackle the restoration of heavily compressed audio signals in the musical domain. In such a scenario, there is no unique solution for the restoration of the original signal. Therefore, in this study, we test a stochastic generator of a Generative Adversarial Network (GAN) architecture for this task. Such a stochastic generator, conditioned on highly compressed musical audio signals, could one day generate outputs indistinguishable from high-quality releases. Therefore, the present study may yield insights into more efficient musical data storage and transmission. We train stochastic and deterministic generators on MP3-compressed audio signals with 16, 32, and 64 kbit/s. We perform an extensive evaluation of the different experiments utilizing objective metrics and listening tests. We find that the models can improve the quality of the audio signals over the MP3 versions for 16 and 32 kbit/s and that the stochastic generators are capable of generating outputs that are closer to the original signals than those of the deterministic generators.

* MDPI Electronics 2021, 10, 1349
* 21 pages, 5 figures, published in MDPI Electronics Special Issue "Machine Learning Applied to Music/Audio Signal Processing"

Via

Access Paper or Ask Questions

DrumGAN VST: A Plugin for Drum Sound Analysis/Synthesis With Autoencoding Generative Adversarial Networks

Jun 29, 2022

Javier Nistal, Cyran Aouameur, Ithan Velarde, Stefan Lattner

Figure 1 for DrumGAN VST: A Plugin for Drum Sound Analysis/Synthesis With Autoencoding Generative Adversarial Networks

Figure 2 for DrumGAN VST: A Plugin for Drum Sound Analysis/Synthesis With Autoencoding Generative Adversarial Networks

Figure 3 for DrumGAN VST: A Plugin for Drum Sound Analysis/Synthesis With Autoencoding Generative Adversarial Networks

Figure 4 for DrumGAN VST: A Plugin for Drum Sound Analysis/Synthesis With Autoencoding Generative Adversarial Networks

Abstract:In contemporary popular music production, drum sound design is commonly performed by cumbersome browsing and processing of pre-recorded samples in sound libraries. One can also use specialized synthesis hardware, typically controlled through low-level, musically meaningless parameters. Today, the field of Deep Learning offers methods to control the synthesis process via learned high-level features and allows generating a wide variety of sounds. In this paper, we present DrumGAN VST, a plugin for synthesizing drum sounds using a Generative Adversarial Network. DrumGAN VST operates on 44.1 kHz sample-rate audio, offers independent and continuous instrument class controls, and features an encoding neural network that maps sounds into the GAN's latent space, enabling resynthesis and manipulation of pre-existing drum sounds. We provide numerous sound examples and a demo of the proposed VST plugin.

* 7 pages, 2 figures, 3 tables, ICML2022 Machine Learning for Audio Synthesis (MLAS) Workshop, for sound examples visit https://cslmusicteam.sony.fr/drumgan-vst/

Via

Access Paper or Ask Questions

DarkGAN: Exploiting Knowledge Distillation for Comprehensible Audio Synthesis with GANs

Aug 03, 2021

Javier Nistal, Stefan Lattner, Gaël Richard

Figure 1 for DarkGAN: Exploiting Knowledge Distillation for Comprehensible Audio Synthesis with GANs

Figure 2 for DarkGAN: Exploiting Knowledge Distillation for Comprehensible Audio Synthesis with GANs

Figure 3 for DarkGAN: Exploiting Knowledge Distillation for Comprehensible Audio Synthesis with GANs

Figure 4 for DarkGAN: Exploiting Knowledge Distillation for Comprehensible Audio Synthesis with GANs

Abstract:Generative Adversarial Networks (GANs) have achieved excellent audio synthesis quality in the last years. However, making them operable with semantically meaningful controls remains an open challenge. An obvious approach is to control the GAN by conditioning it on metadata contained in audio datasets. Unfortunately, audio datasets often lack the desired annotations, especially in the musical domain. A way to circumvent this lack of annotations is to generate them, for example, with an automatic audio-tagging system. The output probabilities of such systems (so-called "soft labels") carry rich information about the characteristics of the respective audios and can be used to distill the knowledge from a teacher model into a student model. In this work, we perform knowledge distillation from a large audio tagging system into an adversarial audio synthesizer that we call DarkGAN. Results show that DarkGAN can synthesize musical audio with acceptable quality and exhibits moderate attribute control even with out-of-distribution input conditioning. We release the code and provide audio examples on the accompanying website.

* 22nd International Society for Music Information Retrieval (ISMIR 2021)
* 9 pages, 3 figures, 2 tables, accepted to ISMIR2021

Via

Access Paper or Ask Questions

VQCPC-GAN: Variable-length Adversarial Audio Synthesis using Vector-Quantized Contrastive Predictive Coding

May 04, 2021

Javier Nistal, Cyran Aouameur, Stefan Lattner, Gaël Richard

Figure 1 for VQCPC-GAN: Variable-length Adversarial Audio Synthesis using Vector-Quantized Contrastive Predictive Coding

Figure 2 for VQCPC-GAN: Variable-length Adversarial Audio Synthesis using Vector-Quantized Contrastive Predictive Coding

Abstract:Influenced by the field of Computer Vision, Generative Adversarial Networks (GANs) are often adopted for the audio domain using fixed-size two-dimensional spectrogram representations as the "image data". However, in the (musical) audio domain, it is often desired to generate output of variable duration. This paper presents VQCPC-GAN, an adversarial framework for synthesizing variable-length audio by exploiting Vector-Quantized Contrastive Predictive Coding (VQCPC). A sequence of VQCPC tokens extracted from real audio data serves as conditional input to a GAN architecture, providing step-wise time-dependent features of the generated content. The input noise z (characteristic in adversarial architectures) remains fixed over time, ensuring temporal consistency of global features. We evaluate the proposed model by comparing a diverse set of metrics against various strong baselines. Results show that, even though the baselines score best, VQCPC-GAN achieves comparable performance even when generating variable-length audio. Numerous sound examples are provided in the accompanying website, and we release the code for reproducibility.

* 5 pages, 1 figure, 1 table; under review for WASPAA 2021

Via

Access Paper or Ask Questions