Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Nicholas J. Bryan

DRAGON: Distributional Rewards Optimize Diffusion Generative Models

Apr 21, 2025

Yatong Bai, Jonah Casebeer, Somayeh Sojoudi, Nicholas J. Bryan

Abstract:We present Distributional RewArds for Generative OptimizatioN (DRAGON), a versatile framework for fine-tuning media generation models towards a desired outcome. Compared with traditional reinforcement learning with human feedback (RLHF) or pairwise preference approaches such as direct preference optimization (DPO), DRAGON is more flexible. It can optimize reward functions that evaluate either individual examples or distributions of them, making it compatible with a broad spectrum of instance-wise, instance-to-distribution, and distribution-to-distribution rewards. Leveraging this versatility, we construct novel reward functions by selecting an encoder and a set of reference examples to create an exemplar distribution. When cross-modality encoders such as CLAP are used, the reference examples may be of a different modality (e.g., text versus audio). Then, DRAGON gathers online and on-policy generations, scores them to construct a positive demonstration set and a negative set, and leverages the contrast between the two sets to maximize the reward. For evaluation, we fine-tune an audio-domain text-to-music diffusion model with 20 different reward functions, including a custom music aesthetics model, CLAP score, Vendi diversity, and Frechet audio distance (FAD). We further compare instance-wise (per-song) and full-dataset FAD settings while ablating multiple FAD encoders and reference sets. Over all 20 target rewards, DRAGON achieves an 81.45% average win rate. Moreover, reward functions based on exemplar sets indeed enhance generations and are comparable to model-based rewards. With an appropriate exemplar set, DRAGON achieves a 60.95% human-voted music quality win rate without training on human preference annotations. As such, DRAGON exhibits a new approach to designing and optimizing reward functions for improving human-perceived quality. Sound examples at https://ml-dragon.github.io/web.

Via

Access Paper or Ask Questions

Presto! Distilling Steps and Layers for Accelerating Music Generation

Oct 07, 2024

Zachary Novack, Ge Zhu, Jonah Casebeer, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan

Figure 1 for Presto! Distilling Steps and Layers for Accelerating Music Generation

Figure 2 for Presto! Distilling Steps and Layers for Accelerating Music Generation

Figure 3 for Presto! Distilling Steps and Layers for Accelerating Music Generation

Figure 4 for Presto! Distilling Steps and Layers for Accelerating Music Generation

Abstract:Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers via reducing both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM-family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple, but powerful improvement to a recent layer distillation method that improves learning via better preserving hidden state variance. Finally, we combine our step and layer distillation methods together for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show each yield best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435ms latency for 32 second mono/stereo 44.1kHz, 15x faster than comparable SOTA) -- the fastest high-quality TTM to our knowledge. Sound examples can be found at https://presto-music.github.io/web/.

Via

Access Paper or Ask Questions

MusicHiFi: Fast High-Fidelity Stereo Vocoding

Mar 20, 2024

Ge Zhu, Juan-Pablo Caceres, Zhiyao Duan, Nicholas J. Bryan

Figure 1 for MusicHiFi: Fast High-Fidelity Stereo Vocoding

Figure 2 for MusicHiFi: Fast High-Fidelity Stereo Vocoding

Figure 3 for MusicHiFi: Fast High-Fidelity Stereo Vocoding

Figure 4 for MusicHiFi: Fast High-Fidelity Stereo Vocoding

Abstract:Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi -- an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that convert low-resolution mel-spectrograms to audio, upsamples to high-resolution audio via bandwidth expansion, and upmixes to stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach using both objective and subjective listening tests and find our approach yields comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work. Sound examples are at https://MusicHiFi.github.io/web/.

Via

Access Paper or Ask Questions

Scaling Up Adaptive Filter Optimizers

Mar 01, 2024

Jonah Casebeer, Nicholas J. Bryan, Paris Smaragdis

Figure 1 for Scaling Up Adaptive Filter Optimizers

Figure 2 for Scaling Up Adaptive Filter Optimizers

Figure 3 for Scaling Up Adaptive Filter Optimizers

Abstract:We introduce a new online adaptive filtering method called supervised multi-step adaptive filters (SMS-AF). Our method uses neural networks to control or optimize linear multi-delay or multi-channel frequency-domain filters and can flexibly scale-up performance at the cost of increased compute -- a property rarely addressed in the AF literature, but critical for many applications. To do so, we extend recent work with a set of improvements including feature pruning, a supervised loss, and multiple optimization steps per time-frame. These improvements work in a cohesive manner to unlock scaling. Furthermore, we show how our method relates to Kalman filtering and meta-adaptive filtering, making it seamlessly applicable to a diverse set of AF tasks. We evaluate our method on acoustic echo cancellation (AEC) and multi-channel speech enhancement tasks and compare against several baselines on standard synthetic and real-world datasets. Results show our method performance scales with inference cost and model capacity, yields multi-dB performance gains for both tasks, and is real-time capable on a single CPU core.

Via

Access Paper or Ask Questions

DITTO: Diffusion Inference-Time T-Optimization for Music Generation

Jan 22, 2024

Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan

Abstract:We propose Diffusion Inference-Time T-Optimization (DITTO), a general-purpose frame-work for controlling pre-trained text-to-music diffusion models at inference-time via optimizing initial noise latents. Our method can be used to optimize through any differentiable feature matching loss to achieve a target (stylized) output and leverages gradient checkpointing for memory efficiency. We demonstrate a surprisingly wide-range of applications for music generation including inpainting, outpainting, and looping as well as intensity, melody, and musical structure control - all without ever fine-tuning the underlying model. When we compare our approach against related training, guidance, and optimization-based methods, we find DITTO achieves state-of-the-art performance on nearly all tasks, including outperforming comparable approaches on controllability, audio quality, and computational efficiency, thus opening the door for high-quality, flexible, training-free control of diffusion models. Sound examples can be found at https://DITTO-Music.github.io/web/.

Via

Access Paper or Ask Questions

Music ControlNet: Multiple Time-varying Controls for Music Generation

Nov 13, 2023

Shih-Lun Wu, Chris Donahue, Shinji Watanabe, Nicholas J. Bryan

Figure 1 for Music ControlNet: Multiple Time-varying Controls for Music Generation

Figure 2 for Music ControlNet: Multiple Time-varying Controls for Music Generation

Figure 3 for Music ControlNet: Multiple Time-varying Controls for Music Generation

Figure 4 for Music ControlNet: Multiple Time-varying Controls for Music Generation

Abstract:Text-to-music generation models are now capable of generating high-quality music audio in broad styles. However, text control is primarily suitable for the manipulation of global musical attributes like genre, mood, and tempo, and is less suitable for precise control over time-varying attributes such as the positions of beats in time or the changing dynamics of the music. We propose Music ControlNet, a diffusion-based music generation model that offers multiple precise, time-varying controls over generated audio. To imbue text-to-music models with time-varying control, we propose an approach analogous to pixel-wise control of the image-domain ControlNet method. Specifically, we extract controls from training audio yielding paired data, and fine-tune a diffusion-based conditional generative model over audio spectrograms given melody, dynamics, and rhythm controls. While the image-domain Uni-ControlNet method already allows generation with any subset of controls, we devise a new strategy to allow creators to input controls that are only partially specified in time. We evaluate both on controls extracted from audio and controls we expect creators to provide, demonstrating that we can generate realistic music that corresponds to control inputs in both settings. While few comparable music generation models exist, we benchmark against MusicGen, a recent model that accepts text and melody input, and show that our model generates music that is 49% more faithful to input melodies despite having 35x fewer parameters, training on 11x less data, and enabling two additional forms of time-varying control. Sound examples can be found at https://MusicControlNet.github.io/web/.

* 11 pages, 4 figure, 5 tables, Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)

Via

Access Paper or Ask Questions

Meta-Learning for Adaptive Filters with Higher-Order Frequency Dependencies

Sep 20, 2022

Junkai Wu, Jonah Casebeer, Nicholas J. Bryan, Paris Smaragdis

Figure 1 for Meta-Learning for Adaptive Filters with Higher-Order Frequency Dependencies

Figure 2 for Meta-Learning for Adaptive Filters with Higher-Order Frequency Dependencies

Figure 3 for Meta-Learning for Adaptive Filters with Higher-Order Frequency Dependencies

Figure 4 for Meta-Learning for Adaptive Filters with Higher-Order Frequency Dependencies

Abstract:Adaptive filters are applicable to many signal processing tasks including acoustic echo cancellation, beamforming, and more. Adaptive filters are typically controlled using algorithms such as least-mean squares(LMS), recursive least squares(RLS), or Kalman filter updates. Such models are often applied in the frequency domain, assume frequency independent processing, and do not exploit higher-order frequency dependencies, for simplicity. Recent work on meta-adaptive filters, however, has shown that we can control filter adaptation using neural networks without manual derivation, motivating new work to exploit such information. In this work, we present higher-order meta-adaptive filters, a key improvement to meta-adaptive filters that incorporates higher-order frequency dependencies. We demonstrate our approach on acoustic echo cancellation and develop a family of filters that yield multi-dB improvements over competitive baselines, and are at least an order-of-magnitude less complex. Moreover, we show our improvements hold with or without a downstream speech enhancer.

* Source code and audio examples: https://jmcasebeer.github.io/metaaf/higher-order

Via

Access Paper or Ask Questions

Style Transfer of Audio Effects with Differentiable Signal Processing

Jul 18, 2022

Christian J. Steinmetz, Nicholas J. Bryan, Joshua D. Reiss

Figure 1 for Style Transfer of Audio Effects with Differentiable Signal Processing

Figure 2 for Style Transfer of Audio Effects with Differentiable Signal Processing

Figure 3 for Style Transfer of Audio Effects with Differentiable Signal Processing

Figure 4 for Style Transfer of Audio Effects with Differentiable Signal Processing

Abstract:We present a framework that can impose the audio effects and production style from one recording to another by example with the goal of simplifying the audio production process. We train a deep neural network to analyze an input recording and a style reference recording, and predict the control parameters of audio effects used to render the output. In contrast to past work, we integrate audio effects as differentiable operators in our framework, perform backpropagation through audio effects, and optimize end-to-end using an audio-domain loss. We use a self-supervised training strategy enabling automatic control of audio effects without the use of any labeled or paired training data. We survey a range of existing and new approaches for differentiable signal processing, showing how each can be integrated into our framework while discussing their trade-offs. We evaluate our approach on both speech and music tasks, demonstrating that our approach generalizes both to unseen recordings and even to sample rates different than those seen during training. Our approach produces convincing production style transfer results with the ability to transform input recordings to produced recordings, yielding audio effect control parameters that enable interpretability and user interaction.

* Preprint. To appear in the Journal of the Audio Engineering Society

Via

Access Paper or Ask Questions

Meta-AF: Meta-Learning for Adaptive Filters

Apr 25, 2022

Jonah Casebeer, Nicholas J. Bryan, Paris Smaragdis

Figure 1 for Meta-AF: Meta-Learning for Adaptive Filters

Figure 2 for Meta-AF: Meta-Learning for Adaptive Filters

Figure 3 for Meta-AF: Meta-Learning for Adaptive Filters

Figure 4 for Meta-AF: Meta-Learning for Adaptive Filters

Abstract:Adaptive filtering algorithms are pervasive throughout modern society and have had a significant impact on a wide variety of domains including audio processing, telecommunications, biomedical sensing, astropyhysics and cosmology, seismology, and many more. Adaptive filters typically operate via specialized online, iterative optimization methods such as least-mean squares or recursive least squares and aim to process signals in unknown or nonstationary environments. Such algorithms, however, can be slow and laborious to develop, require domain expertise to create, and necessitate mathematical insight for improvement. In this work, we seek to go beyond the limits of human-derived adaptive filter algorithms and present a comprehensive framework for learning online, adaptive signal processing algorithms or update rules directly from data. To do so, we frame the development of adaptive filters as a meta-learning problem in the context of deep learning and use a form of self-supervision to learn online iterative update rules for adaptive filters. To demonstrate our approach, we focus on audio applications and systematically develop meta-learned adaptive filters for five canonical audio problems including system identification, acoustic echo cancellation, blind equalization, multi-channel dereverberation, and beamforming. For each application, we compare against common baselines and/or current state-of-the-art methods and show we can learn high-performing adaptive filters that operate in real-time and, in most cases, significantly out perform all past specially developed methods for each task using a single general-purpose configuration of our method.

* Source code and audio examples: https://jmcasebeer.github.io/projects/metaaf

Via

Access Paper or Ask Questions

Emotion Embedding Spaces for Matching Music to Stories

Nov 26, 2021

Minz Won, Justin Salamon, Nicholas J. Bryan, Gautham J. Mysore, Xavier Serra

Figure 1 for Emotion Embedding Spaces for Matching Music to Stories

Figure 2 for Emotion Embedding Spaces for Matching Music to Stories

Figure 3 for Emotion Embedding Spaces for Matching Music to Stories

Figure 4 for Emotion Embedding Spaces for Matching Music to Stories

Abstract:Content creators often use music to enhance their stories, as it can be a powerful tool to convey emotion. In this paper, our goal is to help creators find music to match the emotion of their story. We focus on text-based stories that can be auralized (e.g., books), use multiple sentences as input queries, and automatically retrieve matching music. We formalize this task as a cross-modal text-to-music retrieval problem. Both the music and text domains have existing datasets with emotion labels, but mismatched emotion vocabularies prevent us from using mood or emotion annotations directly for matching. To address this challenge, we propose and investigate several emotion embedding spaces, both manually defined (e.g., valence/arousal) and data-driven (e.g., Word2Vec and metric learning) to bridge this gap. Our experiments show that by leveraging these embedding spaces, we are able to successfully bridge the gap between modalities to facilitate cross modal retrieval. We show that our method can leverage the well established valence-arousal space, but that it can also achieve our goal via data-driven embedding spaces. By leveraging data-driven embeddings, our approach has the potential of being generalized to other retrieval tasks that require broader or completely different vocabularies.

* International Society for Music Information Retrieval (ISMIR) 2021, Best Student Paper

Via

Access Paper or Ask Questions