Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Patrick O'Reilly

Deep Audio Watermarks are Shallow: Limitations of Post-Hoc Watermarking Techniques for Speech

Apr 15, 2025

Patrick O'Reilly, Zeyu Jin, Jiaqi Su, Bryan Pardo

Abstract:In the audio modality, state-of-the-art watermarking methods leverage deep neural networks to allow the embedding of human-imperceptible signatures in generated audio. The ideal is to embed signatures that can be detected with high accuracy when the watermarked audio is altered via compression, filtering, or other transformations. Existing audio watermarking techniques operate in a post-hoc manner, manipulating "low-level" features of audio recordings after generation (e.g. through the addition of a low-magnitude watermark signal). We show that this post-hoc formulation makes existing audio watermarks vulnerable to transformation-based removal attacks. Focusing on speech audio, we (1) unify and extend existing evaluations of the effect of audio transformations on watermark detectability, and (2) demonstrate that state-of-the-art post-hoc audio watermarks can be removed with no knowledge of the watermarking scheme and minimal degradation in audio quality.

* ICLR 2025 Workshop on GenAI Watermarking

Via

Access Paper or Ask Questions

HARP 2.0: Expanding Hosted, Asynchronous, Remote Processing for Deep Learning in the DAW

Mar 04, 2025

Christodoulos Benetatos, Frank Cwitkowitz, Nathan Pruyne, Hugo Flores Garcia, Patrick O'Reilly, Zhiyao Duan, Bryan Pardo

Abstract:HARP 2.0 brings deep learning models to digital audio workstation (DAW) software through hosted, asynchronous, remote processing, allowing users to route audio from a plug-in interface through any compatible Gradio endpoint to perform arbitrary transformations. HARP renders endpoint-defined controls and processed audio in-plugin, meaning users can explore a variety of cutting-edge deep learning models without ever leaving the DAW. In the 2.0 release we introduce support for MIDI-based models and audio/MIDI labeling models, provide a streamlined pyharp Python API for model developers, and implement numerous interface and stability improvements. Through this work, we hope to bridge the gap between model developers and creatives, improving access to deep learning models by seamlessly integrating them into DAW workflows.

* ISMIR 2024 Late-Breaking Demo

Via

Access Paper or Ask Questions

Code Drift: Towards Idempotent Neural Audio Codecs

Oct 14, 2024

Patrick O'Reilly, Prem Seetharaman, Jiaqi Su, Zeyu Jin, Bryan Pardo

Abstract:Neural codecs have demonstrated strong performance in high-fidelity compression of audio signals at low bitrates. The token-based representations produced by these codecs have proven particularly useful for generative modeling. While much research has focused on improvements in compression ratio and perceptual transparency, recent works have largely overlooked another desirable codec property -- idempotence, the stability of compressed outputs under multiple rounds of encoding. We find that state-of-the-art neural codecs exhibit varied degrees of idempotence, with some degrading audio outputs significantly after as few as three encodings. We investigate possible causes of low idempotence and devise a method for improving idempotence through fine-tuning a codec model. We then examine the effect of idempotence on a simple conditional generative modeling task, and find that increased idempotence can be achieved without negatively impacting downstream modeling performance -- potentially extending the usefulness of neural codecs for practical file compression and iterative generative modeling workflows.

* Submitted to ICASSP 2025

Via

Access Paper or Ask Questions

Text2FX: Harnessing CLAP Embeddings for Text-Guided Audio Effects

Sep 27, 2024

Annie Chu, Patrick O'Reilly, Julia Barnett, Bryan Pardo

Figure 1 for Text2FX: Harnessing CLAP Embeddings for Text-Guided Audio Effects

Figure 2 for Text2FX: Harnessing CLAP Embeddings for Text-Guided Audio Effects

Figure 3 for Text2FX: Harnessing CLAP Embeddings for Text-Guided Audio Effects

Figure 4 for Text2FX: Harnessing CLAP Embeddings for Text-Guided Audio Effects

Abstract:This work introduces Text2FX, a method that leverages CLAP embeddings and differentiable digital signal processing to control audio effects, such as equalization and reverberation, using open-vocabulary natural language prompts (e.g., "make this sound in-your-face and bold"). Text2FX operates without retraining any models, relying instead on single-instance optimization within the existing embedding space. We show that CLAP encodes valuable information for controlling audio effects and propose two optimization approaches using CLAP to map text to audio effect parameters. While we demonstrate with CLAP, this approach is applicable to any shared text-audio embedding space. Similarly, while we demonstrate with equalization and reverberation, any differentiable audio effect may be controlled. We conduct a listener study with diverse text prompts and source audio to evaluate the quality and alignment of these methods with human perception.

* Submitted to ICASSP 2025

Via

Access Paper or Ask Questions

Unsupervised Source Separation By Steering Pretrained Music Models

Oct 25, 2021

Ethan Manilow, Patrick O'Reilly, Prem Seetharaman, Bryan Pardo

Figure 1 for Unsupervised Source Separation By Steering Pretrained Music Models

Figure 2 for Unsupervised Source Separation By Steering Pretrained Music Models

Figure 3 for Unsupervised Source Separation By Steering Pretrained Music Models

Abstract:We showcase an unsupervised method that repurposes deep models trained for music generation and music tagging for audio source separation, without any retraining. An audio generation model is conditioned on an input mixture, producing a latent encoding of the audio used to generate audio. This generated audio is fed to a pretrained music tagger that creates source labels. The cross-entropy loss between the tag distribution for the generated audio and a predefined distribution for an isolated source is used to guide gradient ascent in the (unchanging) latent space of the generative model. This system does not update the weights of the generative model or the tagger, and only relies on moving through the generative model's latent space to produce separated sources. We use OpenAI's Jukebox as the pretrained generative model, and we couple it with four kinds of pretrained music taggers (two architectures and two tagging datasets). Experimental results on two source separation datasets, show this approach can produce separation estimates for a wider variety of sources than any tested supervised or unsupervised system. This work points to the vast and heretofore untapped potential of large pretrained music models for audio-to-audio tasks like source separation.

* Submitted to ICASSP 2022

Via

Access Paper or Ask Questions