Abstract: The rapid proliferation of generative audio synthesis and editing technologies has raised significant concerns about copyright infringement, data provenance, and the spread of misinformation through deepfake audio. Watermarking offers a proactive solution by embedding imperceptible, identifiable, and traceable marks into audio content. While recent neural network-based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to achieve both robust detection and accurate attribution simultaneously. This paper introduces Cross-Attention Robust Audio Watermark (XAttnMark), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic-aligned temporal-frequency masking loss that captures fine-grained auditory masking effects, enhancing watermark imperceptibility. Our approach achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing with strong editing strength. The project webpage is available at https://liuyixin-louis.github.io/xattnmark/.
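The abstract does not give architectural details, but a minimal sketch of the kind of cross-attention message retrieval it describes, with learnable per-bit queries attending over features of the watermarked audio, might look as follows. All module names, dimensions, and the shared-encoder interface are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): cross-attention message retrieval,
# where learnable message-bit queries attend over watermarked-audio features.
import torch
import torch.nn as nn

class CrossAttnMessageDecoder(nn.Module):
    def __init__(self, feat_dim=256, n_bits=16, n_heads=4):
        super().__init__()
        # One learnable query per message bit (assumed design).
        self.bit_queries = nn.Parameter(torch.randn(n_bits, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.bit_head = nn.Linear(feat_dim, 1)  # per-bit logit

    def forward(self, audio_feats):
        # audio_feats: (batch, time, feat_dim) features from a shared encoder.
        b = audio_feats.size(0)
        q = self.bit_queries.unsqueeze(0).expand(b, -1, -1)   # (b, n_bits, d)
        attended, _ = self.cross_attn(q, audio_feats, audio_feats)
        return self.bit_head(attended).squeeze(-1)            # (b, n_bits) logits

# Usage: decode a 16-bit message from encoder features of a short clip.
decoder = CrossAttnMessageDecoder()
feats = torch.randn(2, 100, 256)    # (batch, frames, feat_dim)
bit_logits = decoder(feats)         # (2, 16); the sign of each logit gives a decoded bit
```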
Abstract: We present AudioMiXR, an augmented reality (AR) interface for assessing how users manipulate virtual audio objects situated in their physical space with six degrees of freedom (6DoF), deployed on a head-mounted display (Apple Vision Pro) for 3D sound design. Existing tools for 3D sound design are typically constrained to desktop displays, which may limit spatial awareness of mixing within the execution environment. Using an XR HMD to create soundscapes may provide a real-time test environment for 3D sound design, as modern HMDs can provide precise spatial localization assisted by cross-modal interactions. However, there is no research on design guidelines specific to 6DoF sound design in XR. As a first step toward identifying design-related research directions in this space, we conducted an exploratory study with 27 participants, consisting of expert and non-expert sound designers, to derive design lessons that can inform future research on 3D sound design. In a within-subjects study, users designed both a music and a cinematic soundscape. After thematically analyzing participant data, we constructed two design lessons: 1. Proprioception for AR Sound Design, and 2. Balancing Audio-Visual Modalities in AR GUIs. Additionally, we identify application domains that can benefit most from 6DoF sound design based on our results.
Abstract: Conversion of non-native accented speech to native (American) English has a wide range of applications, such as improving the intelligibility of non-native speech. Previous work in this domain has used phonetic posteriorgrams as the target speech representation to train an acoustic model, which is then used to extract a compact representation of the input speech for accent conversion. In this work, we introduce the idea of using an effective articulatory speech representation, extracted from an acoustic-to-articulatory speech inversion system, to improve the acoustic model used in accent conversion. The motivation for incorporating articulatory representations stems from their ability to characterize accents in speech well. To combine articulatory representations with conventional phonetic posteriorgrams, a multi-task learning-based acoustic model is proposed. Objective and subjective evaluations show that the use of articulatory representations can improve the effectiveness of accent conversion.
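A minimal sketch of the multi-task setup described here, assuming a shared recurrent encoder with a phonetic-posteriorgram classification head and an articulatory regression head, could look like the following. Layer sizes, the number of articulatory variables, and all module names are illustrative assumptions, not the paper's actual model.

```python
# Minimal sketch (assumed architecture): a multi-task acoustic model with a
# shared encoder, a phonetic-posteriorgram (PPG) head, and an articulatory head.
import torch
import torch.nn as nn

class MultiTaskAcousticModel(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_phones=40, n_artic=6):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.ppg_head = nn.Linear(2 * hidden, n_phones)   # per-frame phone classification
        self.artic_head = nn.Linear(2 * hidden, n_artic)  # regression to articulatory variables

    def forward(self, mel):                  # mel: (batch, frames, n_mels)
        h, _ = self.encoder(mel)
        return self.ppg_head(h), self.artic_head(h)

model = MultiTaskAcousticModel()
mel = torch.randn(4, 200, 80)
ppg_logits, artic_pred = model(mel)
# Joint loss: cross-entropy on phone labels plus MSE on articulatory targets.
phone_targets = torch.randint(0, 40, (4, 200))
artic_targets = torch.randn(4, 200, 6)
loss = (nn.functional.cross_entropy(ppg_logits.transpose(1, 2), phone_targets)
        + nn.functional.mse_loss(artic_pred, artic_targets))
```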
Abstract: The transformer is a widely used building block in modern neural networks. However, when applied to audio data, the transformer's acausal behaviour, which we term Acausal Attention (AA), has generally limited its application to offline tasks. In this paper we introduce Streaming Attention (SA), which operates causally with fixed latency and requires less compute and memory than AA to train. Next, we introduce Low Latency Streaming Attention (LLSA), a method which combines multiple SA layers without latency build-up proportional to the layer count. Comparative analyses of AA, SA and LLSA on Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER) tasks are presented. The results show that causal SA-based networks with fixed latencies of a few seconds (e.g. 1.8 seconds), and LLSA networks with latencies as short as 300 ms, can perform comparably with acausal (AA) networks. We conclude that SA and LLSA retain many of the benefits of conventional acausal transformers, but with latency characteristics that make them practical to run in real-time streaming applications.
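The abstract does not specify how SA is implemented, but one common way to obtain attention with a fixed latency is to mask out all but a fixed lookahead window of future frames; the sketch below illustrates that idea. The masking scheme, dimensions, and names are our assumptions, not necessarily the authors' method.

```python
# Minimal sketch (our illustration): an attention mask that lets each frame see
# all past frames plus a fixed lookahead window, giving a fixed per-layer latency.
import torch
import torch.nn as nn

def fixed_lookahead_mask(n_frames, lookahead):
    # mask[i, j] = True means position i may NOT attend to position j.
    idx = torch.arange(n_frames)
    return idx[None, :] > (idx[:, None] + lookahead)

n_frames, lookahead = 8, 2            # e.g. two frames of future context
mask = fixed_lookahead_mask(n_frames, lookahead)

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(1, n_frames, 64)
out, _ = attn(x, x, x, attn_mask=mask)   # each output frame depends on at most `lookahead` future frames
```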