Anders R. Bargum

RAVE for Speech: Efficient Voice Conversion at High Sampling Rates

Aug 29, 2024

Anders R. Bargum, Simon Lajboschitz, Cumhur Erkut

Abstract: Voice conversion has gained increasing popularity within the field of audio manipulation and speech synthesis. Often, the main objective is to transfer the input identity to that of a target speaker without changing its linguistic content. While current work provides high-fidelity solutions, it rarely focuses on model simplicity, high-sampling-rate environments, or streamability. By incorporating speech representation learning into a generative timbre transfer model, traditionally created for musical purposes, we investigate the realm of voice conversion generated directly in the time domain at high sampling rates. More specifically, we guide the latent space of a baseline model towards linguistically relevant representations and condition it on external speaker information. Through objective and subjective assessments, we demonstrate that the proposed solution can attain levels of naturalness, quality, and intelligibility comparable to those of a state-of-the-art solution for seen speakers, while significantly decreasing inference time. However, despite the presence of target speaker characteristics in the converted output, the actual similarity to unseen speakers remains a challenge.

* Accepted for publication in Proceedings of the 27th International Conference on Digital Audio Effects (DAFx24), Guildford, United Kingdom, 3 - 7 September 2024

Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion

Nov 14, 2023

Anders R. Bargum, Stefania Serafin, Cumhur Erkut

Abstract: Research on deep learning-powered voice conversion (VC) in speech-to-speech scenarios is becoming increasingly popular. Although many of the works in the field of voice conversion share a common global pipeline, there is considerable diversity in the underlying structures, methods, and neural sub-blocks used across research efforts. Thus, obtaining a comprehensive understanding of the reasons behind the choice of the different methods in the voice conversion pipeline can be challenging, and the actual hurdles in the proposed solutions are often unclear. To shed light on these aspects, this paper presents a scoping review that explores the use of deep learning in speech analysis, synthesis, and disentangled speech representation learning within modern voice conversion systems. We screened 621 publications from more than 38 different venues between the years 2017 and 2023, followed by an in-depth review of a final database consisting of 123 eligible studies. Based on the review, we summarise the most frequently used approaches to voice conversion based on deep learning and highlight common pitfalls within the community. Lastly, we condense the knowledge gathered, identify the main challenges, and provide recommendations for future research directions.

Differentiable Allpass Filters for Phase Response Estimation and Automatic Signal Alignment

Jun 02, 2023

Anders R. Bargum, Stefania Serafin, Cumhur Erkut, Julian D. Parker

Abstract: Virtual analog (VA) audio effects are increasingly based on neural networks and deep learning frameworks. Due to the underlying black-box methodology, a successful model will learn to approximate the data it is presented with, including potential errors such as latency and audio dropouts as well as non-linear characteristics and frequency-dependent phase shifts produced by the hardware. The latter is of particular interest, as the learned phase response might cause unwanted audible artifacts when the effect is used for creative processing techniques such as dry-wet mixing or parallel compression. To overcome these artifacts, we propose differentiable signal processing tools and deep optimization structures for automatically tuning all-pass filters to predict the phase response of different VA simulations and align processed signals that are out of phase. The approaches are assessed using objective metrics, while listening tests evaluate their ability to enhance the quality of parallel path processing techniques. Ultimately, an over-parameterized, BiasNet-based all-pass model is proposed for the optimization problem under consideration, resulting in models that can estimate all-pass filter coefficients to align a dry signal with its affected (wet) equivalent.

* Collaboration done while interning/employed at Native Instruments. Accepted for publication in Proc. DAFX'23, Copenhagen, Denmark, September 2023. Sound examples at https://abargum.github.io. v2: 10 pages, LaTeX; figures resized, PDF optimized
