Luna McNulty

PerMod: Perceptually Grounded Voice Modification with Latent Diffusion Models

Dec 13, 2023

Robin Netzorg, Ajil Jalal, Luna McNulty, Gopala Krishna Anumanchipalli

Abstract: Perceptual modification of voice is an elusive goal. While non-experts can modify an image or sentence perceptually with available tools, it is not clear how to similarly modify speech along perceptual axes. Voice conversion does make it possible to convert one voice to another, but these modifications are handled by black box models, and the specifics of what perceptual qualities to modify and how to modify them are unclear. Towards allowing greater perceptual control over voice, we introduce PerMod, a conditional latent diffusion model that takes in an input voice and a perceptual qualities vector, and produces a voice with the matching perceptual qualities. Unlike prior work, PerMod generates a new voice corresponding to specific perceptual modifications. Evaluating perceptual quality vectors with RMSE from both human and predicted labels, we demonstrate that PerMod produces voices with the desired perceptual qualities for typical voices, but performs poorly on atypical voices.
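The abstract mentions evaluating perceptual quality vectors with RMSE. As a rough illustration of that metric only (not the authors' evaluation code), the sketch below computes the root-mean-square error between a target perceptual qualities vector and a rated one. The function name `rmse`, the three-element vectors, and all numeric values are hypothetical and chosen purely for the example.

```python
import math


def rmse(target, predicted):
    """Root-mean-square error between two equal-length numeric vectors."""
    if len(target) != len(predicted):
        raise ValueError("vectors must have the same length")
    if not target:
        raise ValueError("vectors must be non-empty")
    squared_errors = [(t - p) ** 2 for t, p in zip(target, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical perceptual quality ratings (illustrative values only).
target_pq = [50.0, 20.0, 70.0]     # qualities requested from the model
predicted_pq = [47.0, 24.0, 70.0]  # qualities rated on the generated voice

error = rmse(target_pq, predicted_pq)
print(f"RMSE: {error:.3f}")
```

A lower value means the rated qualities of the generated voice sit closer to the requested ones; the same function could be applied to human ratings or to labels from a predictor, assuming both use the same scale.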

Towards an Interpretable Representation of Speaker Identity via Perceptual Voice Qualities

Oct 04, 2023

Robin Netzorg, Bohan Yu, Andrea Guzman, Peter Wu, Luna McNulty, Gopala Anumanchipalli

Abstract: Unlike other data modalities such as text and vision, speech does not lend itself to easy interpretation. While lay people can understand how to describe an image or sentence via perception, non-expert descriptions of speech often end at high-level demographic information, such as gender or age. In this paper, we propose a possible interpretable representation of speaker identity based on perceptual voice qualities (PQs). By adding gendered PQs to the pathology-focused Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) protocol, our PQ-based approach provides a perceptual latent space of the character of adult voices that is an intermediary of abstraction between high-level demographics and low-level acoustic, physical, or learned representations. Contrary to prior belief, we demonstrate that these PQs are hearable by ensembles of non-experts, and further demonstrate that the information encoded in a PQ-based representation is predictable by various speech representations.
