Donald Williamson

JSQA: Speech Quality Assessment with Perceptually-Inspired Contrastive Pretraining Based on JND Audio Pairs

Jul 15, 2025

Junyi Fan, Donald Williamson

Abstract: Speech quality assessment (SQA) is often used to learn a mapping from a high-dimensional input space to a scalar that represents the mean opinion score (MOS) of the perceptual speech quality. Learning such a mapping is challenging for many reasons, but largely because MOS exhibits high inherent variance due to perceptual and experimental-design differences. Many solutions have been proposed, but most do not properly incorporate perceptual factors into their learning algorithms (beyond the MOS label), which could lead to unsatisfactory results. To this end, we propose JSQA, a two-stage framework that pretrains an audio encoder using perceptually-guided contrastive learning on just noticeable difference (JND) pairs, followed by fine-tuning for MOS prediction. We first generate pairs of audio data within JND levels, which are then used to pretrain an encoder to leverage perceptual quality similarity information and map it into an embedding space. The JND pairs come from clean LibriSpeech utterances that are mixed with background noise from CHiME-3 at different signal-to-noise ratios (SNRs). The encoder is later fine-tuned with audio samples from the NISQA dataset for MOS prediction. Experimental results suggest that perceptually-inspired contrastive pretraining significantly improves model performance across various metrics when compared against the same network trained from scratch without pretraining. These findings suggest that incorporating perceptual factors into pretraining greatly contributes to improved SQA performance.

* Accepted to WASPAA 2025
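
The JSQA abstract describes building audio pairs by mixing clean speech with noise at different SNRs. The following is a minimal sketch of that mixing step only, not the authors' pipeline: the function names, the synthetic sine and Gaussian signals, and the 0.5 dB step `delta_db` are all assumptions chosen for illustration, since the abstract does not state the JND threshold or how pairs are selected.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0.0:
        raise ValueError("noise must not be silent")
    # SNR_dB = 10 * log10(P_clean / P_noise), solved for the required noise power.
    target_noise_power = power(clean) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    return [c + gain * n for c, n in zip(clean, noise)]


def make_pair(clean, noise, snr_db, delta_db):
    """Return two mixtures whose SNRs differ by `delta_db` (a hypothetical step size)."""
    return (
        mix_at_snr(clean, noise, snr_db),
        mix_at_snr(clean, noise, snr_db + delta_db),
    )


def measured_snr_db(clean, mixture):
    """Recover the SNR of a mixture when the clean signal is known."""
    residual = [m - c for m, c in zip(mixture, clean)]
    return 10.0 * math.log10(power(clean) / power(residual))


# Synthetic stand-ins for a speech utterance and a noise recording (assumptions).
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
mix_a, mix_b = make_pair(clean, noise, snr_db=10.0, delta_db=0.5)
```

In a real setup, `clean` and `noise` would be loaded from the speech and noise corpora, and whether a given SNR difference falls within a JND would have to come from perceptual data rather than a fixed constant.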

Building Trust Through Voice: How Vocal Tone Impacts User Perception of Attractiveness of Voice Assistants

Sep 27, 2024

Sabid Bin Habib Pias, Alicia Freel, Ran Huang, Donald Williamson, Minjeong Kim, Apu Kapadia

Abstract: Voice Assistants (VAs) are popular for simple tasks, but users are often hesitant to use them for complex activities like online shopping. We explored whether vocal characteristics, such as the VA's vocal tone, can make VAs be perceived as more attractive and trustworthy for complex tasks. Our findings show that the tone of the VA's voice significantly impacts its perceived attractiveness and trustworthiness. Participants in our experiment were more likely to be attracted to VAs with positive or neutral tones and ultimately trusted the VAs they found more attractive. We conclude that a VA's perceived trustworthiness can be enhanced through thoughtful voice design that incorporates a variety of vocal tones.

* Extended Abstract

CORN: Co-Trained Full-Reference And No-Reference Audio Metrics

Oct 13, 2023

Pranay Manocha, Donald Williamson, Adam Finkelstein

Abstract: Perceptual evaluation constitutes a crucial aspect of various audio-processing tasks. Full-reference (FR) or similarity-based metrics rely on high-quality reference recordings, to which lower-quality or corrupted versions of the recording may be compared for evaluation. In contrast, no-reference (NR) metrics evaluate a recording without relying on a reference. Both the FR and NR approaches exhibit advantages and drawbacks relative to each other. In this paper, we present a novel framework called CORN that combines the two approaches by training FR and NR models concurrently. After training, the models can be applied independently. We evaluate CORN by predicting several common objective metrics across two different architectures. The NR model trained using CORN has access to a reference recording during training, and thus, as one would expect, it consistently outperforms baseline NR models trained independently. Perhaps even more remarkable is that the CORN FR model also outperforms its baseline counterpart, even though it relies on the same training data and the same model architecture. Thus, a single training regime produces two independently useful models, each outperforming independently trained models.
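
The CORN abstract says the FR and NR models are trained together but does not specify the objective. As a rough illustration of what a shared training signal could look like, the sketch below simply sums one regression loss per model over the same batch. The summed form, the function names, and the toy scores are assumptions for illustration, not the formulation used in the paper.

```python
def mse(predictions, targets):
    """Mean squared error between two equal-length sequences of scores."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def joint_loss(fr_scores, nr_scores, targets):
    """Hypothetical co-training objective: FR loss plus NR loss on the same batch.

    fr_scores: predictions from a model that sees (reference, degraded) pairs.
    nr_scores: predictions from a model that sees only the degraded recordings.
    targets:   the objective-metric values both models are asked to predict.
    """
    return mse(fr_scores, targets) + mse(nr_scores, targets)


# Toy batch: the FR model is exact, the NR model misses the second item by 2.
loss = joint_loss(fr_scores=[1.0, 2.0], nr_scores=[1.0, 4.0], targets=[1.0, 2.0])
```

Minimizing a single quantity like this would update both models in the same step, while each model keeps its own inputs and can still be used alone afterwards, which matches the independence described in the abstract.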

Privacy-preserving and Privacy-attacking Approaches for Speech and Audio -- A Survey

Sep 26, 2023

Yuchen Liu, Apu Kapadia, Donald Williamson

Abstract: In contemporary society, voice-controlled devices, such as smartphones and home assistants, have become pervasive due to their advanced capabilities and functionality. The always-on nature of their microphones offers users the convenience of readily accessing these devices. However, recent research and events have revealed that such voice-controlled devices are prone to various forms of malicious attacks, making safeguarding against such attacks a growing concern for both users and researchers. Despite the numerous studies that have investigated adversarial attacks and privacy preservation for images, a conclusive study of this nature has not been conducted for the audio domain. Therefore, this paper aims to examine existing privacy-preserving and privacy-attacking strategies for audio and speech. To achieve this goal, we classify the attack and defense scenarios into several categories and provide a detailed analysis of each approach. We also interpret the dissimilarities between the various approaches, highlight their contributions, and examine their limitations. Our investigation reveals that voice-controlled devices based on neural networks are inherently susceptible to specific types of attacks. Although it is possible to enhance the robustness of such models to certain forms of attack, more sophisticated approaches are required to comprehensively safeguard user privacy.
