Abstract: In recent years, automatic speech recognition (ASR) models have greatly improved transcription performance, both in clean, low-noise acoustic conditions and in reverberant environments. However, all these systems rely on the availability of hundreds of hours of labelled training data in specific acoustic conditions. When such a training dataset is not available, the performance of the system is heavily impacted; this happens, for example, when a specific acoustic environment or a particular population of speakers is under-represented in the training data. In this paper, we investigate the effect of accented speech on an off-the-shelf ASR system and propose a strategy based on zero-shot text-to-speech (TTS) to augment accented speech corpora. We show that this augmentation method mitigates the loss in performance of the ASR system on accented data, yielding up to a 5% word error rate reduction (WERR). Finally, we demonstrate that by combining a modest fraction of real data with synthetically generated data, the ASR system outperforms a model trained exclusively on authentic accented speech, with up to 14% WERR.
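A minimal sketch of the mixing strategy described above, assuming a hypothetical zero-shot TTS callable `tts(text, ref_audio)` conditioned on a short accented reference clip; the function name and the mixing fraction are illustrative, not the paper's exact setting:

```python
import random

def build_augmented_corpus(real_utts, transcripts, tts, real_fraction=0.2, seed=0):
    """Mix a modest fraction of real accented speech with zero-shot TTS output.

    real_utts:   list of (audio, text) pairs of genuine accented recordings.
    transcripts: extra texts to synthesize in the target accent.
    tts:         hypothetical zero-shot TTS callable, tts(text, ref_audio) -> audio.
    """
    rng = random.Random(seed)
    kept_real = rng.sample(real_utts, int(real_fraction * len(real_utts)))
    synthetic = []
    for text in transcripts:
        # Condition the TTS on a few seconds of real accented speech so the
        # synthetic utterance carries the target accent (zero-shot cloning).
        ref_audio, _ = rng.choice(real_utts)
        synthetic.append((tts(text, ref_audio), text))
    return kept_real + synthetic
```

The resulting corpus can then be fed to any standard ASR training recipe in place of the purely real accented dataset.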
Abstract: In recent years, the need for privacy preservation when manipulating or storing personal data, including speech, has become a major issue. In this paper, we present a system addressing the speaker-level anonymization problem. We propose and evaluate a two-stage anonymization pipeline that combines a state-of-the-art anonymization model from the Voice Privacy Challenge 2022 with a zero-shot voice conversion architecture able to capture speaker characteristics from a few seconds of speech. We show that this architecture achieves strong privacy preservation while retaining pitch information. Finally, we propose a new compressed metric for evaluating anonymization systems in privacy scenarios with different constraints on privacy and utility.
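The two-stage structure can be summarized in a short sketch; both stage callables (`vpc_model`, `vc_model`) are hypothetical stand-ins for the actual models, which the abstract does not name:

```python
def anonymize(audio, vpc_model, vc_model, pseudo_speaker_ref):
    """Two-stage speaker anonymization (hypothetical model interfaces).

    Stage 1: a Voice Privacy Challenge 2022-style anonymizer suppresses the
             original speaker identity.
    Stage 2: zero-shot voice conversion towards a pseudo-speaker whose
             characteristics are captured from a few seconds of reference audio.
    """
    intermediate = vpc_model(audio)                    # identity suppression
    return vc_model(intermediate, pseudo_speaker_ref)  # re-voicing
```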
Abstract: Any audio recording encapsulates the unique fingerprint of the associated acoustic environment, namely its background noise and reverberation. Considering the scenario of a room equipped with a fixed smart speaker with one or more microphones and a wearable smart device (watch, glasses, or smartphone), we employed the improved proportionate normalized least mean square (IPNLMS) adaptive filter to estimate the relative room impulse response mapping the audio recordings of the two devices. We performed inter-device distance estimation by exploiting a new set of features, obtained by extending the definition of some acoustic attributes of the room impulse response to its relative version. In combination with the sparseness measure of the estimated relative room impulse response, the relative features allow precise inter-device distance estimation, which can be exploited for tasks such as best-microphone selection or acoustic scene analysis. Experimental results on simulated rooms of different dimensions and reverberation times demonstrate the effectiveness of this computationally lightweight approach for smart-home acoustic ranging applications.
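Since IPNLMS is a standard adaptive filter, its core update is easy to sketch in NumPy; the filter length, step size, and regularization constants below are illustrative, and the Hoyer-style sparseness measure is an assumed choice for the sparseness feature (the abstract does not specify which measure is used):

```python
import numpy as np

def ipnlms(x, d, L=512, mu=0.5, alpha=0.0, delta=1e-6, eps=1e-8):
    """Estimate the (relative) impulse response mapping reference signal x
    (e.g. the smart speaker microphone) to observation d (the wearable)."""
    h = np.zeros(L)      # current filter estimate
    x_buf = np.zeros(L)  # last L input samples, newest first
    for n in range(min(len(x), len(d))):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = x[n]
        e = d[n] - h @ x_buf  # a priori estimation error
        # Per-tap gains: a uniform NLMS-like term plus a term proportional to
        # each coefficient's magnitude, balanced by alpha in [-1, 1].
        k = (1 - alpha) / (2 * L) + (1 + alpha) * np.abs(h) / (2 * np.abs(h).sum() + eps)
        g = k * x_buf
        h = h + mu * e * g / (x_buf @ g + delta)
    return h

def sparseness(h):
    # Hoyer-style sparseness in [0, 1] (an assumed choice, see lead-in).
    L = h.size
    return (L / (L - np.sqrt(L))) * (1 - np.abs(h).sum() / (np.sqrt(L) * np.linalg.norm(h) + 1e-12))
```

The proportionate gains make the update concentrate adaptation on the large taps, which suits sparse impulse responses; `sparseness(h)` then serves as one scalar feature alongside the relative acoustic attributes.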