IP Paris
Abstract:As the French, European and worldwide populations are aging, there is a strong interest for new systems that guarantee a reliable and privacy preserving home monitoring for frailty prevention. This work is a part of a global environmental audio analysis system which aims to help identification of Activities of Daily Life (ADL) through human and everyday life sounds recognition, speech presence and number of speakers detection. The focus is made on the number of speakers detection. In this article, we present how recent advances in sound processing and speaker diarization can improve the existing embedded systems. We study the performances of two new methods and discuss the benefits of DNN based approaches which improve performances by about 100%.
Abstract:Speaker diarization, or the task of finding "who spoke and when", is now used in almost every speech processing application. Nevertheless, its fairness has not yet been evaluated because there was no protocol to study its biases one by one. In this paper we propose a protocol and a scoring method designed to evaluate speaker diarization fairness. This protocol is applied on a large dataset of spoken utterances and report the performances of speaker diarization depending on the gender, the age, the accent of the speaker and the length of the spoken sentence. Some biases induced by the gender, or the accent of the speaker were identified when we applied a state-of-the-art speaker diarization method.
Abstract:We describe the system used by our team for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC 2022) in the speaker diarization track. Our solution was designed around a new combination of voice activity detection algorithms that uses the strengths of several systems. We introduce a novel multi stream approach with a decision protocol based on classifiers entropy. We called this method a multi-stream voice activity detection and used it with standard baseline diarization embeddings, clustering and resegmentation. With this work, we successfully demonstrated that using a strong baseline and working only on voice activity detection, one can achieved close to state-of-theart results.