Pre-trained models for speech data based on self-supervised learning, such as Wav2Vec 2.0 (W2V2), have become the backbone of many speech tasks. In this paper, to achieve speaker diarisation and automatic speech recognition (ASR) with a single model, a tandem multitask training (TMT) method is proposed to fine-tune W2V2. Speaker diarisation requires voice activity detection (VAD) and speaker classification (SC), while connectionist temporal classification (CTC) is used for ASR. The multitask framework implements VAD, SC, and ASR using an early, a middle, and a late layer of W2V2, which matches the processing order of segmenting the audio with VAD, clustering the segments based on speaker embeddings, and transcribing each segment with ASR. Experimental results on the augmented multi-party interaction (AMI) dataset showed that assigning VAD, SC, and ASR to progressively later W2V2 layers in TMT not only saves computational cost but also reduces diarisation error rates (DERs). Compared to a baseline with separately fine-tuned models, joint fine-tuning of VAD, SC, and ASR yielded 16%/17% relative DER reductions with manual/automatic segmentation respectively, and consistent reductions in speaker-attributed word error rate.
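To make the layer-wise head placement concrete, the following is a minimal sketch (not the authors' released code) of attaching VAD, SC, and CTC-ASR heads to an early, middle, and late transformer layer of a pre-trained W2V2 encoder. The layer indices, head sizes, number of speakers, and the HuggingFace checkpoint name are illustrative assumptions, not values taken from the paper.

```python
# Sketch: tandem multitask heads on different Wav2Vec 2.0 layers.
# Assumed details (layer indices, num_speakers, vocab_size, checkpoint) are illustrative only.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class TandemMultitaskW2V2(nn.Module):
    def __init__(self, num_speakers=4, vocab_size=32,
                 vad_layer=4, sc_layer=8, asr_layer=12):
        super().__init__()
        self.w2v2 = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        hidden = self.w2v2.config.hidden_size
        self.vad_layer, self.sc_layer, self.asr_layer = vad_layer, sc_layer, asr_layer
        self.vad_head = nn.Linear(hidden, 2)            # frame-level speech / non-speech
        self.sc_head = nn.Linear(hidden, num_speakers)  # speaker logits from pooled embedding
        self.asr_head = nn.Linear(hidden, vocab_size)   # frame-level CTC logits

    def forward(self, waveform):
        out = self.w2v2(waveform, output_hidden_states=True)
        hs = out.hidden_states                          # tuple: (embeddings, layer 1, ..., layer N)
        vad_logits = self.vad_head(hs[self.vad_layer])                 # (B, T, 2)
        sc_logits = self.sc_head(hs[self.sc_layer].mean(dim=1))        # (B, num_speakers)
        asr_logits = self.asr_head(hs[self.asr_layer])                 # (B, T, vocab_size)
        return vad_logits, sc_logits, asr_logits

model = TandemMultitaskW2V2()
vad, sc, asr = model(torch.randn(1, 16000))             # one second of 16 kHz audio
```

Because the VAD and SC heads read intermediate layers, inference for segmentation and speaker clustering can stop early, while only segments passed to ASR need the full forward pass; this is the source of the computational saving described above.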