Title: MT2KD: Towards A General-Purpose Encoder for Speech, Speaker, and Audio Events

Sep 25, 2024

Xiaoyu Yang, Qiujia Li, Chao Zhang, Phil Woodland

Abstract: With the advances in deep learning, the performance of end-to-end (E2E) single-task models for speech and audio processing has been constantly improving. However, it is still challenging to build a general-purpose model with high performance on multiple tasks, since different speech and audio processing tasks usually require different training data, input features, or model architectures to achieve optimal performance. In this work, MT2KD, a novel two-stage multi-task learning framework, is proposed to build a general-purpose speech and audio encoder that jointly performs three fundamental tasks: automatic speech recognition (ASR), audio tagging (AT) and speaker verification (SV). In the first stage, multi-teacher knowledge distillation (KD) is applied to align the feature spaces of three high-performance single-task teacher encoders into a single student encoder using the same unlabelled data. In the second stage, multi-task supervised fine-tuning is carried out by initialising the model from the first stage and training on the separate labelled data of each single task. Experiments demonstrate that the proposed multi-task training pipeline significantly outperforms a baseline model trained with multi-task learning from scratch. Compared to the best-performing single-task encoders, the final system achieves good performance on all three tasks: less than 4% relative word-error-rate increase on ASR, only 1.9 lower mean average precision on AT, and 0.23% absolute higher equal error rate on SV, using only 66M total model parameters.
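
The first-stage idea can be illustrated with a minimal sketch. The abstract does not state the distillation loss, how student features are mapped to each teacher's feature space, or how the three teachers are weighted, so everything below is an assumption chosen for illustration: a per-teacher linear projection, a mean-squared-error distance, and a weighted sum over tasks. The names `project`, `mse`, and `multi_teacher_kd_loss` are hypothetical and not taken from the paper, and a single feature vector stands in for a full sequence of encoder frames.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def project(weights: Matrix, features: Vector) -> Vector:
    """Map student features into one teacher's space (assumed linear head)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def mse(predicted: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(predicted) != len(target):
        raise ValueError("feature dimensions do not match")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def multi_teacher_kd_loss(
    student_features: Vector,
    teacher_features: Dict[str, Vector],
    projections: Dict[str, Matrix],
    task_weights: Dict[str, float],
) -> float:
    """Weighted sum of per-teacher feature-matching losses (assumed form)."""
    total = 0.0
    for task, target in teacher_features.items():
        predicted = project(projections[task], student_features)
        total += task_weights[task] * mse(predicted, target)
    return total


# Toy inputs: one shared student vector and three teacher targets.
student = [1.0, 2.0]
teachers = {"asr": [1.0, 2.0], "at": [2.0], "sv": [2.0, 2.0]}
heads = {
    "asr": [[1.0, 0.0], [0.0, 1.0]],  # identity: matches the ASR teacher exactly
    "at": [[1.0, 1.0]],               # projects to a 1-dimensional AT space
    "sv": [[2.0, 0.0], [0.0, 2.0]],   # scales features for the SV teacher
}
weights = {"asr": 1.0, "at": 0.5, "sv": 0.25}

loss = multi_teacher_kd_loss(student, teachers, heads, weights)
print(loss)
```

Because all three teachers see the same unlabelled input, one forward pass of the student can be compared against all three targets at once. In this sketch that is why a single `student_features` vector is shared across tasks while only the projection heads differ.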
