Title: Self-Supervised Audio-Visual Speech Representations Learning By Multimodal Self-Distillation

Dec 06, 2022

Jing-Xuan Zhang, Genshun Wan, Zhen-Hua Ling, Jia Pan, Jianqing Gao, Cong Liu

Abstract: In this work, we present a novel method, named AV2vec, for learning audio-visual speech representations by multimodal self-distillation. AV2vec has a student and a teacher module, in which the student performs a masked latent feature regression task using the multimodal target features generated online by the teacher. The parameters of the teacher model are a momentum update of the student. Since our target features are generated online, AV2vec, unlike AV-HuBERT, needs no iteration step, and the total training time cost is reduced to less than one-fifth. We further propose AV2vec-MLM in this study, which augments AV2vec with a masked language model (MLM)-style loss using multitask learning. Our experimental results show that AV2vec achieved performance comparable to the AV-HuBERT baseline. When combined with an MLM-style loss, AV2vec-MLM outperformed the baselines and achieved the best performance on the downstream tasks.
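
The two mechanisms named in the abstract, a teacher maintained as a momentum update of the student and a regression loss restricted to masked positions, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the flat parameter lists, the decay value, and the choice of mean squared error as the regression loss are all assumptions made for illustration.

```python
def ema_update(teacher, student, decay):
    """Momentum (EMA) update: teacher <- decay * teacher + (1 - decay) * student.

    `teacher` and `student` are flat lists of floats standing in for model
    parameters (an assumption; real models hold tensors).
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def masked_regression_loss(student_out, teacher_targets, mask):
    """Mean squared error over masked frames only.

    MSE is an assumed choice; the abstract only says "regression".
    Each frame is a list of floats; `mask[i]` is True if frame i was masked.
    """
    errors = [
        (s - t) ** 2
        for s_frame, t_frame, m in zip(student_out, teacher_targets, mask)
        if m
        for s, t in zip(s_frame, t_frame)
    ]
    return sum(errors) / len(errors) if errors else 0.0


# Hypothetical toy values, not taken from the paper.
teacher_params = [1.0, 2.0]
student_params = [3.0, 4.0]
teacher_params = ema_update(teacher_params, student_params, decay=0.5)

loss = masked_regression_loss(
    student_out=[[1.0, 1.0], [0.0, 0.0]],
    teacher_targets=[[0.0, 0.0], [5.0, 5.0]],
    mask=[True, False],  # only the first frame contributes to the loss
)
```

Because the teacher is refreshed after every student step, its targets are always available online, which is the property the abstract credits for removing the iteration step.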

* submitted to ICASSP 2023
