Title: LiteVSR: Efficient Visual Speech Recognition by Learning from Speech Representations of Unlabeled Data

Dec 15, 2023

Hendrik Laux, Emil Mededovic, Ahmed Hallawa, Lukas Martin, Arne Peine, Anke Schmeink

Abstract: This paper proposes a novel, resource-efficient approach to Visual Speech Recognition (VSR) leveraging speech representations produced by any trained Automatic Speech Recognition (ASR) model. Moving away from the resource-intensive trends prevalent in recent literature, our method distills knowledge from a trained Conformer-based ASR model, achieving competitive performance on standard VSR benchmarks with significantly less resource utilization. Using unlabeled audio-visual data only, our baseline model achieves a word error rate (WER) of 47.4% and 54.7% on the LRS2 and LRS3 test benchmarks, respectively. After fine-tuning the model with limited labeled data, the word error rate reduces to 35% (LRS2) and 45.7% (LRS3). Our model can be trained on a single consumer-grade GPU within a few days and is capable of performing real-time end-to-end VSR on dated hardware, suggesting a path towards more accessible and resource-efficient VSR methodologies.

* Accepted for publication at ICASSP 2024
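
The abstract reports results as word error rate (WER). For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and its reference transcript, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not the authors' evaluation code. The function name `word_error_rate`, the whitespace tokenization, and the example sentences are assumptions made for illustration, and the paper's text normalization is not described in this listing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: both strings are already normalized (case, punctuation)
    and words are separated by whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up sentences, not taken from LRS2 or LRS3.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"WER: {wer:.1%}")
```

In this toy example one of six reference words is substituted, so the WER is one sixth. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.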
