Title: WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

Mar 01, 2023

Max Bain, Jaesung Huh, Tengda Han, Andrew Zisserman

Abstract: Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, their application to long audio transcription via buffered or sliding window approaches is prone to drifting, hallucination, and repetition, and it prohibits batched transcription because these approaches are sequential. Further, the timestamps corresponding to each utterance are prone to inaccuracies, and word-level timestamps are not available out-of-the-box. To overcome these challenges, we present WhisperX, a time-accurate speech recognition system with word-level timestamps utilising voice activity detection and forced phoneme alignment. In doing so, we demonstrate state-of-the-art performance on long-form transcription and word segmentation benchmarks. Additionally, we show that pre-segmenting audio with our proposed VAD Cut & Merge strategy improves transcription quality and enables a twelve-fold transcription speedup via batched inference.
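The abstract names the VAD Cut & Merge strategy but does not spell out its rules, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes voice activity detection has already produced a list of (start, end) speech segments in seconds. The function names, the equal-length split of over-long segments, and the greedy merge of neighbours are all illustrative assumptions. The 30-second default reflects Whisper's fixed input window.

```python
import math
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of detected speech


def cut_segments(segments: List[Segment], max_len: float) -> List[Segment]:
    """Split any segment longer than max_len into equal-length pieces.

    Assumption: an equal split is used purely for illustration; a real
    system could choose cut points from the VAD scores instead.
    """
    out: List[Segment] = []
    for start, end in segments:
        duration = end - start
        if duration <= max_len:
            out.append((start, end))
            continue
        n = math.ceil(duration / max_len)
        step = duration / n
        for i in range(n):
            piece_start = start + i * step
            # Pin the last piece to the original end to avoid float drift.
            piece_end = end if i == n - 1 else start + (i + 1) * step
            out.append((piece_start, piece_end))
    return out


def merge_segments(segments: List[Segment], max_len: float) -> List[Segment]:
    """Greedily merge neighbouring segments while the merged span
    (including any silence between them) stays within max_len."""
    merged: List[Segment] = []
    for start, end in segments:
        if merged and end - merged[-1][0] <= max_len:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def cut_and_merge(segments: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Hypothetical helper: cut long segments, then merge short ones."""
    return merge_segments(cut_segments(sorted(segments), max_len), max_len)


if __name__ == "__main__":
    vad_output = [(0.0, 4.0), (5.0, 12.0), (13.0, 28.0), (40.0, 100.0)]
    for chunk in cut_and_merge(vad_output):
        print(chunk)
```

Because every resulting chunk fits within one model window and no chunk depends on the text decoded for the previous one, the chunks can be stacked and transcribed as a batch, which is the property the abstract credits for the speedup.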
