Title: Contextual Phonetic Pretraining for End-to-end Utterance-level Language and Speaker Recognition

Jun 30, 2019

Shaoshi Ling, Julian Salazar, Katrin Kirchhoff

Abstract: Pretrained contextual word representations in NLP have greatly improved performance on various downstream tasks. For speech, we propose contextual frame representations that capture phonetic information at the acoustic frame level and can be used for utterance-level language, speaker, and speech recognition. These representations come from the frame-wise intermediate representations of an end-to-end, self-attentive ASR model (SAN-CTC) on spoken utterances. We first train the model on the Fisher English corpus with context-independent phoneme labels, then use its representations at inference time as features for task-specific models on the NIST LRE07 closed-set language recognition task and a Fisher speaker recognition task, giving significant improvements over the state of the art on both (e.g., language EER of 4.68% on 3-second utterances, 23% relative reduction in speaker EER). Results remain competitive when using a novel dilated convolutional model for language recognition, or when ASR pretraining is done with character labels only.
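The abstract reports results as equal error rate (EER) and as a relative reduction in EER. As a rough illustration of how such numbers are typically computed, here is a minimal sketch. It is not the authors' evaluation code: the function names `equal_error_rate` and `relative_reduction` and the toy scores are assumptions made up for this example, and real evaluations usually interpolate the error curves rather than pick the nearest threshold.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Returns the mean of the false-acceptance and false-rejection rates at the
    threshold where the two rates are closest to each other.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: non-target trials scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction, e.g. 0.25 means the new EER is 25% lower."""
    return (baseline_eer - new_eer) / baseline_eer


# Toy scores, invented for illustration only (not data from the paper).
toy_targets = [0.9, 0.8, 0.7, 0.4]
toy_nontargets = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(toy_targets, toy_nontargets)
```

On these toy scores the two error rates meet at a threshold of 0.6, where one of four target trials is rejected and one of four non-target trials is accepted. A relative reduction is read the same way as in the abstract: it compares a new system's EER against a baseline EER, as a fraction of the baseline.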

* submitted to INTERSPEECH 2019
