Title: A comparison of self-supervised speech representations as input features for unsupervised acoustic word embeddings

Dec 14, 2020

Lisa van Staden, Herman Kamper


Abstract: Many speech processing tasks involve measuring the acoustic similarity between speech segments. Acoustic word embeddings (AWEs) allow for efficient comparisons by mapping speech segments of arbitrary duration to fixed-dimensional vectors. For zero-resource speech processing, where unlabelled speech is the only available resource, some of the best AWE approaches rely on weak top-down constraints in the form of automatically discovered word-like segments. Rather than learning embeddings at the segment level, another line of zero-resource research has looked at representation learning at the short-time frame level. Recent approaches include self-supervised predictive coding and correspondence autoencoder (CAE) models. In this paper we consider whether these frame-level features are beneficial when used as inputs for training an unsupervised AWE model. We compare frame-level features from contrastive predictive coding (CPC), autoregressive predictive coding and a CAE to conventional MFCCs. These are used as inputs to a recurrent CAE-based AWE model. In a word discrimination task on English and Xitsonga data, all three representation learning approaches outperform MFCCs, with CPC consistently showing the biggest improvement. In cross-lingual experiments we find that CPC features trained on English can also be transferred to Xitsonga.

* Accepted to SLT 2021
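
To make concrete the idea of mapping segments of arbitrary duration to fixed-dimensional vectors, here is a minimal sketch. It is not the recurrent CAE-based AWE model from the paper. It uses uniform downsampling of frames as a simple stand-in embedding, and random arrays in place of real MFCC or CPC features. The function names, the 13-dimensional features, and the choice of 10 kept frames are all assumptions made for illustration.

```python
import numpy as np


def downsample_embed(frames: np.ndarray, n_keep: int = 10) -> np.ndarray:
    """Map a (T, D) frame sequence to a fixed (n_keep * D,) vector.

    Picks n_keep roughly equally spaced frames and flattens them.
    This is an illustrative stand-in, not the paper's AWE model.
    """
    idx = np.linspace(0, frames.shape[0] - 1, n_keep).round().astype(int)
    return frames[idx].reshape(-1)


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between two embedding vectors (0 means same direction)."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical frame-level features: two segments of different duration.
rng = np.random.default_rng(0)
seg_a = rng.normal(size=(57, 13))  # 57 frames, 13-dim features (assumed)
seg_b = rng.normal(size=(82, 13))  # 82 frames, same feature dimension

emb_a = downsample_embed(seg_a)
emb_b = downsample_embed(seg_b)

# Both embeddings have the same size, so one vector distance compares them.
dist_ab = cosine_distance(emb_a, emb_b)
```

Because every segment becomes a vector of the same length, comparing two segments costs a single distance computation regardless of how long the segments are. A learned AWE model would replace `downsample_embed`, and learned frame-level features would replace the random inputs.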
