Title: DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon

Jun 22, 2022

Robin Algayres, Tristan Ricoul, Julien Karadayi, Hugo Laurençon, Salah Zaiem, Abdelrahman Mohamed, Benoît Sagot, Emmanuel Dupoux

Abstract: Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark.
