Julien Karadayi

DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon

Jun 22, 2022

Robin Algayres, Tristan Ricoul, Julien Karadayi, Hugo Laurençon, Salah Zaiem, Abdelrahman Mohamed, Benoît Sagot, Emmanuel Dupoux

Abstract: Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark.

Shennong: a Python toolbox for audio speech features extraction

Dec 10, 2021

Mathieu Bernard, Maxime Poli, Julien Karadayi, Emmanuel Dupoux

Abstract: We introduce Shennong, a Python toolbox and command-line utility for speech features extraction. It implements a wide range of well-established state-of-the-art algorithms, including spectro-temporal filters such as Mel-Frequency Cepstral Filterbanks or Predictive Linear Filters, pre-trained neural networks, pitch estimators, as well as speaker normalization methods and post-processing algorithms. Shennong is an open source, easy-to-use, reliable and extensible framework. The use of Python makes integration with other speech modeling and machine learning tools easy. It aims to replace or complement several heterogeneous software packages, such as Kaldi or Praat. After describing the Shennong software architecture, its core components and implemented algorithms, this paper illustrates its use on three applications: a comparison of speech feature performance on a phone discrimination task, an analysis of a Vocal Tract Length Normalization model as a function of the speech duration used for training, and a comparison of pitch estimation algorithms under various noise conditions.

Learning spectro-temporal representations of complex sounds with parameterized neural networks

Mar 12, 2021

Rachid Riad, Julien Karadayi, Anne-Catherine Bachoud-Lévi, Emmanuel Dupoux

Abstract: Deep Learning models have become potential candidates for auditory neuroscience research, thanks to their recent successes on a variety of auditory tasks. Yet, these models often lack the interpretability needed to fully understand the exact computations that have been performed. Here, we proposed a parametrized neural network layer that computes specific spectro-temporal modulations based on Gabor kernels (Learnable STRFs) and that is fully interpretable. We evaluated the predictive capabilities of this layer on Speech Activity Detection, Speaker Verification, Urban Sound Classification and Zebra Finch Call Type Classification. We found that models based on Learnable STRFs are on par with the different toplines on all tasks, and obtain the best performance for Speech Activity Detection. As this layer is fully interpretable, we used quantitative measures to describe the distribution of the learned spectro-temporal modulations. The filters adapted to each task and focused mostly on low temporal and spectral modulations. The analyses show that the filters learned on human speech have spectro-temporal parameters similar to the ones measured directly in the human auditory cortex. Finally, we observed that the tasks were organized in a meaningful way: the human vocalization tasks were close to each other, and bird vocalizations were far away from the human vocalization and urban sound tasks.
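
As a rough illustration of what a Gabor-based spectro-temporal kernel can look like, the NumPy sketch below builds a single 2D Gabor filter on a time-frequency grid. The function name `gabor_strf`, its arguments, and the exact parameterization (a Gaussian envelope multiplied by a cosine carrier) are assumptions made for illustration; they are not taken from the paper and may differ from the layer it describes.

```python
import numpy as np

def gabor_strf(n_time, n_freq, sigma_t, sigma_f, omega_t, omega_f):
    """Real part of a 2D Gabor kernel on a (time, frequency) grid.

    Hypothetical parameterization: a Gaussian envelope with widths
    (sigma_t, sigma_f) times a cosine carrier whose temporal and spectral
    modulation frequencies are (omega_t, omega_f), in radians per bin.
    """
    # Centered coordinate grids along the time and frequency axes.
    t = np.arange(n_time) - (n_time - 1) / 2.0
    f = np.arange(n_freq) - (n_freq - 1) / 2.0
    tt, ff = np.meshgrid(t, f, indexing="ij")
    envelope = np.exp(-0.5 * ((tt / sigma_t) ** 2 + (ff / sigma_f) ** 2))
    carrier = np.cos(omega_t * tt + omega_f * ff)
    return envelope * carrier

# Example: a 9x9 kernel tuned to a slow temporal and spectral modulation.
kernel = gabor_strf(n_time=9, n_freq=9, sigma_t=2.0, sigma_f=2.0,
                    omega_t=0.5, omega_f=0.25)
```

Under this assumed form, each filter is described by a handful of scalars rather than by every kernel weight, which is one way such a layer could stay readable in terms of temporal and spectral modulations.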

The Zero Resource Speech Challenge 2020: Discovering discrete subword and word units

Oct 12, 2020

Ewan Dunbar, Julien Karadayi, Mathieu Bernard, Xuan-Nga Cao, Robin Algayres, Lucas Ondel, Laurent Besacier, Sakriani Sakti, Emmanuel Dupoux

Abstract: We present the Zero Resource Speech Challenge 2020, which aims at learning speech representations from raw audio signals without any labels. It combines the data sets and metrics from two previous benchmarks (2017 and 2019) and features two tasks which tap into two levels of speech representation. The first task is to discover low bit-rate subword representations that optimize the quality of speech synthesis; the second one is to discover word-like units from unsegmented raw speech. We present the results of the twenty submitted models and discuss the implications of the main findings for unsupervised speech learning.

* Proceedings of Interspeech 2020

Libri-Light: A Benchmark for ASR with Limited or No Supervision

Dec 17, 2019

Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen (+5 more)

Abstract: We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.
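
The error rates named in settings (2) and (3), PER, CER and WER, are conventionally computed as an edit distance between a reference and a hypothesis, divided by the reference length. The sketch below shows that standard computation in plain Python; the function names are illustrative and this is not the benchmark's own evaluation code. The ABX metric of setting (1) is a discrimination measure and is not covered here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

def wer(ref_text, hyp_text):
    """Word error rate: sequences are whitespace-separated words."""
    return error_rate(ref_text.split(), hyp_text.split())

def cer(ref_text, hyp_text):
    """Character error rate: sequences are characters."""
    return error_rate(list(ref_text), list(hyp_text))
```

A phone error rate follows the same pattern by passing lists of phone symbols to `error_rate`.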

The Zero Resource Speech Challenge 2019: TTS without T

Apr 25, 2019

Ewan Dunbar, Robin Algayres, Julien Karadayi, Mathieu Bernard, Juan Benjumea, Xuan-Nga Cao, Lucie Miskic, Charlotte Dugrain, Lucas Ondel, Alan W. Black (+3 more)

Abstract: We present the Zero Resource Speech Challenge 2019, which proposes to build a speech synthesizer without any text or phonetic labels: hence, TTS without T (text-to-speech without text). We provide raw audio for a target voice in an unknown language (the Voice dataset), but no alignment, text or labels. Participants must discover subword units in an unsupervised way (using the Unit Discovery dataset) and align them to the voice recordings in a way that works best for the purpose of synthesizing novel utterances from novel speakers, similar to the target speaker's voice. We describe the metrics used for evaluation, a baseline system consisting of unsupervised subword unit discovery plus a standard TTS system, and a topline TTS using gold phoneme transcriptions. We present an overview of the 19 submitted systems from 11 teams and discuss the main results.

* Interspeech 2019

Sampling strategies in Siamese Networks for unsupervised speech representation learning

Aug 23, 2018

Rachid Riad, Corentin Dancette, Julien Karadayi, Neil Zeghidour, Thomas Schatz, Emmanuel Dupoux

Abstract: Recent studies have investigated siamese network architectures for learning invariant speech representations using same-different side information at the word level. Here we investigate systematically an often ignored component of siamese networks: the sampling procedure (how pairs of same vs. different tokens are selected). We show that sampling strategies taking into account Zipf's Law, the distribution of speakers and the proportions of same and different pairs of words significantly impact the performance of the network. In particular, we show that word frequency compression improves learning across a large range of variations in number of training pairs. This effect does not apply to the same extent to the fully unsupervised setting, where the pairs of same-different words are obtained by spoken term discovery. We apply these results to pairs of words discovered using an unsupervised algorithm and show an improvement on state-of-the-art in unsupervised representation learning using siamese networks.

* Conference paper at Interspeech 2018
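
One way to read "word frequency compression" is that word types are sampled with a probability proportional to a compressed function of their corpus frequency, so that the few very frequent words dictated by Zipf's Law dominate the training pairs less. The sketch below uses a power law `count ** alpha` with `alpha < 1`; this choice of compression function, the function names, and the toy corpus are all assumptions for illustration, not the procedure from the paper.

```python
import random
from collections import Counter

def compressed_weights(tokens, alpha=0.5):
    """Sampling probability per word type, proportional to count ** alpha.

    alpha = 1 reproduces the raw frequencies; alpha < 1 flattens them
    (hypothetical compression function, chosen for illustration).
    """
    counts = Counter(tokens)
    weights = {w: c ** alpha for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

def sample_word_types(tokens, n, alpha=0.5, seed=0):
    """Draw n word types according to the compressed distribution."""
    probs = compressed_weights(tokens, alpha)
    rng = random.Random(seed)
    words = sorted(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=n)

# Toy corpus with a steep frequency distribution.
tokens = ["the"] * 16 + ["cat"] * 4 + ["zebra"] * 1
raw = compressed_weights(tokens, alpha=1.0)
compressed = compressed_weights(tokens, alpha=0.5)
drawn = sample_word_types(tokens, n=10)
```

In this toy corpus the share of "the" drops from 16/21 under raw frequencies to 4/7 with `alpha=0.5`, while the rare word "zebra" rises from 1/21 to 1/7.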

The Zero Resource Speech Challenge 2017

Dec 12, 2017

Ewan Dunbar, Xuan Nga Cao, Juan Benjumea, Julien Karadayi, Mathieu Bernard, Laurent Besacier, Xavier Anguera, Emmanuel Dupoux

Abstract: We describe a new challenge aimed at discovering subword and word units from raw speech. This challenge is the follow-up to the Zero Resource Speech Challenge 2015. It aims at constructing systems that generalize across languages and adapt to new speakers. The design features and evaluation metrics of the challenge are presented and the results of seventeen models are discussed.

* IEEE ASRU (Automatic Speech Recognition and Understanding) 2017. Okinawa, Japan
