Yehoshua Dissen

HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing

Jul 10, 2024

Arnon Turetzky, Or Tal, Yael Segal-Feldman, Yehoshua Dissen, Ella Zeldes, Amit Roth, Eyal Cohen, Yosi Shrem, Bronya R. Chernyak, Olga Seleznova(+2 more)

Abstract: We present HebDB, a weakly supervised dataset for spoken language processing in the Hebrew language. HebDB offers roughly 2500 hours of natural and spontaneous speech recordings in the Hebrew language, consisting of a large variety of speakers and topics. We provide raw recordings together with a pre-processed, weakly supervised, and filtered version. The goal of HebDB is to further enhance research and development of spoken language processing tools for the Hebrew language. Hence, we additionally provide two baseline systems for Automatic Speech Recognition (ASR): (i) a self-supervised model; and (ii) a fully supervised model. We present the performance of these two methods optimized on HebDB and compare them to current multi-lingual ASR alternatives. Results suggest that the proposed method outperforms the evaluated baselines at similar model sizes. Dataset, code, and models are publicly available at https://pages.cs.huji.ac.il/adiyoss-lab/HebDB/.

* Accepted at Interspeech 2024


Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network

Jun 27, 2024

Yehoshua Dissen, Shiry Yonash, Israel Cohen, Joseph Keshet


Abstract: In the realm of automatic speech recognition (ASR), robustness in noisy environments remains a significant challenge. Recent ASR models, such as Whisper, have shown promise, but their efficacy in noisy conditions can be further enhanced. This study focuses on recovering from packet loss to improve the word error rate (WER) of ASR models. We propose using a front-end adaptation network connected to a frozen ASR model. The adaptation network is trained to modify the corrupted input spectrum by minimizing the criteria of the ASR model in addition to an enhancement loss function. Our experiments demonstrate that the adaptation network, trained on Whisper's criteria, notably reduces word error rates across domains and languages in packet-loss scenarios. This improvement is achieved with minimal effect on the Whisper model's foundational performance, underscoring our method's practicality and potential in enhancing ASR models in challenging acoustic environments.
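
The training setup described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the packet size, zero-filling of lost packets, the mean-squared-error form of the enhancement loss, and the weighting factor `enh_weight` are all assumptions made for illustration. The sketch shows (i) how packet loss might be simulated on a sequence of spectral frames and (ii) how an ASR criterion and an enhancement loss could be summed into one objective for the adaptation network while the ASR model stays frozen.

```python
import random


def simulate_packet_loss(frames, loss_rate, frames_per_packet=2, seed=0):
    """Zero out whole packets of frames, each with probability loss_rate.

    Zero-filling and the packet size are illustrative assumptions.
    """
    rng = random.Random(seed)
    corrupted = []
    for start in range(0, len(frames), frames_per_packet):
        packet = frames[start:start + frames_per_packet]
        if rng.random() < loss_rate:
            # Lost packet: replace every frame in it with zeros.
            packet = [[0.0] * len(frame) for frame in packet]
        else:
            # Received packet: copy the frames unchanged.
            packet = [list(frame) for frame in packet]
        corrupted.extend(packet)
    return corrupted


def enhancement_loss(adapted, clean):
    """Mean squared error between adapted and clean frames (assumed form)."""
    total, count = 0.0, 0
    for adapted_frame, clean_frame in zip(adapted, clean):
        for a, c in zip(adapted_frame, clean_frame):
            total += (a - c) ** 2
            count += 1
    return total / count


def combined_loss(asr_loss, enh_loss, enh_weight=1.0):
    """Objective for the adaptation network: ASR criterion plus weighted
    enhancement loss. The frozen ASR model only supplies asr_loss and is
    never updated. enh_weight is a hypothetical hyperparameter."""
    return asr_loss + enh_weight * enh_loss
```

In a real system, `asr_loss` would come from running the frozen recognizer on the adapted spectrum, and gradients would flow only into the adaptation network's parameters.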

* Accepted for publication at INTERSPEECH 2024


Self-supervised Speaker Diarization

Apr 08, 2022

Yehoshua Dissen, Felix Kreuk, Joseph Keshet


Abstract: Over the last few years, deep learning has grown in popularity for speaker verification, identification, and diarization. Inarguably, a significant part of this success is due to the demonstrated effectiveness of their speaker representations. These, however, are heavily dependent on large amounts of annotated data and can be sensitive to new domains. This study proposes an entirely unsupervised deep-learning model for speaker diarization. Specifically, the study focuses on generating high-quality neural speaker representations without any annotated data, as well as on estimating secondary hyperparameters of the model without annotations. The speaker embeddings are represented by an encoder trained in a self-supervised fashion using pairs of adjacent segments assumed to be of the same speaker. The trained encoder model is then used to self-generate pseudo-labels to subsequently train a similarity score between different segments of the same call using probabilistic linear discriminant analysis (PLDA) and further to learn a clustering stopping threshold. We compared our model to state-of-the-art unsupervised as well as supervised baselines on the CallHome benchmarks. According to empirical results, our approach outperforms unsupervised methods when only two speakers are present in the call, and is only slightly worse than recent supervised models.
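
Two ideas from this abstract lend themselves to a short sketch: forming positive training pairs from adjacent segments, and agglomerative clustering that stops at a similarity threshold. The code below is a simplified illustration, not the paper's pipeline. In particular, cosine similarity between cluster centroids is swapped in for the PLDA score, the threshold is passed in by hand rather than learned from pseudo-labels, and all function names are hypothetical. Embeddings are assumed to be non-zero vectors.

```python
import math


def adjacent_pairs(segments):
    """Pair each segment with its successor; such neighbors are assumed
    to come from the same speaker and serve as positive examples."""
    return list(zip(segments[:-1], segments[1:]))


def cosine(u, v):
    """Cosine similarity of two non-zero vectors (stand-in for PLDA)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cluster(embeddings, stop_threshold):
    """Greedy agglomerative clustering over segment embeddings.

    Repeatedly merges the two most similar clusters and stops once the
    best available similarity falls below stop_threshold.
    Returns a list of clusters, each a list of segment indices.
    """
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = centroid([embeddings[k] for k in clusters[i]])
                cj = centroid([embeddings[k] for k in clusters[j]])
                score = cosine(ci, cj)
                if best is None or score > best[0]:
                    best = (score, i, j)
        if best[0] < stop_threshold:
            break  # no remaining pair is similar enough to merge
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

The stopping threshold determines how many speakers the clustering reports, which is why the abstract treats estimating it without annotations as part of the problem.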

* Submitted to Interspeech 2022


Domain Adaptation For Formant Estimation Using Deep Learning

Nov 06, 2016

Yehoshua Dissen, Joseph Keshet, Jacob Goldberger, Cynthia Clopper


Abstract: In this paper we present a domain adaptation technique for formant estimation using a deep network. We first train a deep learning network on a small read speech dataset. We then freeze the parameters of the trained network and use several different datasets to train an adaptation layer that makes the obtained network universal in the sense that it works well for a variety of speakers and speech domains with very different characteristics. We evaluated our adapted network on three datasets, each of which has different speaker characteristics and speech styles. The performance of our method compares favorably with alternative methods for formant estimation.
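
The freeze-then-adapt recipe in this abstract can be shown on a toy problem. The sketch below is an assumption-laden illustration rather than the paper's architecture: the "network" is a single scalar affine function, the adaptation layer is a scalar affine map placed in front of it, and the names `frozen_base` and `train_adaptation` are hypothetical. What it does show is the core mechanic: only the adaptation parameters receive gradient updates, while the base parameters stay fixed.

```python
def frozen_base(x, w=2.0, b=1.0):
    """Stand-in for the pre-trained network; w and b are never updated."""
    return w * x + b


def train_adaptation(data, w=2.0, b=1.0, lr=0.01, epochs=1000):
    """Fit an input-side affine adaptation layer x -> a * x + c by
    gradient descent on mean squared error, keeping (w, b) frozen.

    data is a list of (input, target) pairs from the new domain.
    Returns the learned adaptation parameters (a, c).
    """
    a, c = 1.0, 0.0  # start from the identity mapping
    n = len(data)
    for _ in range(epochs):
        grad_a = 0.0
        grad_c = 0.0
        for x, y in data:
            err = frozen_base(a * x + c, w, b) - y
            # Chain rule through the frozen base: d(pred)/da = w * x,
            # d(pred)/dc = w. No gradient is applied to w or b.
            grad_a += 2.0 * err * w * x
            grad_c += 2.0 * err * w
        a -= lr * grad_a / n
        c -= lr * grad_c / n
    return a, c


def mse(data, a, c, w=2.0, b=1.0):
    """Mean squared error of the adapted model on data."""
    return sum((frozen_base(a * x + c, w, b) - y) ** 2 for x, y in data) / len(data)
```

For example, if targets in the new domain were generated as `frozen_base(0.5 * x + 0.3)`, training recovers `a` close to 0.5 and `c` close to 0.3 without touching the base model.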
