Yochai Yemini

LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading

Jun 05, 2023

Yochai Yemini, Aviv Shamsian, Lior Bracha, Sharon Gannot, Ethan Fetaya

Abstract: Lip-to-speech involves generating natural-sounding speech synchronized with a soundless video of a person talking. Despite recent advances, current methods still cannot produce high-quality speech with high levels of intelligibility for challenging and realistic datasets such as LRS3. In this work, we present LipVoicer, a novel method that generates high-quality speech, even for in-the-wild and rich datasets, by incorporating the text modality. Given a silent video, we first predict the spoken text using a pre-trained lip-reading network. We then condition a diffusion model on the video and use the extracted text through a classifier-guidance mechanism where a pre-trained ASR serves as the classifier. LipVoicer outperforms multiple lip-to-speech baselines on LRS2 and LRS3, which are in-the-wild datasets with hundreds of unique speakers in their test sets and an unrestricted vocabulary. Moreover, our experiments show that the inclusion of the text modality plays a major role in the intelligibility of the produced speech, which is readily perceptible while listening and is empirically reflected in a substantial reduction of the WER metric. We demonstrate the effectiveness of LipVoicer through human evaluation, which shows that it produces more natural and synchronized speech signals compared to competing methods. Finally, we created a demo showcasing LipVoicer's superiority in producing natural, synchronized, and intelligible speech, providing additional evidence of its effectiveness. Project page: https://lipvoicer.github.io
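The classifier-guidance mechanism mentioned in the abstract can be illustrated with a minimal sketch. It uses the standard classifier-guidance update for a noise-predicting diffusion model, in which the conditional noise estimate is shifted along the gradient of the classifier's log-likelihood. This is not the LipVoicer implementation: the function names, the guidance scale, and the toy Gaussian "classifier" (standing in for the pre-trained ASR) are all assumptions made for illustration.

```python
import numpy as np

def guided_noise(eps_cond, grad_log_p, alpha_bar_t, scale):
    """Standard classifier-guided noise estimate:
    eps_hat = eps_cond - scale * sqrt(1 - alpha_bar_t) * grad_x log p(y | x_t).
    eps_cond stands in for the video-conditioned denoiser output (hypothetical)."""
    return eps_cond - scale * np.sqrt(1.0 - alpha_bar_t) * grad_log_p

def toy_grad_log_p(x_t, target):
    # Stand-in for the ASR classifier (assumption): log p(y | x_t) is taken to be
    # -0.5 * ||x_t - target||^2 + const, so its gradient w.r.t. x_t is (target - x_t).
    return target - x_t

x_t = np.array([1.0, -2.0, 0.5])        # current noisy sample (toy values)
target = np.zeros(3)                    # point the toy classifier prefers
eps_cond = np.array([0.1, 0.2, -0.3])   # pretend denoiser output

eps_hat = guided_noise(eps_cond, toy_grad_log_p(x_t, target),
                       alpha_bar_t=0.75, scale=2.0)
```

With `scale=0.0` the guided estimate reduces to the unguided one, so the scale controls how strongly the text steers the generated speech.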

GP-Tree: A Gaussian Process Classifier for Few-Shot Incremental Learning

Feb 15, 2021

Idan Achituve, Aviv Navon, Yochai Yemini, Gal Chechik, Ethan Fetaya

Abstract: Gaussian processes (GPs) are non-parametric, flexible models that work well in many tasks. Combining GPs with deep learning methods via deep kernel learning is especially compelling due to the strong expressive power induced by the network. However, inference in GPs, whether with or without deep kernel learning, can be computationally challenging on large datasets. Here, we propose GP-Tree, a novel method for multi-class classification with Gaussian processes and deep kernel learning. We develop a tree-based hierarchical model in which each internal node of the tree fits a GP to the data using the Polya-Gamma augmentation scheme. As a result, our method scales well with both the number of classes and data size. We demonstrate our method's effectiveness against other Gaussian process training baselines, and we show how our general GP approach is easily applied to incremental few-shot learning and reaches state-of-the-art performance.
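One way to picture the tree-based hierarchical model is as a chain of binary decisions: each internal node makes a left/right choice, and a class probability is the product of the choices along the path to its leaf. The sketch below assumes that decomposition and replaces each node's GP with a fixed latent value passed through a sigmoid. The tree layout, node names, and latent values are hypothetical and are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def leaf_probabilities(node, latent, prob=1.0, out=None):
    """Accumulate class probabilities over a binary tree.
    A node is either a class label (leaf) or a tuple (node_id, left, right).
    latent[node_id] stands in for the node's GP output at the test point (assumption)."""
    if out is None:
        out = {}
    if not isinstance(node, tuple):
        out[node] = prob            # reached a leaf: record the path probability
        return out
    node_id, left, right = node
    p_right = sigmoid(latent[node_id])
    leaf_probabilities(left, latent, prob * (1.0 - p_right), out)
    leaf_probabilities(right, latent, prob * p_right, out)
    return out

# Hypothetical three-class tree: the root splits {cat, dog} from {bird}.
tree = ("root", ("a", "cat", "dog"), "bird")
latent = {"root": 0.0, "a": 0.0}
probs = leaf_probabilities(tree, latent)
```

Because every split distributes its incoming probability between two children, the leaf probabilities always sum to one, and adding a class only requires adding a node rather than refitting a single large multi-class model.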

Position-Agnostic Multi-Microphone Speech Dereverberation

Oct 22, 2020

Yochai Yemini, Ethan Fetaya, Haggai Maron, Sharon Gannot

Abstract: Neural networks (NNs) have been widely applied in speech processing tasks, and, in particular, those employing microphone arrays. Nevertheless, most of the existing NN architectures can only deal with fixed and position-specific microphone arrays. In this paper, we present an NN architecture that can cope with microphone arrays about which no prior knowledge is presumed, and demonstrate its applicability to the speech dereverberation problem. To this end, our approach harnesses recent advances in the Deep Sets framework to design an architecture that enhances the reverberant log-spectrum. We provide a setup for training and testing such a network. Our experiments, using REVERB challenge datasets, show that the proposed position-agnostic setup performs comparably with the position-aware framework and sometimes slightly better, even with fewer microphones. In addition, it substantially improves performance over a single-microphone architecture.
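The position-agnostic property can be illustrated with a minimal Deep Sets style layer, in which every microphone channel is processed by shared weights together with the mean over all channels. The layer sizes, random weights, and the `enhance` helper below are assumptions for illustration only and do not reproduce the architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def deep_sets_layer(x, w_self, w_pool, bias):
    """Permutation-equivariant layer over the microphone axis.
    x has shape (num_mics, features); each row sees itself plus the set mean."""
    pooled = x.mean(axis=0, keepdims=True)
    return np.maximum(x @ w_self + pooled @ w_pool + bias, 0.0)  # ReLU

def enhance(x, params):
    # Equivariant layer followed by mean pooling over microphones,
    # which makes the final output invariant to microphone order and count.
    return deep_sets_layer(x, *params).mean(axis=0)

feat, hidden = 8, 4
params = (rng.standard_normal((feat, hidden)),
          rng.standard_normal((feat, hidden)),
          np.zeros(hidden))

mics = rng.standard_normal((5, feat))    # toy features from 5 microphones
out = enhance(mics, params)
out_perm = enhance(mics[::-1], params)   # same microphones, reversed order
out_fewer = enhance(mics[:3], params)    # only 3 microphones, same weights
```

Reordering the microphones leaves the output unchanged up to floating-point rounding, and the same weights accept any number of channels, which is what allows a single network to serve arrays of unknown geometry.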
