Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Tal Peer

LipDiffuser: Lip-to-Speech Generation with Conditional Diffusion Models

May 16, 2025

Danilo de Oliveira, Julius Richter, Tal Peer, Timo Germann

Abstract:We present LipDiffuser, a conditional diffusion model for lip-to-speech generation synthesizing natural and intelligible speech directly from silent video recordings. Our approach leverages the magnitude-preserving ablated diffusion model (MP-ADM) architecture as a denoiser model. To effectively condition the model, we incorporate visual features using magnitude-preserving feature-wise linear modulation (MP-FiLM) alongside speaker embeddings. A neural vocoder then reconstructs the speech waveform from the generated mel-spectrograms. Evaluations on LRS3 and TCD-TIMIT demonstrate that LipDiffuser outperforms existing lip-to-speech baselines in perceptual speech quality and speaker similarity, while remaining competitive in downstream automatic speech recognition (ASR). These findings are also supported by a formal listening experiment. Extensive ablation studies and cross-dataset evaluation confirm the effectiveness and generalization capabilities of our approach.

Via

Access Paper or Ask Questions

Live Iterative Ptychography with projection-based algorithms

Sep 19, 2023

Simon Welker, Tal Peer, Henry N. Chapman, Timo Gerkmann

Abstract:In this work, we demonstrate that the ptychographic phase problem can be solved in a live fashion during scanning, while data is still being collected. We propose a generally applicable modification of the widespread projection-based algorithms such as Error Reduction (ER) and Difference Map (DM). This novel variant of ptychographic phase retrieval enables immediate visual feedback during experiments, reconstruction of arbitrary-sized objects with a fixed amount of computational resources, and adaptive scanning. By building upon the Real-Time Iterative Spectrogram Inversion (RTISI) family of algorithms from the audio processing literature, we show that live variants of projection-based methods such as DM can be derived naturally and may even achieve higher-quality reconstructions than their classic non-live counterparts with comparable effective computational load.

* Submitted to ICASSP 24

Via

Access Paper or Ask Questions

A Flexible Online Framework for Projection-Based STFT Phase Retrieval

Sep 13, 2023

Tal Peer, Simon Welker, Johannes Kolhoff, Timo Gerkmann

Abstract:Several recent contributions in the field of iterative STFT phase retrieval have demonstrated that the performance of the classical Griffin-Lim method can be considerably improved upon. By using the same projection operators as Griffin-Lim, but combining them in innovative ways, these approaches achieve better results in terms of both reconstruction quality and required number of iterations, while retaining a similar computational complexity per iteration. However, like Griffin-Lim, these algorithms operate in an offline manner and thus require an entire spectrogram as input, which is an unrealistic requirement for many real-world speech communication applications. We propose to extend RTISI -- an existing online (frame-by-frame) variant of the Griffin-Lim algorithm -- into a flexible framework that enables straightforward online implementation of any algorithm based on iterative projections. We further employ this framework to implement online variants of the fast Griffin-Lim algorithm, the accelerated Griffin-Lim algorithm, and two algorithms from the optics domain. Evaluation results on speech signals show that, similarly to the offline case, these algorithms can achieve a considerable performance gain compared to RTISI.

* Submitted to ICASSP 24

Via

Access Paper or Ask Questions

On the Behavior of Intrusive and Non-intrusive Speech Enhancement Metrics in Predictive and Generative Settings

Jun 05, 2023

Danilo de Oliveira, Julius Richter, Jean-Marie Lemercier, Tal Peer, Timo Gerkmann

Abstract:Since its inception, the field of deep speech enhancement has been dominated by predictive (discriminative) approaches, such as spectral mapping or masking. Recently, however, novel generative approaches have been applied to speech enhancement, attaining good denoising performance with high subjective quality scores. At the same time, advances in deep learning also allowed for the creation of neural network-based metrics, which have desirable traits such as being able to work without a reference (non-intrusively). Since generatively enhanced speech tends to exhibit radically different residual distortions, its evaluation using instrumental speech metrics may behave differently compared to predictively enhanced speech. In this paper, we evaluate the performance of the same speech enhancement backbone trained under predictive and generative paradigms on a variety of metrics and show that intrusive and non-intrusive measures correlate differently for each paradigm. This analysis motivates the search for metrics that can together paint a complete and unbiased picture of speech enhancement performance, irrespective of the model's training process.

* Submitted to ITG Conference on Speech Communication

Via

Access Paper or Ask Questions

Speech Signal Improvement Using Causal Generative Diffusion Models

Mar 15, 2023

Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Tal Peer, Timo Gerkmann

Figure 1 for Speech Signal Improvement Using Causal Generative Diffusion Models

Figure 2 for Speech Signal Improvement Using Causal Generative Diffusion Models

Abstract:In this paper, we present a causal speech signal improvement system that is designed to handle different types of distortions. The method is based on a generative diffusion model which has been shown to work well in scenarios with missing data and non-linear corruptions. To guarantee causal processing, we modify the network architecture of our previous work and replace global normalization with causal adaptive gain control. We generate diverse training data containing a broad range of distortions. This work was performed in the context of an "ICASSP Signal Processing Grand Challenge" and submitted to the non-real-time track of the "Speech Signal Improvement Challenge 2023", where it was ranked fifth.

* Accepted by ICASSP 2023

Via

Access Paper or Ask Questions

DiffPhase: Generative Diffusion-based STFT Phase Retrieval

Nov 08, 2022

Tal Peer, Simon Welker, Timo Gerkmann

Abstract:Diffusion probabilistic models have been recently used in a variety of tasks, including speech enhancement and synthesis. As a generative approach, diffusion models have been shown to be especially suitable for imputation problems, where missing data is generated based on existing data. Phase retrieval is inherently an imputation problem, where phase information has to be generated based on the given magnitude. In this work we build upon previous work in the speech domain, adapting a speech enhancement diffusion model specifically for STFT phase retrieval. Evaluation using speech quality and intelligibility metrics shows the diffusion approach is well-suited to the phase retrieval task, with performance surpassing both classical and modern methods.

* Submitted to ICASSP 2023

Via

Access Paper or Ask Questions

Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes

Jun 23, 2022

Danilo de Oliveira, Tal Peer, Timo Gerkmann

Figure 1 for Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes

Figure 2 for Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes

Figure 3 for Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes

Figure 4 for Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes

Abstract:The SepFormer architecture shows very good results in speech separation. Like other learned-encoder models, it uses short frames, as they have been shown to obtain better performance in these cases. This results in a large number of frames at the input, which is problematic; since the SepFormer is transformer-based, its computational complexity drastically increases with longer sequences. In this paper, we employ the SepFormer in a speech enhancement task and show that by replacing the learned-encoder features with a magnitude short-time Fourier transform (STFT) representation, we can use long frames without compromising perceptual enhancement performance. We obtained equivalent quality and intelligibility evaluation scores while reducing the number of operations by a factor of approximately 8 for a 10-second utterance.

* Accepted at Interspeech 2022

Via

Access Paper or Ask Questions

Beyond Griffin-Lim: Improved Iterative Phase Retrieval for Speech

May 11, 2022

Tal Peer, Simon Welker, Timo Gerkmann

Figure 1 for Beyond Griffin-Lim: Improved Iterative Phase Retrieval for Speech

Figure 2 for Beyond Griffin-Lim: Improved Iterative Phase Retrieval for Speech

Figure 3 for Beyond Griffin-Lim: Improved Iterative Phase Retrieval for Speech

Figure 4 for Beyond Griffin-Lim: Improved Iterative Phase Retrieval for Speech

Abstract:Phase retrieval is a problem encountered not only in speech and audio processing, but in many other fields such as optics. Iterative algorithms based on non-convex set projections are effective and frequently used for retrieving the phase when only STFT magnitudes are available. While the basic Griffin-Lim algorithm and its variants have been the prevalent method for decades, more recent advances, e.g. in optics, raise the question: Can we do better than Griffin-Lim for speech signals, using the same principle of iterative projection? In this paper we compare the classical algorithms in the speech domain with two modern methods from optics with respect to reconstruction quality and convergence rate. Based on this study, we propose to combine Griffin-Lim with the Difference Map algorithm in a hybrid approach which shows superior results, in terms of both convergence and quality of the final reconstruction.

* Submitted to IWAENC 2022

Via

Access Paper or Ask Questions

Phase-Aware Deep Speech Enhancement: It's All About The Frame Length

Mar 30, 2022

Tal Peer, Timo Gerkmann

Figure 1 for Phase-Aware Deep Speech Enhancement: It's All About The Frame Length

Figure 2 for Phase-Aware Deep Speech Enhancement: It's All About The Frame Length

Figure 3 for Phase-Aware Deep Speech Enhancement: It's All About The Frame Length

Figure 4 for Phase-Aware Deep Speech Enhancement: It's All About The Frame Length

Abstract:While phase-aware speech processing has been receiving increasing attention in recent years, most narrowband STFT approaches with frame lengths of about 32ms show a rather modest impact of phase on overall performance. At the same time, modern deep neural network (DNN)-based approaches, like Conv-TasNet, that implicitly modify both magnitude and phase yield great performance on very short frames (2ms). Motivated by this observation, in this paper we systematically investigate the role of phase and magnitude in DNN-based speech enhancement for different frame lengths. The results show that a phase-aware DNN can take advantage of what previous studies concerning reconstruction of clean speech have shown: When using short frames, the phase spectrum becomes more important while the importance of the magnitude spectrum decreases. Furthermore, our experiments show that when both magnitude and phase are estimated, shorter frames result in a considerably improved performance in a DNN with explicit phase estimation. Contrarily, in the phase-blind case, where only magnitudes are processed, 32ms frames lead to the best performance. We conclude that DNN-based phase estimation benefits from the use of shorter frames and recommend a frame length of about 4ms for future phase-aware deep speech enhancement methods.

* Submitted to INTERSPEECH 2022

Via

Access Paper or Ask Questions

Integrating Statistical Uncertainty into Neural Network-Based Speech Enhancement

Mar 04, 2022

Huajian Fang, Tal Peer, Stefan Wermter, Timo Gerkmann

Figure 1 for Integrating Statistical Uncertainty into Neural Network-Based Speech Enhancement

Figure 2 for Integrating Statistical Uncertainty into Neural Network-Based Speech Enhancement

Figure 3 for Integrating Statistical Uncertainty into Neural Network-Based Speech Enhancement

Figure 4 for Integrating Statistical Uncertainty into Neural Network-Based Speech Enhancement

Abstract:Speech enhancement in the time-frequency domain is often performed by estimating a multiplicative mask to extract clean speech. However, most neural network-based methods perform point estimation, i.e., their output consists of a single mask. In this paper, we study the benefits of modeling uncertainty in neural network-based speech enhancement. For this, our neural network is trained to map a noisy spectrogram to the Wiener filter and its associated variance, which quantifies uncertainty, based on the maximum a posteriori (MAP) inference of spectral coefficients. By estimating the distribution instead of the point estimate, one can model the uncertainty associated with each estimate. We further propose to use the estimated Wiener filter and its uncertainty to build an approximate MAP (A-MAP) estimator of spectral magnitudes, which in turn is combined with the MAP inference of spectral coefficients to form a hybrid loss function to jointly reinforce the estimation. Experimental results on different datasets show that the proposed method can not only capture the uncertainty associated with the estimated filters, but also yield a higher enhancement performance over comparable models that do not take uncertainty into account.

* ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
* \copyright 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Via

Access Paper or Ask Questions