Ritwik Giri

Masked Autoencoders as Universal Speech Enhancer

Feb 02, 2026

Rajalaxmi Rajagopalan, Ritwik Giri, Zhiqiang Tang, Kyu Han

Abstract: Supervised speech enhancement methods have been very successful. However, in practical scenarios clean speech is scarce, so there is a need for self-supervised learning (SSL) based speech enhancement methods that offer comparable enhancement performance and can be applied to other speech-related downstream applications. In this work, we develop a masked autoencoder based universal speech enhancer that is agnostic to the type of distortion affecting speech, can handle multiple distortions simultaneously, and is trained in a self-supervised manner. An augmentation stack adds further distortions to the noisy input data. During pre-training, the masked autoencoder model learns to remove the added distortions while also reconstructing the masked regions of the spectrogram. The pre-trained embeddings are then used by fine-tuning models trained on a small amount of paired data for specific downstream tasks. We evaluate the pre-trained features on the denoising and dereverberation downstream tasks. We explore different augmentations (such as single- or multi-speaker) in the pre-training augmentation stack and the effect of different noisy input feature representations (such as $\mathrm{log1p}$ compression) on pre-trained embeddings and downstream fine-tuning enhancement performance. We show that the proposed method not only outperforms the baseline but also achieves state-of-the-art performance on both in-domain and out-of-domain evaluation datasets.
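
As a rough illustration of two ingredients named in the abstract above, the sketch below applies log1p compression to a magnitude spectrogram and then hides a random subset of it for reconstruction. This is a minimal sketch under assumptions, not the authors' implementation: the function names, the choice to mask whole time frames, the mask ratio, and the zero fill value are all hypothetical.

```python
import math
import random


def log1p_compress(magnitudes):
    """Apply log(1 + x) to every non-negative magnitude bin."""
    return [[math.log1p(v) for v in frame] for frame in magnitudes]


def mask_frames(spec, mask_ratio, rng, mask_value=0.0):
    """Replace a random subset of time frames with mask_value.

    Frame-level masking is an assumption made for this sketch; the
    abstract only says that regions of the spectrogram are masked.
    Returns the masked copy and the sorted indices of hidden frames.
    """
    num_frames = len(spec)
    num_masked = int(round(mask_ratio * num_frames))
    masked_idx = sorted(rng.sample(range(num_frames), num_masked))
    hidden = set(masked_idx)
    masked = [
        [mask_value] * len(frame) if i in hidden else list(frame)
        for i, frame in enumerate(spec)
    ]
    return masked, masked_idx


if __name__ == "__main__":
    # Toy magnitude spectrogram: 4 time frames x 3 frequency bins.
    toy = [[0.0, 1.0, 10.0], [2.0, 0.5, 0.1], [5.0, 5.0, 5.0], [0.2, 0.3, 0.4]]
    features = log1p_compress(toy)
    masked, idx = mask_frames(features, mask_ratio=0.5, rng=random.Random(0))
    print("hidden frames:", idx)
```

In a pre-training setup of this kind, the model would receive `masked` as input and be trained to predict the unmasked (and less distorted) features at the hidden positions.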

Real-time Stereo Speech Enhancement with Spatial-Cue Preservation based on Dual-Path Structure

Feb 01, 2024

Masahito Togami, Jean-Marc Valin, Karim Helwani, Ritwik Giri, Umut Isik, Michael M. Goodwin

Abstract: We introduce a real-time, multichannel speech enhancement algorithm that maintains the spatial cues of stereo recordings including two speech sources. Recognizing that each source has unique spatial information, our method utilizes a dual-path structure, ensuring the spatial cues remain unaffected during enhancement by applying source-specific common-band gain. This method also seamlessly integrates pretrained monaural speech enhancement, eliminating the need for retraining on stereo inputs. Source separation from stereo mixtures is achieved via spatial beamforming, with the steering vector for each source being adaptively updated using the post-enhancement output signal. This ensures accurate tracking of the spatial information. The final stereo output is derived by merging the spatial images of the enhanced sources, with its efficacy not heavily reliant on the separation performance of the beamforming. The algorithm runs in real time on 10 ms frames with 40 ms of look-ahead. Evaluations reveal its effectiveness in enhancing speech and preserving spatial cues in both fully and sparsely overlapped mixtures.

* Accepted for ICASSP 2024, 5 pages

A Framework for Unified Real-time Personalized and Non-Personalized Speech Enhancement

Feb 23, 2023

Zhepei Wang, Ritwik Giri, Devansh Shah, Jean-Marc Valin, Michael M. Goodwin, Paris Smaragdis

Abstract: In this study, we present an approach to train a single speech enhancement network that can perform both personalized and non-personalized speech enhancement. This is achieved by incorporating a frame-wise conditioning input that specifies the type of enhancement output. To improve the quality of the enhanced output and mitigate oversuppression, we experiment with re-weighting frames by the presence or absence of speech activity and applying augmentations to speaker embeddings. By training under a multi-task learning setting, we empirically show that the proposed unified model obtains promising results on both personalized and non-personalized speech enhancement benchmarks and reaches similar performance to models that are trained specifically for either task. The strong performance of the proposed method demonstrates that the unified model is a more economical alternative to keeping separate task-specific models during inference.

* Accepted by ICASSP 2023
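
The frame re-weighting idea mentioned in the abstract above can be sketched as a weighted per-frame loss. This is only an illustration under assumptions: the use of mean squared error, the function name, and the weight values are hypothetical and are not taken from the paper.

```python
def reweighted_frame_loss(pred, target, speech_active,
                          w_speech=1.0, w_silence=2.0):
    """Weighted mean of per-frame squared errors.

    pred, target: lists of frames, each a list of floats.
    speech_active: one boolean per frame (e.g. from a voice activity detector).
    The weights are placeholders; which frame type should be emphasized
    is a design decision this sketch does not settle.
    """
    total = 0.0
    weight_sum = 0.0
    for p, t, active in zip(pred, target, speech_active):
        weight = w_speech if active else w_silence
        frame_err = sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
        total += weight * frame_err
        weight_sum += weight
    return total / weight_sum


if __name__ == "__main__":
    pred = [[1.0, 1.0], [0.0, 0.0]]
    target = [[0.0, 0.0], [0.0, 0.0]]
    # Only the speech-active frame has an error here.
    print(reweighted_frame_loss(pred, target, [True, False]))
```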

Semi-supervised Time Domain Target Speaker Extraction with Attention

Jun 18, 2022

Zhepei Wang, Ritwik Giri, Shrikant Venkataramani, Umut Isik, Jean-Marc Valin, Paris Smaragdis, Mike Goodwin, Arvindh Krishnaswamy

Abstract: In this work, we propose Exformer, a time-domain architecture for target speaker extraction. It consists of a pre-trained speaker embedder network and a separator network based on transformer encoder blocks. We study multiple methods to combine speaker information with the input mixture, and the resulting Exformer architecture obtains superior extraction performance compared to prior time-domain networks. Furthermore, we investigate a two-stage procedure to train the model using mixtures without reference signals upon a pre-trained supervised model. Experimental results show that the proposed semi-supervised learning procedure improves the performance of the supervised baselines.

To Dereverb Or Not to Dereverb? Perceptual Studies On Real-Time Dereverberation Targets

Jun 16, 2022

Jean-Marc Valin, Ritwik Giri, Shrikant Venkataramani, Umut Isik, Arvindh Krishnaswamy

Abstract: In real life, the room effect, also known as room reverberation, and background noise degrade the quality of speech. Recently, deep learning-based speech enhancement approaches have shown a lot of promise and surpassed traditional denoising and dereverberation methods. It is also well established that these state-of-the-art denoising algorithms significantly improve the quality of speech as perceived by human listeners. But the effect of dereverberation on subjective (perceived) speech quality, and whether the additional artifacts introduced by dereverberation cause more harm than good, are still unclear. In this paper, we attempt to answer these questions by evaluating a state-of-the-art speech enhancement system in a comprehensive subjective evaluation study for different choices of dereverberation targets.

* 5 pages

Improved singing voice separation with chromagram-based pitch-aware remixing

Mar 28, 2022

Siyuan Yuan, Zhepei Wang, Umut Isik, Ritwik Giri, Jean-Marc Valin, Michael M. Goodwin, Arvindh Krishnaswamy

Abstract: Singing voice separation aims to separate music into vocals and accompaniment components. One of the major constraints for the task is the limited amount of training data with separated vocals. Data augmentation techniques such as random source mixing have been shown to make better use of existing data and mildly improve model performance. We propose a novel data augmentation technique, chromagram-based pitch-aware remixing, where music segments with high pitch alignment are mixed. By performing controlled experiments in both supervised and semi-supervised settings, we demonstrate that training models with pitch-aware remixing significantly improves the test signal-to-distortion ratio (SDR).

* To appear at ICASSP 2022, 5 pages, 1 figure
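
To make the idea of mixing segments with high pitch alignment more concrete, the sketch below scores candidate accompaniment segments against a vocal segment by comparing 12-bin chroma vectors and mixes the best match. It is a minimal sketch under assumptions: cosine similarity as the alignment measure, the function names, and picking the single best candidate are all hypothetical choices, not the procedure from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def pick_aligned_accompaniment(vocal_chroma, accompaniment_chromas):
    """Return (index, score) of the candidate whose chroma best matches the vocal."""
    scores = [cosine_similarity(vocal_chroma, c) for c in accompaniment_chromas]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def remix(vocal, accompaniment):
    """Sum two equal-length waveforms sample by sample to form a new mixture."""
    return [v + a for v, a in zip(vocal, accompaniment)]


if __name__ == "__main__":
    # Hypothetical time-averaged chroma vectors (12 pitch classes).
    vocal = [1.0, 0, 0, 0, 0.5, 0, 0, 0.8, 0, 0, 0, 0]
    candidates = [
        [0, 1.0, 0, 0, 0, 0, 0.9, 0, 0, 0, 0, 0],    # different pitch classes
        [0.9, 0, 0, 0, 0.6, 0, 0, 0.7, 0, 0, 0, 0],  # similar pitch classes
    ]
    print(pick_aligned_accompaniment(vocal, candidates))
```

Random source mixing would correspond to picking `best` uniformly at random; the pitch-aware variant restricts mixing to pairs with a high alignment score.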

Personalized PercepNet: Real-time, Low-complexity Target Voice Separation and Enhancement

Jun 08, 2021

Ritwik Giri, Shrikant Venkataramani, Jean-Marc Valin, Umut Isik, Arvindh Krishnaswamy

Abstract: The presence of multiple talkers in the surrounding environment poses a difficult challenge for real-time speech communication systems, considering the constraints on network size and complexity. In this paper, we present Personalized PercepNet, a real-time speech enhancement model that separates a target speaker from a noisy multi-talker mixture without compromising on the complexity of the recently proposed PercepNet. To enable speaker-dependent speech enhancement, we first show how we can train a perceptually motivated speaker embedder network to produce a representative embedding vector for the given speaker. Personalized PercepNet uses the target speaker embedding as additional information to pick out and enhance only the target speaker while suppressing all other competing sounds. Our experiments show that the proposed model significantly outperforms PercepNet and other baselines, both in terms of objective speech enhancement metrics and human opinion scores.

* INTERSPEECH 2021, 5 pages

Semi-Supervised Singing Voice Separation with Noisy Self-Training

Feb 16, 2021

Zhepei Wang, Ritwik Giri, Umut Isik, Jean-Marc Valin, Arvindh Krishnaswamy

Abstract: Recent progress in singing voice separation has primarily focused on supervised deep learning methods. However, the scarcity of ground-truth data with clean musical sources has long been a problem. Given a limited set of labeled data, we present a method to leverage a large volume of unlabeled data to improve the model's performance. Following the noisy self-training framework, we first train a teacher network on the small labeled dataset and infer pseudo-labels from the large corpus of unlabeled mixtures. Then, a larger student network is trained on the combined ground-truth and self-labeled datasets. Empirical results show that the proposed self-training scheme, along with data augmentation methods, effectively leverages the large unlabeled corpus and obtains superior performance compared to supervised methods.

* Accepted at 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2021)

Enhancing into the codec: Noise Robust Speech Coding with Vector-Quantized Autoencoders

Feb 12, 2021

Jonah Casebeer, Vinjai Vale, Umut Isik, Jean-Marc Valin, Ritwik Giri, Arvindh Krishnaswamy

Abstract: Audio codecs based on discretized neural autoencoders have recently been developed and shown to provide significantly higher compression levels for comparable-quality speech output. However, these models are tightly coupled with speech content and produce unintended outputs in noisy conditions. Based on VQ-VAE autoencoders with WaveRNN decoders, we develop compressor-enhancer encoders and accompanying decoders, and show that they operate well in noisy conditions. We also observe that a compressor-enhancer model performs better on clean speech inputs than a compressor model trained only on clean speech.

* 5 pages, 2 figures, ICASSP 2021

PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss

Aug 11, 2020

Umut Isik, Ritwik Giri, Neerad Phansalkar, Jean-Marc Valin, Karim Helwani, Arvindh Krishnaswamy

Abstract: Neural network applications generally benefit from larger-sized models, but for current speech enhancement models, larger-scale networks often suffer from decreased robustness to the variety of real-world use cases beyond what is encountered in training data. We introduce several innovations that lead to better large neural networks for speech enhancement. The novel PoCoNet architecture is a convolutional neural network that, with the use of frequency-positional embeddings, is able to more efficiently build frequency-dependent features in the early layers. A semi-supervised method helps increase the amount of conversational training data by pre-enhancing noisy datasets, improving performance on real recordings. A new loss function biased towards preserving speech quality helps the optimization better match human perceptual opinions on speech quality. Ablation experiments and objective and human opinion metrics show the benefits of the proposed improvements.

* 5 pages, 3 figures, INTERSPEECH 2020
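
One way to picture frequency-positional embeddings is as extra input channels whose values depend only on the frequency bin, so that early convolutional layers can tell low bins from high ones. The sketch below builds such channels from cosines at doubling rates. The exact formula, the number of channels, and the function name are assumptions for illustration; the abstract does not specify them.

```python
import math


def frequency_positional_embedding(num_bins, num_channels):
    """Return a num_bins x num_channels table of cosine features.

    Channel i at frequency bin f is cos(2**i * pi * f / num_bins).
    This particular form is an assumed example, not confirmed by the abstract.
    """
    return [
        [math.cos((2 ** i) * math.pi * f / num_bins) for i in range(num_channels)]
        for f in range(num_bins)
    ]


if __name__ == "__main__":
    table = frequency_positional_embedding(num_bins=8, num_channels=3)
    for f, row in enumerate(table):
        print(f, [round(v, 3) for v in row])
```

In use, each row would be broadcast across all time frames and concatenated with the spectrogram features of the corresponding frequency bin before the first layer.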
