Bernd Edler

FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates

Sep 26, 2024

Nicola Pia, Martin Strauss, Markus Multrus, Bernd Edler

Abstract: This paper introduces FlowMAC, a novel neural audio codec for high-quality general audio compression at low bit rates based on conditional flow matching (CFM). FlowMAC jointly learns a mel spectrogram encoder, quantizer, and decoder. At inference time, the decoder integrates a continuous normalizing flow via an ODE solver to generate a high-quality mel spectrogram. This is the first time that a CFM-based approach has been applied to general audio coding, enabling scalable, simple, and memory-efficient training. Our subjective evaluations show that FlowMAC at 3 kbps achieves quality similar to that of state-of-the-art GAN-based and DDPM-based neural audio codecs at double the bit rate. Moreover, FlowMAC offers a tunable inference pipeline that allows complexity to be traded off against quality. This enables real-time coding on CPU while maintaining high perceptual quality.

* Submitted to ICASSP 2025
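
The complexity/quality trade-off described in the abstract comes from how many solver steps are spent integrating the flow ODE at inference time. The following is a minimal sketch of a fixed-step Euler solver, not FlowMAC's actual decoder: `toy_field` is a hypothetical stand-in for the learned, conditioned vector field, and the names `euler_integrate` and `n_steps` are illustrative assumptions.

```python
import math
from typing import Callable, List

Vector = List[float]
VectorField = Callable[[Vector, float], Vector]


def euler_integrate(field: VectorField, x0: Vector, n_steps: int) -> Vector:
    """Integrate dx/dt = field(x, t) from t=0 to t=1 with fixed-step Euler."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    h = 1.0 / n_steps
    x = list(x0)
    for k in range(n_steps):
        t = k * h
        v = field(x, t)  # one vector-field evaluation per step
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


def toy_field(x: Vector, t: float) -> Vector:
    """Hypothetical stand-in for a learned vector field: dx/dt = -x."""
    return [-xi for xi in x]


def max_error(n_steps: int) -> float:
    """Largest deviation from the exact solution x0 * exp(-1) at t=1."""
    x0 = [1.0, -2.0, 0.5]
    exact = [xi * math.exp(-1.0) for xi in x0]
    approx = euler_integrate(toy_field, x0, n_steps)
    return max(abs(a - e) for a, e in zip(approx, exact))


coarse_error = max_error(4)   # cheap: 4 field evaluations
fine_error = max_error(32)    # costlier: 32 field evaluations
print(f"4 steps: {coarse_error:.4f}, 32 steps: {fine_error:.4f}")
```

In this toy setting each step costs one evaluation of the vector field, so fewer steps mean less computation but a larger integration error. In a real codec the field would be a neural network, and the step count is the kind of knob the abstract refers to.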

SEFGAN: Harvesting the Power of Normalizing Flows and GANs for Efficient High-Quality Speech Enhancement

Dec 04, 2023

Martin Strauss, Nicola Pia, Nagashree K. S. Rao, Bernd Edler

Abstract: This paper proposes SEFGAN, a Deep Neural Network (DNN) combining maximum likelihood training and Generative Adversarial Networks (GANs) for efficient speech enhancement (SE). To this end, a DNN is trained to synthesize the enhanced speech conditioned on noisy speech, using a Normalizing Flow (NF) as the generator in a GAN framework. While the combination of likelihood models and GANs is not trivial, SEFGAN demonstrates that a hybrid adversarial and maximum likelihood training approach enables the model to maintain high-quality audio generation and log-likelihood estimation. Our experiments indicate that this approach strongly outperforms the baseline NF-based model without introducing additional complexity to the enhancement network. A comparison using computational metrics and a listening experiment reveals that SEFGAN is competitive with other state-of-the-art models.

* Preprint. Accepted to IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2023

Predicting Preferred Dialogue-to-Background Loudness Difference in Dialogue-Separated Audio

May 31, 2023

Luca Resti, Martin Strauss, Matteo Torcoli, Emanuël Habets, Bernd Edler

Abstract: Dialogue Enhancement (DE) enables the rebalancing of dialogue and background sounds to fit personal preferences and needs in the context of broadcast audio. When individual audio stems are unavailable from production, Dialogue Separation (DS) can be applied to the final audio mixture to obtain estimates of these stems. This work focuses on Preferred Loudness Differences (PLDs) between dialogue and background sounds. While previous studies determined the PLD through a listening test employing original stems from production, stems estimated by DS are used in the present study. In addition, a larger variety of signal classes is considered. PLDs vary substantially across individuals (average interquartile range: 5.7 LU). Despite this variability, PLDs are found to be highly dependent on the signal type under consideration, and it is shown that median PLDs can be predicted using objective intelligibility metrics. Two existing baseline prediction methods, both intended for use with original stems, displayed Mean Absolute Errors (MAEs) of 7.5 LU and 5 LU, respectively. A modified baseline (MAE: 3.2 LU) and an alternative approach (MAE: 2.5 LU) are proposed. Results support the viability of processing final broadcast mixtures with DS and offering an alternative remixing that accounts for median PLDs.

* Paper accepted at the 15th International Conference on Quality of Multimedia Experience (QoMEX), 4 pages, 2 figures
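
The two statistics quoted in the abstract, the Mean Absolute Error of a predictor and the interquartile range of listener preferences, can be sketched as follows. All numbers below are made up for illustration and are not taken from the paper. The quartile method (`method="inclusive"`) is also an assumption, since the abstract does not say how quartiles were computed.

```python
from statistics import quantiles
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Average absolute difference between predictions and observations."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


def interquartile_range(values: Sequence[float]) -> float:
    """Spread between the first and third quartile of the values."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical median PLDs in LU for three items: predicted vs. listening test.
predicted_pld = [10.0, 4.0, 7.0]
observed_pld = [8.0, 5.0, 10.0]
mae_lu = mean_absolute_error(predicted_pld, observed_pld)

# Hypothetical PLDs in LU chosen by five listeners for one item.
listener_pld = [2.0, 4.0, 6.0, 8.0, 10.0]
iqr_lu = interquartile_range(listener_pld)

print(f"MAE: {mae_lu:.1f} LU, IQR: {iqr_lu:.1f} LU")
```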

A DNN Based Post-Filter to Enhance the Quality of Coded Speech in MDCT Domain

Jan 28, 2022

Kishan Gupta, Srikanth Korse, Bernd Edler, Guillaume Fuchs

Abstract: Frequency domain processing, and in particular the use of the Modified Discrete Cosine Transform (MDCT), is the most widespread approach to audio coding. However, at low bitrates, audio quality, especially for speech, degrades drastically due to the lack of available bits to directly code the transform coefficients. Traditionally, post-filtering has been used to mitigate artefacts in the coded speech by exploiting a priori information about the source and extra transmitted parameters. Recently, data-driven post-filters have shown better results, but at the cost of significant additional complexity and delay. In this work, we propose a mask-based post-filter operating directly in the MDCT domain of the codec, inducing no extra delay. The real-valued mask is applied to the quantized MDCT coefficients and is estimated from a relatively lightweight convolutional encoder-decoder network. Our solution is tested on the recently standardized low-delay, low-complexity codec (LC3) at its lowest possible bitrate of 16 kbps. Objective and subjective assessments clearly show the advantage of this approach over the conventional post-filter, with an average improvement of 10 MUSHRA points over the LC3 coded speech.
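
The masking step itself is an element-wise scaling of one frame of decoded coefficients. The sketch below illustrates only that step, under stated assumptions: the mask values are hard-coded placeholders, whereas in the paper they are estimated by a convolutional encoder-decoder network, and the function name `apply_mask` is hypothetical.

```python
from typing import List


def apply_mask(quantized_mdct: List[float], mask: List[float]) -> List[float]:
    """Scale each quantized MDCT coefficient by its real-valued mask entry."""
    if len(quantized_mdct) != len(mask):
        raise ValueError("mask must have one entry per MDCT coefficient")
    return [coeff * gain for coeff, gain in zip(quantized_mdct, mask)]


# One hypothetical frame of quantized MDCT coefficients and a placeholder mask.
frame = [4.0, -2.0, 0.0, 1.0]
mask = [1.0, 0.5, 2.0, 0.0]
enhanced = apply_mask(frame, mask)
print(enhanced)
```

Because the mask works on the coefficients the decoder already holds for the current frame, this step needs no additional transform or look-ahead, which is consistent with the abstract's claim of no extra delay.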

A Hands-on Comparison of DNNs for Dialog Separation Using Transfer Learning from Music Source Separation

Jun 22, 2021

Martin Strauss, Jouni Paulus, Matteo Torcoli, Bernd Edler

Abstract: This paper describes a hands-on comparison of state-of-the-art music source separation deep neural networks (DNNs) before and after task-specific fine-tuning for separating speech content from non-speech content in broadcast audio (i.e., dialog separation). The music separation models are selected because they share the number of channels (2) and sampling rate (44.1 kHz or higher) with the considered broadcast content, and vocals separation in music is considered a parallel to dialog separation in the target application domain. These similarities are assumed to enable transfer learning between the tasks. Three models pre-trained on music (Open-Unmix, Spleeter, and Conv-TasNet) are considered in the experiments and fine-tuned with real broadcast data. The performance of the models is evaluated before and after fine-tuning with computational evaluation metrics (SI-SIRi, SI-SDRi, 2f-model), as well as with a listening test simulating an application where the non-speech signal is partially attenuated, e.g., for better speech intelligibility. The evaluations include two reference systems specifically developed for dialog separation. The results indicate that pre-trained music source separation models can be used for dialog separation to some degree, and that they benefit from the fine-tuning, reaching a performance close to task-specific solutions.

* Accepted at INTERSPEECH 2021
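
Of the metrics named in the abstract, SI-SDRi is simple enough to sketch in a few lines. The code below follows the commonly used definition of scale-invariant SDR and reports the improvement of a separated estimate over the unprocessed mixture. It is an illustration only, not the evaluation code used in the paper, and the signals are made-up four-sample examples.

```python
import math
from typing import Sequence


def si_sdr(estimate: Sequence[float], reference: Sequence[float]) -> float:
    """Scale-invariant signal-to-distortion ratio in dB."""
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0.0:
        raise ValueError("reference signal must not be all zeros")
    # Project the estimate onto the reference to get the scaled target.
    alpha = sum(e * r for e, r in zip(estimate, reference)) / ref_energy
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    if residual_energy == 0.0:
        return math.inf
    if target_energy == 0.0:
        return -math.inf
    return 10.0 * math.log10(target_energy / residual_energy)


def si_sdr_improvement(estimate, mixture, reference) -> float:
    """SI-SDRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Made-up signals: the interference is orthogonal to the reference speech.
speech = [1.0, -1.0, 1.0, -1.0]
interference = [0.5, 0.5, -0.5, -0.5]
mixture = [s + n for s, n in zip(speech, interference)]
estimate = [s + 0.1 * n for s, n in zip(speech, interference)]

improvement_db = si_sdr_improvement(estimate, mixture, speech)
print(f"SI-SDRi: {improvement_db:.2f} dB")
```

Attenuating the interference by a factor of 10 in amplitude yields an improvement of about 20 dB in this toy case, and rescaling the estimate leaves the score unchanged, which is the point of the scale-invariant formulation.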

A Flow-Based Neural Network for Time Domain Speech Enhancement

Jun 16, 2021

Martin Strauss, Bernd Edler

Abstract: Speech enhancement involves the distinction of a target speech signal from an intrusive background. Although generative approaches using Variational Autoencoders or Generative Adversarial Networks (GANs) have increasingly been used in recent years, normalizing flow (NF) based systems are still scarce, despite their success in related fields. Thus, in this paper we propose an NF framework to directly model the enhancement process by density estimation of clean speech utterances conditioned on their noisy counterpart. The WaveGlow model from speech synthesis is adapted to enable direct enhancement of noisy utterances in the time domain. In addition, we demonstrate that nonlinear input companding benefits the model performance by equalizing the distribution of input samples. Experimental evaluation on a publicly available dataset shows comparable results to current state-of-the-art GAN-based approaches, while surpassing the chosen baselines using objective evaluation metrics.

* Accepted to ICASSP 2021
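
To illustrate what nonlinear input companding does to time-domain samples, here is a minimal sketch using μ-law companding. The abstract does not name the companding function, so μ-law with `mu=255` is an assumption chosen because it is a common option for audio. Samples are assumed to lie in [-1, 1].

```python
import math


def mu_law_compress(x: float, mu: float = 255.0) -> float:
    """Map a sample in [-1, 1] to [-1, 1], expanding small amplitudes."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y: float, mu: float = 255.0) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


# Made-up samples: speech amplitudes tend to cluster near zero.
samples = [-0.5, -0.01, 0.0, 0.01, 0.5, 1.0]
companded = [mu_law_compress(s) for s in samples]
restored = [mu_law_expand(c) for c in companded]

for s, c in zip(samples, companded):
    print(f"{s:+.2f} -> {c:+.4f}")
```

Small amplitudes are stretched while large ones are compressed, so values that cluster near zero are spread more evenly over the available range. This is the sense in which companding equalizes the distribution of input samples.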
