Dejan Marković

ComplexDec: A Domain-robust High-fidelity Neural Audio Codec with Complex Spectrum Modeling

Feb 04, 2025

Yi-Chiao Wu, Dejan Marković, Steven Krenn, Israel D. Gebru, Alexander Richard

Abstract: Neural audio codecs have been widely adopted in audio-generative tasks because their compact, discrete representations suit both large-language-model-style and regression-based generative models. However, most neural codecs struggle to model out-of-domain audio, and the resulting errors propagate to downstream generative tasks. In this paper, we first argue that information loss from codec compression degrades out-of-domain robustness. We then propose ComplexDec, a full-band 48 kHz codec with complex spectral input and output that reduces this information loss while operating at the same 24 kbps bitrate as the baselines AudioDec and ScoreDec. Objective and subjective evaluations demonstrate the out-of-domain robustness of ComplexDec trained on only the 30-hour VCTK corpus.

* 5 pages, 2 figures, 2 tables. Proc. ICASSP, 2025
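Why complex spectral input and output can limit information loss is easy to see with a short round trip. The NumPy sketch below is an illustration under stated assumptions, not ComplexDec's actual front end. It uses a hypothetical 512-point periodic Hann STFT with a 128-sample hop and the common convention of stacking real and imaginary parts as two real-valued channels. Inverting the complex spectrum recovers the waveform up to floating-point round-off. Keeping only the magnitude discards the phase, and the reconstruction then fails.

```python
import numpy as np

N_FFT, HOP = 512, 128  # hypothetical STFT settings, not taken from the paper
WIN = np.hanning(N_FFT + 1)[:-1]  # periodic Hann window


def stft(x):
    """Complex STFT of a zero-padded signal; returns (frames, N_FFT // 2 + 1)."""
    x = np.pad(x, N_FFT)  # pad so the original samples get full window overlap
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * WIN for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)


def istft(spec, length):
    """Weighted overlap-add inverse of stft(); crops the padding back off."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=-1) * WIN
    total = (spec.shape[0] - 1) * HOP + N_FFT
    out, norm = np.zeros(total), np.zeros(total)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += WIN ** 2
    nonzero = norm > 1e-8
    out[nonzero] /= norm[nonzero]
    return out[N_FFT:N_FFT + length]


def to_channels(spec):
    """Complex spectrum -> real array of shape (2, frames, bins): real, imag."""
    return np.stack([spec.real, spec.imag])


def from_channels(ch):
    """Inverse of to_channels()."""
    return ch[0] + 1j * ch[1]


rng = np.random.default_rng(0)
x = rng.standard_normal(4800)  # 0.1 s of white noise at 48 kHz

complex_rt = istft(from_channels(to_channels(stft(x))), len(x))
magnitude_only = istft(np.abs(stft(x)), len(x))  # phase set to zero

print(np.max(np.abs(complex_rt - x)))  # on the order of floating-point round-off
print(np.linalg.norm(magnitude_only - x) / np.linalg.norm(x))  # large relative error
```

The two-channel real/imaginary layout is only one way to feed complex spectra to a real-valued network. Magnitude/phase pairs are another. The sketch just shows that the complex representation itself loses nothing.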

ScoreDec: A Phase-preserving High-Fidelity Audio Codec with A Generalized Score-based Diffusion Post-filter

Jan 22, 2024

Yi-Chiao Wu, Dejan Marković, Steven Krenn, Israel D. Gebru, Alexander Richard

Abstract: Although recent mainstream waveform-domain end-to-end (E2E) neural audio codecs achieve impressive coded audio quality at very low bitrates, the quality gap between coded and natural audio is still significant. Generative adversarial network (GAN) training is usually required for these E2E neural codecs because phase is difficult to model directly. However, such adversarial learning prevents these codecs from preserving the original phase information. To achieve human-level naturalness at a reasonable bitrate, preserve the original phase, and avoid tricky and opaque GAN training, we develop a score-based diffusion post-filter (SPF) in the complex spectral domain and combine our previous AudioDec with the SPF to propose ScoreDec, which can be trained using only spectral and score-matching losses. Both objective and subjective experimental results show that ScoreDec with a 24 kbps bitrate encodes and decodes full-band 48 kHz speech with human-level naturalness and well-preserved phase information.

* 5 pages, 3 figures, 2 tables. Proc. ICASSP, 2024
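The abstract says ScoreDec's post-filter is trained with spectral and score-matching losses. The NumPy sketch below illustrates only the score-matching part, as a generic denoising score-matching loss on complex-valued spectra. It treats each complex bin as a pair of real numbers and perturbs both with Gaussian noise of standard deviation `sigma`. It then regresses the model's score onto the score of that perturbation kernel, `-z / sigma`. The helper name `dsm_loss`, the single fixed noise level, and the oracle model used in the check are assumptions for illustration. The paper's actual diffusion process, noise schedule, and conditioning on the AudioDec output are not reproduced here.

```python
import numpy as np


def dsm_loss(score_fn, x0, sigma, rng):
    """Denoising score-matching loss for complex-valued data (hypothetical helper).

    Real and imaginary parts are each perturbed with N(0, sigma^2) noise, so the
    score of the perturbation kernel p(xt | x0) is -(xt - x0) / sigma^2 = -z / sigma.
    The loss is the sigma^2-weighted squared error between model and target score.
    """
    z = rng.standard_normal(x0.shape) + 1j * rng.standard_normal(x0.shape)
    xt = x0 + sigma * z
    s = score_fn(xt, sigma)
    return np.mean(np.abs(sigma * s + z) ** 2)


rng = np.random.default_rng(0)
# Toy stand-in for a clean complex spectrogram (257 bins x 100 frames).
x0 = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
sigma = 0.5


def oracle(xt, sig):
    """Cheats by using x0 directly, so it reproduces the target score exactly."""
    return -(xt - x0) / sig ** 2


def zero_model(xt, sig):
    """Always predicts a zero score; pays E|z|^2 = 2 per bin on average."""
    return np.zeros_like(xt)


print(dsm_loss(oracle, x0, sigma, rng))      # essentially zero
print(dsm_loss(zero_model, x0, sigma, rng))  # close to 2
```

A trained network sees only the noisy input (plus any conditioning), never `x0`. Minimizing this loss therefore pushes it toward the true score of the noisy data distribution rather than toward the oracle.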

AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec

May 26, 2023

Yi-Chiao Wu, Israel D. Gebru, Dejan Marković, Alexander Richard

Abstract: A good audio codec for live applications such as telecommunication is characterized by three key properties: (1) compression, i.e., the bitrate required to transmit the signal should be as low as possible; (2) latency, i.e., encoding and decoding must be fast enough to enable communication with no or only minimal noticeable delay; and (3) reconstruction quality of the signal. In this work, we propose an open-source, streamable, and real-time neural audio codec that achieves strong performance along all three axes: it can reconstruct highly natural-sounding 48 kHz speech signals while operating at only 12 kbps and running with less than 6 ms (GPU)/10 ms (CPU) latency. We also demonstrate an efficient training paradigm for developing such neural audio codecs for real-world scenarios. Both objective and subjective evaluations on the VCTK corpus are provided. In sum, AudioDec is a well-developed plug-and-play benchmark for audio codec applications.

* 5 pages, 1 figure, 5 tables. Proc. ICASSP, 2023
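Figures such as 12 kbps follow from how a vector-quantized codec spends bits. Assuming a residual vector quantizer that sends one index per codebook per latent frame, bitrate = (sample rate / hop size) × (number of codebooks) × log2(codebook size). The sketch below uses a 48 kHz input, a hypothetical 320-sample hop, and hypothetical 1024-entry codebooks. These are illustrative assumptions rather than values stated in the abstract, chosen to show one configuration that lands on 12 kbps.

```python
import math


def codec_bitrate_bps(sample_rate_hz, hop_samples, n_codebooks, codebook_size):
    """Bitrate of a codec that sends one index per codebook per latent frame."""
    frame_rate = sample_rate_hz / hop_samples  # latent frames per second
    bits_per_frame = n_codebooks * math.log2(codebook_size)
    return frame_rate * bits_per_frame


# Hypothetical configuration (assumed, not quoted from the abstract).
rate = codec_bitrate_bps(48_000, 320, n_codebooks=8, codebook_size=1024)
print(rate)  # 12000.0 -> 12 kbps

# Under the same assumptions, doubling the number of codebooks doubles the bitrate.
print(codec_bitrate_bps(48_000, 320, n_codebooks=16, codebook_size=1024))  # 24000.0
```

The second call only shows the arithmetic behind a 24 kbps budget, the rate quoted for ScoreDec and ComplexDec above. It does not claim that those codecs reach 24 kbps this way.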
