Jongmo Sung

Scalable and Efficient Neural Speech Coding

Mar 27, 2021

Kai Zhen, Jongmo Sung, Mi Suk Lee, Seungkwon Beack, Minje Kim

Abstract: This work presents a scalable and efficient neural waveform codec (NWC) for speech compression. We formulate the speech coding problem as an autoencoding task, where a convolutional neural network (CNN) performs encoding and decoding as its feedforward routine. The proposed CNN autoencoder also defines quantization and entropy coding as a trainable module, so the coding artifacts and bitrate control are handled during the optimization process. We achieve efficiency by introducing compact model architectures to our fully convolutional network model, such as gated residual networks and depthwise separable convolution. Furthermore, the proposed models have a scalable architecture, cross-module residual learning (CMRL), to cover a wide range of bitrates. To this end, we employ the residual coding concept to concatenate multiple NWC autoencoding modules, where an NWC module performs residual coding to restore any reconstruction loss that its preceding modules have created. CMRL can scale down to cover lower bitrates as well, for which it employs a linear predictive coding (LPC) module as its first autoencoder. We redefine LPC's quantization as a trainable module to enhance the bit allocation tradeoff between LPC and its following NWC modules. Compared to other autoregressive decoder-based neural speech coders, our decoder has a significantly smaller architecture, e.g., with only 0.12 million parameters, more than 100 times smaller than a WaveNet decoder. Compared to the LPCNet-based speech codec, which leverages the speech production model to reduce the network complexity at low bitrates, ours can scale up to higher bitrates to achieve transparent performance. Our lightweight neural speech coding model achieves subjective scores comparable to AMR-WB in the low bitrate range and provides transparent coding quality at 32 kbps.

* in submission to IEEE/ACM Transactions on Audio, Speech, and Language Processing (IEEE/ACM TASLP)
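
The abstract above names depthwise separable convolution as one source of the model's compactness. As a rough illustration of why that helps, the sketch below counts the weights in a standard 1-D convolution and in its depthwise separable counterpart. The function names and the layer sizes (100 channels, kernel width 9) are hypothetical and chosen only for the arithmetic; they are not taken from the paper.

```python
def conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a standard 1-D convolution (bias terms omitted)."""
    return c_in * c_out * kernel


def separable_conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a depthwise separable 1-D convolution (bias terms omitted)."""
    depthwise = c_in * kernel   # one length-`kernel` filter per input channel
    pointwise = c_in * c_out    # 1x1 convolution that mixes the channels
    return depthwise + pointwise


if __name__ == "__main__":
    c_in, c_out, kernel = 100, 100, 9   # hypothetical layer sizes
    standard = conv1d_params(c_in, c_out, kernel)
    separable = separable_conv1d_params(c_in, c_out, kernel)
    print(f"standard: {standard}, separable: {separable}")
```

The saving grows with the kernel width, because the separable form pays for the kernel once per input channel rather than once per input-output channel pair.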

Psychoacoustic Calibration of Loss Functions for Efficient End-to-End Neural Audio Coding

Dec 31, 2020

Kai Zhen, Mi Suk Lee, Jongmo Sung, Seungkwon Beack, Minje Kim

Abstract: Conventional audio coding technologies commonly leverage human perception of sound, or psychoacoustics, to reduce the bitrate while preserving the perceptual quality of the decoded audio signals. For neural audio codecs, however, the objective nature of the loss function usually leads to suboptimal sound quality as well as high run-time complexity due to the large model size. In this work, we present a psychoacoustic calibration scheme to redefine the loss functions of neural audio coding systems so that they can decode signals more perceptually similar to the reference, yet with a much lower model complexity. The proposed loss function incorporates the global masking threshold, tolerating reconstruction error that corresponds to inaudible artifacts. Experimental results show that the proposed model outperforms a baseline neural codec that is twice as large and consumes 23.4% more bits per second. With the proposed method, a lightweight neural codec, with only 0.9 million parameters, performs near-transparent audio coding comparable to the commercial MPEG-1 Audio Layer III codec at 112 kbps.

* IEEE Signal Processing Letters, vol. 27, pp. 2159-2163, 2020
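
To make the idea of a masking-aware loss concrete, here is a minimal sketch of one simple way a loss could ignore error that falls under a masking threshold. It is an assumption for illustration, not the loss defined in the paper: the function name `masked_noise_loss`, the per-bin inputs, and the "penalize only the excess" rule are all hypothetical. Both inputs are per-frequency-bin power values on the same linear scale.

```python
def masked_noise_loss(error_power, mask_power):
    """Mean error power in excess of the masking threshold.

    error_power: power of the reconstruction error in each frequency bin.
    mask_power:  masking threshold in each bin, same units (assumed given).
    Error below the threshold is treated as inaudible and costs nothing.
    """
    if len(error_power) != len(mask_power):
        raise ValueError("inputs must have the same number of bins")
    excess = [max(0.0, e - m) for e, m in zip(error_power, mask_power)]
    return sum(excess) / len(excess)


if __name__ == "__main__":
    error = [0.5, 2.0, 0.1]   # hypothetical error power per bin
    mask = [1.0, 1.0, 1.0]    # hypothetical masking threshold per bin
    print(masked_noise_loss(error, mask))
```

Under this rule only the second bin contributes, since it is the only one where the error rises above the threshold.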

Cascaded Cross-Module Residual Learning towards Lightweight End-to-End Speech Coding

Jun 18, 2019

Kai Zhen, Jongmo Sung, Mi Suk Lee, Seungkwon Beack, Minje Kim

Abstract: Speech codecs learn compact representations of speech signals to facilitate data transmission. Many recent deep neural network (DNN) based end-to-end speech codecs achieve low bitrates and high perceptual quality at the cost of model complexity. We propose a cross-module residual learning (CMRL) pipeline as a module carrier with each module reconstructing the residual from its preceding modules. CMRL differs from other DNN-based speech codecs in that, rather than modeling the speech compression problem in a single large neural network, it optimizes a series of less-complicated modules in a two-phase training scheme. The proposed method shows better objective performance than AMR-WB and the state-of-the-art DNN-based speech codec with a similar network architecture. As an end-to-end model, it takes raw PCM signals as input, but is also compatible with linear predictive coding (LPC), showing better subjective quality at high bitrates than AMR-WB and OPUS. The gain is achieved by using only 0.9 million trainable parameters, a significantly less complex architecture than the other DNN-based codecs in the literature.

* Accepted for publication in INTERSPEECH 2019
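
The residual cascade described in the abstract above, where each module reconstructs what its predecessors missed, can be sketched without any neural network. In the toy below, a uniform scalar quantizer stands in for each trained module; that substitution, the function names, and the step sizes are assumptions made purely for illustration, and no training is involved.

```python
import math


def quantize(signal, step):
    """Uniform scalar quantizer: a crude stand-in for one coding module."""
    return [step * round(x / step) for x in signal]


def cascade_encode_decode(signal, steps):
    """Run a cascade in which each stage codes the residual left so far.

    Returns the final reconstruction and the peak absolute error
    remaining after each stage.
    """
    residual = list(signal)
    reconstruction = [0.0] * len(signal)
    peak_errors = []
    for step in steps:
        decoded = quantize(residual, step)          # this stage's output
        reconstruction = [r + d for r, d in zip(reconstruction, decoded)]
        residual = [x - r for x, r in zip(signal, reconstruction)]
        peak_errors.append(max(abs(e) for e in residual))
    return reconstruction, peak_errors


if __name__ == "__main__":
    # Hypothetical test signal: a sine sampled at 100 points.
    signal = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
    _, peak_errors = cascade_encode_decode(signal, steps=[0.5, 0.1, 0.02])
    for stage, err in enumerate(peak_errors, start=1):
        print(f"after stage {stage}: peak error {err:.4f}")
```

Each added stage spends more bits and leaves a smaller residual, which is the sense in which a cascade of small modules can cover a range of bitrates.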
