Title: Vec-Tok Speech: speech vectorization and tokenization for neural speech generation

Oct 12, 2023

Xinfa Zhu, Yuanjun Lv, Yi Lei, Tao Li, Wendi He, Hongbin Zhou, Heng Lu, Lei Xie

Abstract: Language models (LMs) have recently flourished in natural language processing and computer vision, generating high-fidelity text and images across various tasks. In contrast, current speech generative models still struggle with speech quality and task generalization. This paper presents Vec-Tok Speech, an extensible framework that supports multiple speech generation tasks and generates expressive, high-fidelity speech. Specifically, we propose a novel speech codec based on speech vectors and semantic tokens. Speech vectors contain the acoustic details that contribute to high-fidelity speech reconstruction, while semantic tokens focus on the linguistic content of speech, facilitating language modeling. Based on the proposed speech codec, Vec-Tok Speech leverages an LM to undertake the core of speech generation. Moreover, Byte-Pair Encoding (BPE) is introduced to reduce the token length and bit rate, which lowers exposure bias and extends context coverage, improving the performance of LMs. Vec-Tok Speech can be used for intra- and cross-lingual zero-shot voice conversion (VC), zero-shot speaking style transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization. Experiments show that Vec-Tok Speech, built on 50k hours of speech, performs better than other SOTA models. Code will be available at https://github.com/BakerBunker/VecTok.

* 15 pages, 2 figures
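
The abstract states that BPE shortens the semantic token sequence. As a rough illustration of that idea only, the sketch below applies a generic BPE merge loop to a list of integer token IDs. It is not the authors' implementation: the function names, the toy token IDs, and the codebook size of 10 are all hypothetical, and the abstract does not specify the actual tokenizer or BPE configuration.

```python
from collections import Counter
from typing import List, Optional, Tuple

Pair = Tuple[int, int]


def most_frequent_pair(seq: List[int]) -> Optional[Pair]:
    """Return the most frequent adjacent pair, or None if no pair repeats."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return None
    pair, count = pairs.most_common(1)[0]
    return pair if count > 1 else None


def merge_pair(seq: List[int], pair: Pair, new_id: int) -> List[int]:
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out: List[int] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(seq: List[int], vocab_size: int, num_merges: int):
    """Greedily merge frequent pairs; new IDs start at `vocab_size` (assumed)."""
    merges: List[Tuple[Pair, int]] = []
    next_id = vocab_size
    for _ in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        seq = merge_pair(seq, pair, next_id)
        merges.append((pair, next_id))
        next_id += 1
    return seq, merges


def expand(seq: List[int], merges: List[Tuple[Pair, int]]) -> List[int]:
    """Undo the merges, recovering the original token sequence."""
    table = {new_id: pair for pair, new_id in merges}
    out: List[int] = []
    stack = list(reversed(seq))
    while stack:
        token = stack.pop()
        if token in table:
            left, right = table[token]
            stack.append(right)
            stack.append(left)
        else:
            out.append(token)
    return out


if __name__ == "__main__":
    # Hypothetical semantic token IDs from a codebook of size 10.
    tokens = [3, 7, 3, 7, 3, 7, 9, 9, 1]
    compressed, merges = learn_bpe(tokens, vocab_size=10, num_merges=2)
    print(len(tokens), "->", len(compressed))
    print(expand(compressed, merges) == tokens)
```

The merges are lossless, so the shorter sequence still maps back to the original tokens. An LM trained on the merged sequence takes fewer autoregressive steps for the same audio, which is the intuition behind the abstract's claim of lower exposure bias and longer context coverage.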
