Title: How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

Jun 15, 2024

Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca Della Libera, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli

Abstract: Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements, speaker identity, and many other audio details. Current audio tokenization methods fall into two categories: semantic tokens, acquired through quantization of Self-Supervised Learning (SSL) models, and neural compression-based tokens (codecs). Although previous studies have benchmarked codec models to identify optimal configurations, the ideal setup for quantizing pretrained SSL models remains unclear. This paper explores the optimal configuration of semantic tokens across discriminative and generative tasks. We propose a scalable solution to train a universal vocoder across multiple SSL layers. Furthermore, an attention mechanism is employed to identify task-specific influential layers, enhancing the adaptability and performance of semantic tokens in diverse audio applications.

* 4 pages, 2 figures, 2 tables, Accepted at Interspeech 2024
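
The two ideas named in the abstract, quantizing SSL layer features into discrete tokens and using attention to weigh layers, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the nearest-centroid (k-means style) quantizer, the dot-product scoring, and every name and shape below (`kmeans_tokenize`, `attend_over_layers`, `scoring_vector`, the random codebooks and embedding tables) are assumptions chosen only to make the example self-contained. In a real system the features would come from a pretrained SSL model and the codebooks and scoring parameters would be learned.

```python
import numpy as np

def kmeans_tokenize(features, centroids):
    """Map each frame (row of `features`) to the index of its nearest centroid."""
    # Squared Euclidean distance between every frame and every centroid.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attend_over_layers(layer_embeddings, scoring_vector):
    """Fuse per-layer token embeddings with per-frame attention weights.

    layer_embeddings: (num_layers, num_frames, dim)
    scoring_vector:   (dim,) hypothetical stand-in for a learned scorer
    Returns fused embeddings (num_frames, dim) and weights (num_frames, num_layers).
    """
    scores = layer_embeddings @ scoring_vector          # (num_layers, num_frames)
    weights = softmax(scores.T, axis=-1)                # (num_frames, num_layers)
    fused = np.einsum("tl,ltd->td", weights, layer_embeddings)
    return fused, weights

rng = np.random.default_rng(0)
num_layers, num_frames, feat_dim, vocab_size, emb_dim = 3, 5, 8, 4, 6

# Random stand-ins (assumptions): SSL hidden states, per-layer codebooks,
# per-layer embedding tables, and the attention scoring vector.
layer_features = rng.normal(size=(num_layers, num_frames, feat_dim))
codebooks = rng.normal(size=(num_layers, vocab_size, feat_dim))
embedding_tables = rng.normal(size=(num_layers, vocab_size, emb_dim))
scoring_vector = rng.normal(size=emb_dim)

# One discrete token per frame and per layer.
tokens = np.stack(
    [kmeans_tokenize(layer_features[l], codebooks[l]) for l in range(num_layers)]
)

# Look up an embedding for each token, then let attention weigh the layers.
layer_embeddings = np.stack(
    [embedding_tables[l][tokens[l]] for l in range(num_layers)]
)
fused, weights = attend_over_layers(layer_embeddings, scoring_vector)
```

Averaging `weights` over frames gives one score per layer, which is one simple way such a mechanism could indicate which layers matter most for a given task.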
