Kyogu Lee

TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument

Feb 13, 2025

Kyungsu Kim, Junghyun Koo, Sungho Lee, Haesun Joung, Kyogu Lee

Abstract: Recent advancements in neural audio codecs have enabled the use of tokenized audio representations in various audio generation tasks, such as text-to-speech, text-to-audio, and text-to-music generation. Leveraging this approach, we propose TokenSynth, a novel neural synthesizer that utilizes a decoder-only transformer to generate desired audio tokens from MIDI tokens and a CLAP (Contrastive Language-Audio Pretraining) embedding, which carries timbre-related information. Our model is capable of performing instrument cloning, text-to-instrument synthesis, and text-guided timbre manipulation without any fine-tuning. This flexibility enables diverse sound design and intuitive timbre control. We evaluated the quality of the synthesized audio, the timbral similarity between synthesized and target audio/text, and synthesis accuracy (i.e., how accurately it follows the input MIDI) using objective measures. TokenSynth demonstrates the potential of leveraging advanced neural audio codecs and transformers to create powerful and versatile neural synthesizers. The source code, model weights, and audio demos are available at: https://github.com/KyungsuKim42/tokensynth

* 5 pages, 1 figure, to be published in ICASSP 2025

Song Form-aware Full-Song Text-to-Lyrics Generation with Multi-Level Granularity Syllable Count Control

Nov 20, 2024

Yunkee Chae, Eunsik Shin, Hwang Suntae, Seungryeol Paik, Kyogu Lee

Abstract: Lyrics generation presents unique challenges, particularly in achieving precise syllable control while adhering to song form structures such as verses and choruses. Conventional line-by-line approaches often lead to unnatural phrasing, underscoring the need for more granular syllable management. We propose a song-form-aware framework for lyrics generation that enables multi-level syllable control at the word, phrase, line, and paragraph levels. Our approach generates complete lyrics conditioned on input text and song form, ensuring alignment with specified syllable constraints. Generated lyrics samples are available at: https://tinyurl.com/lyrics9999

Do Captioning Metrics Reflect Music Semantic Alignment?

Nov 18, 2024

Jinwoo Lee, Kyogu Lee

Abstract: Music captioning has emerged as a promising task, fueled by the advent of advanced language generation models. However, the evaluation of music captioning relies heavily on traditional metrics such as BLEU, METEOR, and ROUGE, which were developed for other domains, without proper justification for their use in this new field. We present cases where traditional metrics are vulnerable to syntactic changes, and show they do not correlate well with human judgments. By addressing these issues, we aim to emphasize the need for a critical reevaluation of how music captions are assessed.

* International Society for Music Information Retrieval (ISMIR) 2024, Late Breaking Demo (LBD)
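
The sensitivity to syntactic changes described in the abstract above can be illustrated with a minimal sketch. It uses a simplified clipped n-gram precision (one ingredient of BLEU, not the full metric), and the three captions are made-up examples rather than data from the paper. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Counts are clipped so a repeated n-gram cannot be credited more
    often than it appears in the reference.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Made-up captions: `reordered` keeps the meaning, `wrong` changes the tempo.
reference = "slow piano ballad with soft female vocals"
reordered = "soft female vocals with slow piano ballad"
wrong = "fast piano ballad with soft female vocals"

for name, caption in [("reordered", reordered), ("wrong", wrong)]:
    p1 = clipped_precision(caption, reference, 1)
    p2 = clipped_precision(caption, reference, 2)
    print(f"{name}: unigram={p1:.2f} bigram={p2:.2f}")
```

In this toy case the reordered caption, which preserves the meaning, gets a lower bigram precision than the caption with the wrong tempo, because reordering breaks more adjacent word pairs than a single substituted word does.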

VRVQ: Variable Bitrate Residual Vector Quantization for Audio Compression

Oct 12, 2024

Yunkee Chae, Woosung Choi, Yuhta Takida, Junghyun Koo, Yukara Ikemiya, Zhi Zhong, Kin Wai Cheuk, Marco A. Martínez-Ramírez, Kyogu Lee, Wei-Hsiang Liao (+1 more)

Abstract: Recent state-of-the-art neural audio compression models have progressively adopted residual vector quantization (RVQ). Despite this success, these models employ a fixed number of codebooks per frame, which can be suboptimal in terms of rate-distortion tradeoff, particularly in scenarios with simple input audio, such as silence. To address this limitation, we propose variable bitrate RVQ (VRVQ) for audio codecs, which allows for more efficient coding by adapting the number of codebooks used per frame. Furthermore, we propose a gradient estimation method for the non-differentiable masking operation that transforms from the importance map to the binary importance mask, improving model training via a straight-through estimator. We demonstrate that the proposed training framework achieves superior results compared to the baseline method and shows further improvement when applied to the current state-of-the-art codec.

* Accepted at NeurIPS 2024 Workshop on Machine Learning and Compression
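
The idea of adapting the number of codebooks per frame can be sketched with a toy residual vector quantizer. This is only an illustration under stated assumptions, not the paper's model: the grid-shaped codebooks, the `stages_from_importance` mapping, and the hand-picked importance values are all hypothetical, and the learned importance map and its gradient estimation are not shown.

```python
import math


def nearest(codebook, vec):
    """Return the codebook entry closest to vec in squared Euclidean distance."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(c, vec)))


def rvq_reconstruct(vec, codebooks, num_stages):
    """Quantize vec with the first num_stages codebooks, each coding the residual."""
    residual = list(vec)
    recon = [0.0] * len(vec)
    for codebook in codebooks[:num_stages]:
        code = nearest(codebook, residual)
        recon = [r + c for r, c in zip(recon, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return recon


def stages_from_importance(importance, max_stages):
    """Hypothetical mapping from an importance value in [0, 1] to a stage count."""
    return max(1, math.ceil(importance * max_stages))


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy codebooks: a 3x3 grid per stage, halving the step size at each stage.
codebooks = [
    [(s * x, s * y) for x in (-1, 0, 1) for y in (-1, 0, 1)]
    for s in (1.0, 0.5, 0.25)
]

# Two toy frames with hand-picked importance: silence needs fewer stages.
frames = [([0.0, 0.0], 0.0), ([0.8, -0.3], 1.0)]
for frame, importance in frames:
    k = stages_from_importance(importance, len(codebooks))
    recon = rvq_reconstruct(frame, codebooks, k)
    bits = k * math.log2(len(codebooks[0]))
    print(f"frame={frame} stages={k} bits={bits:.2f} error={squared_error(frame, recon):.4f}")
```

Because every toy codebook contains the zero vector, adding a stage can never increase the reconstruction error, so the stage count trades bits for distortion frame by frame.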

Variable Bitrate Residual Vector Quantization for Audio Coding

Oct 08, 2024

Yunkee Chae, Woosung Choi, Yuhta Takida, Junghyun Koo, Yukara Ikemiya, Zhi Zhong, Kin Wai Cheuk, Marco A. Martínez-Ramírez, Kyogu Lee, Wei-Hsiang Liao (+1 more)

* Submitted to ICASSP 2025

Hear Your Face: Face-based voice conversion with F0 estimation

Aug 19, 2024

Jaejun Lee, Yoori Oh, Injune Hwang, Kyogu Lee

Abstract: This paper delves into the emerging field of face-based voice conversion, leveraging the unique relationship between an individual's facial features and their vocal characteristics. We present a novel face-based voice conversion framework that particularly utilizes the average fundamental frequency of the target speaker, derived solely from their facial images. Through extensive analysis, our framework demonstrates superior speech generation quality and the ability to align facial features with voice characteristics, including tracking of the target speaker's fundamental frequency.

* Interspeech 2024

GRAFX: An Open-Source Library for Audio Processing Graphs in PyTorch

Aug 06, 2024

Sungho Lee, Marco Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee, Yuki Mitsufuji

Abstract: We present GRAFX, an open-source library designed for handling audio processing graphs in PyTorch. Along with various library functionalities, we describe technical details on the efficient parallel computation of input graphs, signals, and processor parameters on the GPU. Then, we show its example use in a music mixing scenario, where parameters of every differentiable processor in a large graph are optimized via gradient descent. The code is available at https://github.com/sh-lee97/grafx.

* Accepted to DAFx 2024 demo

Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings

Jul 29, 2024

Seungyeon Rhyu, Kichang Yang, Sungjun Cho, Jaehyeon Kim, Kyogu Lee, Moontae Lee

Abstract: Music generation introduces challenging complexities to large language models. Symbolic structures of music often include vertical harmonization as well as horizontal counterpoint, urging various adaptations and enhancements for large-scale Transformers. However, existing works share three major drawbacks: 1) their tokenization requires domain-specific annotations, such as bars and beats, that are typically missing in raw MIDI data; 2) the pure impact of enhancing token embedding methods is hardly examined without domain-specific annotations; and 3) existing works to overcome the aforementioned drawbacks, such as MuseNet, lack reproducibility. To tackle such limitations, we develop a MIDI-based music generation framework inspired by MuseNet, empirically studying two structural embeddings that do not rely on domain-specific annotations. We provide various metrics and insights that can guide suitable encoding to deploy. We also verify that multiple embedding configurations can selectively boost certain musical aspects. By providing open-source implementations via HuggingFace, our findings shed light on leveraging large language models toward practical and reproducible music generation.

* 9 pages, 6 figures, 4 tables

Wavespace: A Highly Explorable Wavetable Generator

Jul 29, 2024

Hazounne Lee, Kihong Kim, Sungho Lee, Kyogu Lee

Abstract: Wavetable synthesis generates quasi-periodic waveforms of musical tones by interpolating a list of waveforms called a wavetable. As generative models that utilize latent representations offer various methods in waveform generation for musical applications, studies in wavetable generation with invertible architecture have also arisen recently. While they are promising, it is still challenging to generate wavetables with detailed controls in disentangling factors within the latent representation. In response, we present Wavespace, a novel framework for wavetable generation that empowers users with enhanced parameter controls. Our model allows users to apply pre-defined conditions to the output wavetables. We employ a variational autoencoder and completely factorize its latent space into different waveform styles. We also condition the generator with auxiliary timbral and morphological descriptors. This way, users can create unique wavetables by independently manipulating each latent subspace and descriptor parameters. Our framework is efficient enough for practical use; we prototyped an oscillator plug-in as a proof of concept for real-time integration of Wavespace within digital audio workstations (DAWs).
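
The first sentence of the abstract above describes wavetable synthesis as interpolation over a list of waveforms. A minimal sketch of a wavetable oscillator follows; it is a generic illustration, not Wavespace itself. The two-frame table (sine and sawtooth), the table size, and all function names are assumptions made for the example.

```python
import math


def make_wavetable(size=64):
    """Build a toy wavetable with two single-cycle frames: sine and sawtooth."""
    sine = [math.sin(2.0 * math.pi * i / size) for i in range(size)]
    saw = [2.0 * i / size - 1.0 for i in range(size)]
    return [sine, saw]


def read_wavetable(wavetable, position, phase):
    """Read one sample with linear interpolation along both axes.

    position: morph index in [0, len(wavetable) - 1] between frames.
    phase: oscillator phase in [0, 1) within one cycle.
    """
    lo = int(math.floor(position))
    hi = min(lo + 1, len(wavetable) - 1)
    frame_mix = position - lo

    size = len(wavetable[0])
    index = phase * size
    i0 = int(index) % size
    i1 = (i0 + 1) % size
    sample_mix = index - int(index)

    def sample(frame):
        return (1.0 - sample_mix) * frame[i0] + sample_mix * frame[i1]

    return (1.0 - frame_mix) * sample(wavetable[lo]) + frame_mix * sample(wavetable[hi])


def render(wavetable, freq, sample_rate, num_samples, position):
    """Render samples with a phase accumulator at a fixed morph position."""
    phase = 0.0
    out = []
    for _ in range(num_samples):
        out.append(read_wavetable(wavetable, position, phase))
        phase = (phase + freq / sample_rate) % 1.0
    return out


table = make_wavetable()
audio = render(table, freq=220.0, sample_rate=44100, num_samples=200, position=0.5)
print(len(audio), min(audio), max(audio))
```

Sweeping `position` over time morphs the timbre from sine toward sawtooth while `freq` sets the pitch independently.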

Differentiable Modal Synthesis for Physical Modeling of Planar String Sound and Motion Simulation

Jul 07, 2024

Jin Woo Lee, Jaehyun Park, Min Jun Choi, Kyogu Lee

Abstract: While significant advancements have been made in music generation and differentiable sound synthesis within machine learning and computer audition, the simulation of instrument vibration guided by physical laws has been underexplored. To address this gap, we introduce a novel model for simulating the spatio-temporal motion of nonlinear strings, integrating modal synthesis and spectral modeling within a neural network framework. Our model leverages physical properties and fundamental frequencies as inputs, outputting string states across time and space that solve the partial differential equation characterizing the nonlinear string. Empirical evaluations demonstrate that the proposed architecture achieves superior accuracy in string motion simulation compared to existing baseline architectures. The code and demo are available online.
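
For readers unfamiliar with modal synthesis, the following is a minimal sketch of the classical linear case: an ideal plucked string fixed at both ends, written as a sum of decaying sinusoidal modes. It is not the paper's nonlinear model or its neural network; the damping law (decay rate proportional to mode number), the parameter values, and the function name are assumptions chosen for illustration.

```python
import math


def plucked_string_displacement(x, t, f0=110.0, pluck_pos=0.5, height=1.0,
                                num_modes=200, damping=1.0):
    """Displacement of an ideal string at position x in [0, 1] and time t.

    Mode n has shape sin(n*pi*x), frequency n*f0, and an initial amplitude
    given by the Fourier sine series of a triangular pluck of the given
    height at pluck_pos. The exponential decay is an assumed damping model.
    """
    scale = 2.0 * height / (math.pi ** 2 * pluck_pos * (1.0 - pluck_pos))
    total = 0.0
    for n in range(1, num_modes + 1):
        amplitude = scale * math.sin(n * math.pi * pluck_pos) / n ** 2
        shape = math.sin(n * math.pi * x)
        decay = math.exp(-damping * n * t)
        total += amplitude * shape * decay * math.cos(2.0 * math.pi * n * f0 * t)
    return total


# String state across space at the initial time: approximately a triangle.
for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"x={x:.2f} u={plucked_string_displacement(x, 0.0):.3f}")
```

Evaluating the function on a grid of `x` and `t` values gives the string state across space and time; sampling it at a single pickup position over time gives an audio signal.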
