Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Sungho Lee

TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument

Feb 13, 2025

Kyungsu Kim, Junghyun Koo, Sungho Lee, Haesun Joung, Kyogu Lee

Abstract:Recent advancements in neural audio codecs have enabled the use of tokenized audio representations in various audio generation tasks, such as text-to-speech, text-to-audio, and text-to-music generation. Leveraging this approach, we propose TokenSynth, a novel neural synthesizer that utilizes a decoder-only transformer to generate desired audio tokens from MIDI tokens and CLAP (Contrastive Language-Audio Pretraining) embedding, which has timbre-related information. Our model is capable of performing instrument cloning, text-to-instrument synthesis, and text-guided timbre manipulation without any fine-tuning. This flexibility enables diverse sound design and intuitive timbre control. We evaluated the quality of the synthesized audio, the timbral similarity between synthesized and target audio/text, and synthesis accuracy (i.e., how accurately it follows the input MIDI) using objective measures. TokenSynth demonstrates the potential of leveraging advanced neural audio codecs and transformers to create powerful and versatile neural synthesizers. The source code, model weights, and audio demos are available at: https://github.com/KyungsuKim42/tokensynth

* 5 pages, 1 figure, to be published in ICASSP 2025

Via

Access Paper or Ask Questions

GRAFX: An Open-Source Library for Audio Processing Graphs in PyTorch

Aug 06, 2024

Sungho Lee, Marco Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee, Yuki Mitsufuji

Figure 1 for GRAFX: An Open-Source Library for Audio Processing Graphs in PyTorch

Figure 2 for GRAFX: An Open-Source Library for Audio Processing Graphs in PyTorch

Figure 3 for GRAFX: An Open-Source Library for Audio Processing Graphs in PyTorch

Abstract:We present GRAFX, an open-source library designed for handling audio processing graphs in PyTorch. Along with various library functionalities, we describe technical details on the efficient parallel computation of input graphs, signals, and processor parameters in GPU. Then, we show its example use under a music mixing scenario, where parameters of every differentiable processor in a large graph are optimized via gradient descent. The code is available at https://github.com/sh-lee97/grafx.

* Accepted to DAFx 2024 demo

Via

Access Paper or Ask Questions

Wavespace: A Highly Explorable Wavetable Generator

Jul 29, 2024

Hazounne Lee, Kihong Kim, Sungho Lee, Kyogu Lee

Figure 1 for Wavespace: A Highly Explorable Wavetable Generator

Figure 2 for Wavespace: A Highly Explorable Wavetable Generator

Figure 3 for Wavespace: A Highly Explorable Wavetable Generator

Figure 4 for Wavespace: A Highly Explorable Wavetable Generator

Abstract:Wavetable synthesis generates quasi-periodic waveforms of musical tones by interpolating a list of waveforms called wavetable. As generative models that utilize latent representations offer various methods in waveform generation for musical applications, studies in wavetable generation with invertible architecture have also arisen recently. While they are promising, it is still challenging to generate wavetables with detailed controls in disentangling factors within the latent representation. In response, we present Wavespace, a novel framework for wavetable generation that empowers users with enhanced parameter controls. Our model allows users to apply pre-defined conditions to the output wavetables. We employ a variational autoencoder and completely factorize its latent space to different waveform styles. We also condition the generator with auxiliary timbral and morphological descriptors. This way, users can create unique wavetables by independently manipulating each latent subspace and descriptor parameters. Our framework is efficient enough for practical use; we prototyped an oscillator plug-in as a proof of concept for real-time integration of Wavespace within digital audio workspaces (DAWs).

Via

Access Paper or Ask Questions

Beat-Aligned Spectrogram-to-Sequence Generation of Rhythm-Game Charts

Nov 22, 2023

Jayeon Yi, Sungho Lee, Kyogu Lee

Abstract:In the heart of "rhythm games" - games where players must perform actions in sync with a piece of music - are "charts", the directives to be given to players. We newly formulate chart generation as a sequence generation task and train a Transformer using a large dataset. We also introduce tempo-informed preprocessing and training procedures, some of which are suggested to be integral for a successful training. Our model is found to outperform the baselines on a large dataset, and is also found to benefit from pretraining and finetuning.

* ISMIR 2023 LBD. Demo videos and code at stet-stet.github.io/goct

Via

Access Paper or Ask Questions

Yet Another Generative Model For Room Impulse Response Estimation

Nov 05, 2023

Sungho Lee, Hyeong-Seok Choi, Kyogu Lee

Abstract:Recent neural room impulse response (RIR) estimators typically comprise an encoder for reference audio analysis and a generator for RIR synthesis. Especially, it is the performance of the generator that directly influences the overall estimation quality. In this context, we explore an alternate generator architecture for improved performance. We first train an autoencoder with residual quantization to learn a discrete latent token space, where each token represents a small time-frequency patch of the RIR. Then, we cast the RIR estimation problem as a reference-conditioned autoregressive token generation task, employing transformer variants that operate across frequency, time, and quantization depth axes. This way, we address the standard blind estimation task and additional acoustic matching problem, which aims to find an RIR that matches the source signal to the target signal's reverberation characteristics. Experimental results show that our system is preferable to other baselines across various evaluation metrics.

* WASPAA 2023

Via

Access Paper or Ask Questions

Exploiting Time-Frequency Conformers for Music Audio Enhancement

Aug 24, 2023

Yunkee Chae, Junghyun Koo, Sungho Lee, Kyogu Lee

Figure 1 for Exploiting Time-Frequency Conformers for Music Audio Enhancement

Figure 2 for Exploiting Time-Frequency Conformers for Music Audio Enhancement

Figure 3 for Exploiting Time-Frequency Conformers for Music Audio Enhancement

Figure 4 for Exploiting Time-Frequency Conformers for Music Audio Enhancement

Abstract:With the proliferation of video platforms on the internet, recording musical performances by mobile devices has become commonplace. However, these recordings often suffer from degradation such as noise and reverberation, which negatively impact the listening experience. Consequently, the necessity for music audio enhancement (referred to as music enhancement from this point onward), involving the transformation of degraded audio recordings into pristine high-quality music, has surged to augment the auditory experience. To address this issue, we propose a music enhancement system based on the Conformer architecture that has demonstrated outstanding performance in speech enhancement tasks. Our approach explores the attention mechanisms of the Conformer and examines their performance to discover the best approach for the music enhancement task. Our experimental results show that our proposed model achieves state-of-the-art performance on single-stem music enhancement. Furthermore, our system can perform general music enhancement with multi-track mixtures, which has not been examined in previous work.

* Accepted by ACM Multimedia 2023

Via

Access Paper or Ask Questions

Blind Estimation of Audio Processing Graph

Mar 15, 2023

Sungho Lee, Jaehyun Park, Seungryeol Paik, Kyogu Lee

Abstract:Musicians and audio engineers sculpt and transform their sounds by connecting multiple processors, forming an audio processing graph. However, most deep-learning methods overlook this real-world practice and assume fixed graph settings. To bridge this gap, we develop a system that reconstructs the entire graph from a given reference audio. We first generate a realistic graph-reference pair dataset and train a simple blind estimation system composed of a convolutional reference encoder and a transformer-based graph decoder. We apply our model to singing voice effects and drum mixing estimation tasks. Evaluation results show that our method can reconstruct complex signal routings, including multi-band processing and sidechaining.

* Accepted to ICASSP 2023

Via

Access Paper or Ask Questions

Global HRTF Interpolation via Learned Affine Transformation of Hyper-conditioned Features

Apr 06, 2022

Jin Woo Lee, Sungho Lee, Kyogu Lee

Figure 1 for Global HRTF Interpolation via Learned Affine Transformation of Hyper-conditioned Features

Figure 2 for Global HRTF Interpolation via Learned Affine Transformation of Hyper-conditioned Features

Figure 3 for Global HRTF Interpolation via Learned Affine Transformation of Hyper-conditioned Features

Figure 4 for Global HRTF Interpolation via Learned Affine Transformation of Hyper-conditioned Features

Abstract:Estimating Head-Related Transfer Functions (HRTFs) of arbitrary source points is essential in immersive binaural audio rendering. Computing each individual's HRTFs is challenging, as traditional approaches require expensive time and computational resources, while modern data-driven approaches are data-hungry. Especially for the data-driven approaches, existing HRTF datasets differ in spatial sampling distributions of source positions, posing a major problem when generalizing the method across multiple datasets. To alleviate this, we propose a deep learning method based on a novel conditioning architecture. The proposed method can predict an HRTF of any position by interpolating the HRTFs of known distributions. Experimental results show that the proposed architecture improves the model's generalizability across datasets with various coordinate systems. Additional demonstrations using coarsened HRTFs demonstrate that the model robustly reconstructs the target HRTFs from the coarsened data.

* Submitted to Interspeech 2022

Via

Access Paper or Ask Questions

Enhanced Correlation Matching based Video Frame Interpolation

Nov 17, 2021

Sungho Lee, Narae Choi, Woong Il Choi

Figure 1 for Enhanced Correlation Matching based Video Frame Interpolation

Figure 2 for Enhanced Correlation Matching based Video Frame Interpolation

Figure 3 for Enhanced Correlation Matching based Video Frame Interpolation

Figure 4 for Enhanced Correlation Matching based Video Frame Interpolation

Abstract:We propose a novel DNN based framework called the Enhanced Correlation Matching based Video Frame Interpolation Network to support high resolution like 4K, which has a large scale of motion and occlusion. Considering the extensibility of the network model according to resolution, the proposed scheme employs the recurrent pyramid architecture that shares the parameters among each pyramid layer for optical flow estimation. In the proposed flow estimation, the optical flows are recursively refined by tracing the location with maximum correlation. The forward warping based correlation matching enables to improve the accuracy of flow update by excluding incorrectly warped features around the occlusion area. Based on the final bi-directional flows, the intermediate frame at arbitrary temporal position is synthesized using the warping and blending network and it is further improved by refinement network. Experiment results demonstrate that the proposed scheme outperforms the previous works at 4K video data and low-resolution benchmark datasets as well in terms of objective and subjective quality with the smallest number of model parameters.

* Accepted to WACV 2022, equal contribution from first two authors

Via

Access Paper or Ask Questions

Differentiable Artificial Reverberation

May 28, 2021

Sungho Lee, Hyeong-Seok Choi, Kyogu Lee

Figure 1 for Differentiable Artificial Reverberation

Figure 2 for Differentiable Artificial Reverberation

Figure 3 for Differentiable Artificial Reverberation

Figure 4 for Differentiable Artificial Reverberation

Abstract:We propose differentiable artificial reverberation (DAR), a family of artificial reverberation (AR) models implemented in a deep learning framework. Combined with the modern deep neural networks (DNNs), the differentiable structure of DAR allows training loss gradients to be back-propagated in an end-to-end manner. Most of the AR models bottleneck training speed when implemented "as is" in the time domain and executed with a parallel processor like GPU due to their infinite impulse response (IIR) filter components. We tackle this by further developing a recently proposed acceleration technique, which borrows the frequency-sampling method (FSM). With the proposed DAR models, we aim to solve an artificial reverberation parameter (ARP) estimation task in a unified approach. We design an ARP estimation network applicable to both analysis-synthesis (RIR-to-ARP) and blind estimation (reverberant-speech-to-ARP) tasks. And using different DAR models only requires slightly a different decoder configuration. This way, the proposed DAR framework overcomes the previous methods' limitations of task-dependency and AR-model-dependency.

* Manuscript in progress

Via

Access Paper or Ask Questions