Shansong Liu

YuE: Scaling Open Foundation Models for Long-Form Music Generation

Mar 11, 2025

Ruibin Yuan, Hanfeng Lin, Shuyue Guo, Ge Zhang, Jiahao Pan, Yongyi Zang, Haohe Liu, Yiming Liang, Wenye Ma, Xingjian Du (+47 more)

Abstract: We tackle the task of long-form music generation, particularly the challenging lyrics-to-song problem, by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe to converge and generalize. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some of the proprietary systems in musicality and vocal agility. In addition, fine-tuning YuE enables additional controls and enhanced support for tail languages. Furthermore, beyond generation, we show that YuE's learned representations can perform well on music understanding tasks, where the results of YuE match or exceed state-of-the-art methods on the MARBLE benchmark. Keywords: lyrics2song, song generation, long-form, foundation model, music generation

* https://github.com/multimodal-art-projection/YuE

MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models

Dec 09, 2024

Shansong Liu, Atin Sakkeer Hussain, Qilong Wu, Chenshuo Sun, Ying Shan

Abstract:Research on large language models has advanced significantly across text, speech, images, and videos. However, multi-modal music understanding and generation remain underexplored due to the lack of well-annotated datasets. To address this, we introduce a dataset with 167.69 hours of multi-modal data, including text, images, videos, and music annotations. Based on this dataset, we propose MuMu-LLaMA, a model that leverages pre-trained encoders for music, images, and videos. For music generation, we integrate AudioLDM 2 and MusicGen. Our evaluation across four tasks--music understanding, text-to-music generation, prompt-based music editing, and multi-modal music generation--demonstrates that MuMu-LLaMA outperforms state-of-the-art models, showing its potential for multi-modal music applications.

Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer

Oct 07, 2024

Siyuan Hou, Shansong Liu, Ruibin Yuan, Wei Xue, Ying Shan, Mangsuo Zhao, Chao Zhang

Abstract:Despite the significant progress in controllable music generation and editing, challenges remain in the quality and length of generated music due to the use of Mel-spectrogram representations and UNet-based model structures. To address these limitations, we propose a novel approach using a Diffusion Transformer (DiT) augmented with an additional control branch using ControlNet. This allows for long-form and variable-length music generation and editing controlled by text and melody prompts. For more precise and fine-grained melody control, we introduce a novel top-$k$ constant-Q Transform representation as the melody prompt, reducing ambiguity compared to previous representations (e.g., chroma), particularly for music with multiple tracks or a wide range of pitch values. To effectively balance the control signals from text and melody prompts, we adopt a curriculum learning strategy that progressively masks the melody prompt, resulting in a more stable training process. Experiments have been performed on text-to-music generation and music-style transfer tasks using open-source instrumental recording data. The results demonstrate that by extending StableAudio, a pre-trained text-controlled DiT model, our approach enables superior melody-controlled editing while retaining good text-to-music generation performance. These results outperform a strong MusicGen baseline in terms of both text-based generation and melody preservation for editing. Audio examples can be found at https://stable-audio-control.github.io/web/.

* 5 pages, 1 figure
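
The abstract above mentions a top-$k$ constant-Q Transform representation as the melody prompt but does not spell out how it is built. As a rough illustration only, the sketch below assumes the idea is to keep, for each time frame, the indices of the k strongest CQT bins. The function name `top_k_bins`, the input layout `(n_bins, n_frames)`, and the selection rule are assumptions for this example, not the authors' implementation.

```python
import numpy as np


def top_k_bins(cqt_mag: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k strongest bins per frame, strongest first.

    cqt_mag: magnitude matrix of shape (n_bins, n_frames) (assumed layout).
    Result:  integer matrix of shape (k, n_frames).
    """
    if not 1 <= k <= cqt_mag.shape[0]:
        raise ValueError("k must be between 1 and the number of bins")
    # Sort bin indices by ascending magnitude within each frame (column).
    order = np.argsort(cqt_mag, axis=0, kind="stable")
    # Keep the last k rows (largest magnitudes) and flip to descending order.
    return order[-k:][::-1]


# Toy input: 3 bins, 2 frames (hypothetical values).
toy = np.array([[0.1, 0.9],
                [0.5, 0.2],
                [0.3, 0.8]])
top2 = top_k_bins(toy, 2)
```

Keeping several candidate bins per frame, rather than a single pitch class, is one plausible way such a representation could stay informative when more than one track is active at once.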

M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models

Nov 28, 2023

Atin Sakkeer Hussain, Shansong Liu, Chenshuo Sun, Ying Shan

Abstract:The current landscape of research leveraging large language models (LLMs) is experiencing a surge. Many works harness the powerful reasoning capabilities of these models to comprehend various modalities, such as text, speech, images, videos, etc. They also utilize LLMs to understand human intention and generate desired outputs like images, videos, and music. However, research that combines both understanding and generation using LLMs is still limited and in its nascent stage. To address this gap, we introduce a Multi-modal Music Understanding and Generation (M$^{2}$UGen) framework that integrates LLM's abilities to comprehend and generate music for different modalities. The M$^{2}$UGen framework is purpose-built to unlock creative potential from diverse sources of inspiration, encompassing music, image, and video through the use of pretrained MERT, ViT, and ViViT models, respectively. To enable music generation, we explore the use of AudioLDM 2 and MusicGen. Bridging multi-modal understanding and music generation is accomplished through the integration of the LLaMA 2 model. Furthermore, we make use of the MU-LLaMA model to generate extensive datasets that support text/image/video-to-music generation, facilitating the training of our M$^{2}$UGen framework. We conduct a thorough evaluation of our proposed framework. The experimental results demonstrate that our model achieves or surpasses the performance of the current state-of-the-art models.

HumTrans: A Novel Open-Source Dataset for Humming Melody Transcription and Beyond

Sep 18, 2023

Shansong Liu, Xu Li, Dian Li, Ying Shan

Abstract:This paper introduces the HumTrans dataset, which is publicly available and primarily designed for humming melody transcription. The dataset can also serve as a foundation for downstream tasks such as humming melody based music generation. It consists of 500 musical compositions of different genres and languages, with each composition divided into multiple segments. In total, the dataset comprises 1000 music segments. To collect this humming dataset, we employed 10 college students, all of whom are either music majors or proficient in playing at least one musical instrument. Each of them hummed every segment twice using the web recording interface provided by our designed website. The humming recordings were sampled at a frequency of 44,100 Hz. During the humming session, the main interface provides a musical score for students to reference, with the melody audio playing simultaneously to aid in capturing both melody and rhythm. The dataset encompasses approximately 56.22 hours of audio, making it the largest known humming dataset to date. The dataset will be released on Hugging Face, and we will provide a GitHub repository containing baseline results and evaluation codes.

Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

Aug 22, 2023

Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, Ying Shan

Abstract:Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions. To address this, we propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files. Our model utilizes audio representations from a pretrained MERT model to extract music features. However, obtaining a suitable dataset for training the MU-LLaMA model remains challenging, as existing publicly accessible audio question answering datasets lack the necessary depth for open-ended music question answering. To fill this gap, we present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset designed for answering open-ended music-related questions. The experiments demonstrate that the proposed MU-LLaMA model, trained on our designed MusicQA dataset, achieves outstanding performance in both music question answering and music caption generation across various metrics, outperforming current state-of-the-art (SOTA) models in both fields and offering a promising advancement in the T2M-Gen research field.

A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion

Jul 06, 2022

Xu Li, Shansong Liu, Ying Shan

Abstract:Typically, singing voice conversion (SVC) depends on an embedding vector, extracted from either a speaker lookup table (LUT) or a speaker recognition network (SRN), to model speaker identity. However, singing contains more expressive speaker characteristics than conversational speech. It is suspected that a single embedding vector may only capture averaged and coarse-grained speaker characteristics, which is insufficient for the SVC task. To this end, this work proposes a novel hierarchical speaker representation framework for SVC, which can capture fine-grained speaker characteristics at different granularity. It consists of an up-sampling stream and three down-sampling streams. The up-sampling stream transforms the linguistic features into audio samples, while one down-sampling stream of the three operates in the reverse direction. It is expected that the temporal statistics of each down-sampling block can represent speaker characteristics at different granularity, which will be engaged in the up-sampling blocks to enhance the speaker modeling. Experiment results verify that the proposed method outperforms both the LUT and SRN based SVC systems. Moreover, the proposed system supports the one-shot SVC with only a few seconds of reference audio.

* Accepted to INTERSPEECH 2022; made some modifications to Fig. 1 so that the system architecture is clearer

Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition

Mar 19, 2022

Shujie Hu, Shansong Liu, Xurong Xie, Mengzhe Geng, Tianzi Wang, Shoukang Hu, Mingyu Cui, Xunying Liu, Helen Meng

Abstract:Articulatory features are inherently invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition (ASR) systems for normal speech. Their practical application to disordered speech recognition is often limited by the difficulty in collecting such specialist data from impaired speakers. This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training before being cross-domain adapted to the 102.7-hour UASpeech corpus and to produce articulatory features. Mixture density networks based neural A2A inversion models were used. A cross-domain feature adaptation network was also used to reduce the acoustic mismatch between the TORGO and UASpeech data. On both tasks, incorporating the A2A generated articulatory features consistently outperformed the baseline hybrid DNN/TDNN, CTC and Conformer based end-to-end systems constructed using acoustic features only. The best multi-modal system incorporating video modality and the cross-domain articulatory features as well as data augmentation and learning hidden unit contributions (LHUC) speaker adaptation produced the lowest published word error rate (WER) of 24.82% on the 16 dysarthric speakers of the benchmark UASpeech task.

* accepted by ICASSP 2022

Recent Progress in the CUHK Dysarthric Speech Recognition System

Jan 15, 2022

Shansong Liu, Mengzhe Geng, Shoukang Hu, Xurong Xie, Mingyu Cui, Jianwei Yu, Xunying Liu, Helen Meng

Abstract: Despite the rapid progress of automatic speech recognition (ASR) technologies in the past few decades, recognition of disordered speech remains a highly challenging task to date. Disordered speech presents a wide spectrum of challenges to current data intensive deep neural networks (DNNs) based ASR technologies that predominantly target normal speech. This paper presents recent research efforts at the Chinese University of Hong Kong (CUHK) to improve the performance of disordered speech recognition systems on the largest publicly available UASpeech dysarthric speech corpus. A set of novel modelling techniques including neural architectural search, data augmentation using spectra-temporal perturbation, model based speaker adaptation and cross-domain generation of visual features within an audio-visual speech recognition (AVSR) system framework were employed to address the above challenges. The combination of these techniques produced the lowest published word error rate (WER) of 25.21% on the UASpeech test set of 16 dysarthric speakers, and an overall WER reduction of 5.4% absolute (17.6% relative) over the CUHK 2018 dysarthric speech recognition system featuring a 6-way DNN system combination and cross adaptation of out-of-domain normal speech data trained systems. Bayesian model adaptation further allows rapid adaptation to individual dysarthric speakers to be performed using as little as 3.06 seconds of speech. The efficacy of these techniques was further demonstrated on a CUDYS Cantonese dysarthric speech recognition task.
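
The absolute and relative WER reductions quoted in the abstract above can be cross-checked with a few lines of arithmetic. The sketch assumes that the 5.4% absolute reduction ends at the reported 25.21% WER, so the implied baseline is their sum; the abstract suggests this but does not state the baseline figure directly, and `relative_reduction` is a helper name chosen for this example.

```python
def relative_reduction(final_wer: float, absolute_reduction: float) -> float:
    """Relative WER reduction in percent, given the final WER and the absolute drop."""
    # Implied baseline WER (assumption: the reduction ends at final_wer).
    baseline = final_wer + absolute_reduction
    return 100.0 * absolute_reduction / baseline


rel = relative_reduction(25.21, 5.4)
print(f"{rel:.1f}% relative")  # prints "17.6% relative"
```

Under that assumption the implied baseline is 30.61%, and the computed relative reduction agrees with the 17.6% stated in the abstract.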

Investigation of Data Augmentation Techniques for Disordered Speech Recognition

Jan 14, 2022

Mengzhe Geng, Xurong Xie, Shansong Liu, Jianwei Yu, Shoukang Hu, Xunying Liu, Helen Meng

Abstract:Disordered speech recognition is a highly challenging task. The underlying neuro-motor conditions of people with speech disorders, often compounded with co-occurring physical disabilities, lead to the difficulty in collecting large quantities of speech required for system development. This paper investigates a set of data augmentation techniques for disordered speech recognition, including vocal tract length perturbation (VTLP), tempo perturbation and speed perturbation. Both normal and disordered speech were exploited in the augmentation process. Variability among impaired speakers in both the original and augmented data was modeled using learning hidden unit contributions (LHUC) based speaker adaptive training. The final speaker adapted system constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation produced up to 2.92% absolute (9.3% relative) word error rate (WER) reduction over the baseline system without data augmentation, and gave an overall WER of 26.37% on the test set containing 16 dysarthric speakers.

* Proceedings of INTERSPEECH 2020
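
The abstract above names speed perturbation as the best-performing augmentation but gives no implementation details. A minimal sketch of the general idea follows: the waveform is resampled so that it plays back faster or slower by a chosen factor. Linear interpolation is used here only as a stand-in for a proper resampler, and the factors 0.9 and 1.1 are illustrative assumptions, not values taken from the abstract.

```python
import numpy as np


def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Return a copy of `wave` sped up (factor > 1) or slowed down (factor < 1).

    Resampling is done by linear interpolation (an assumption for brevity);
    both duration and pitch change, as with tape-speed changes.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = int(round(len(wave) / factor))
    # Fractional positions in the original signal at which to read samples.
    positions = np.arange(n_out) * factor
    return np.interp(positions, np.arange(len(wave)), wave)


# One second of a 440 Hz tone at 16 kHz (hypothetical test signal).
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
augmented = {f: speed_perturb(tone, f) for f in (0.9, 1.0, 1.1)}
```

Each perturbed copy can then be added to the training set alongside the original recording, which is how this kind of augmentation increases the amount of training data.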
