Xu Tan

MoonCast: High-Quality Zero-Shot Podcast Generation

Mar 19, 2025
Via arXiv

AudioX: Diffusion Transformer for Anything-to-Audio Generation

Mar 13, 2025
Via arXiv

YuE: Scaling Open Foundation Models for Long-Form Music Generation

Mar 11, 2025
Via arXiv

The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation

Mar 06, 2025
Via arXiv

Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

Feb 06, 2025
Via arXiv

ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training

Jan 08, 2025
Via arXiv

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

Dec 30, 2024
Via arXiv

Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning

Nov 05, 2024
Via arXiv

CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models

Oct 17, 2024
Via arXiv

Codec-SUPERB @ SLT 2024: A lightweight benchmark for neural audio codec models

Sep 21, 2024
Via arXiv