Picture for Xinfa Zhu

Xinfa Zhu

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Add code
Mar 03, 2025
Viaarxiv icon

Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought

Add code
Feb 25, 2025
Viaarxiv icon

Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

Add code
Feb 06, 2025
Viaarxiv icon

CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions

Add code
Jan 28, 2025
Viaarxiv icon

ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training

Add code
Jan 08, 2025
Figure 1 for ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training
Figure 2 for ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training
Figure 3 for ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training
Figure 4 for ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training
Viaarxiv icon

Autoregressive Speech Synthesis with Next-Distribution Prediction

Add code
Dec 22, 2024
Viaarxiv icon

YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls

Add code
Dec 12, 2024
Figure 1 for YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls
Figure 2 for YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls
Figure 3 for YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls
Figure 4 for YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls
Viaarxiv icon

CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion

Add code
Dec 03, 2024
Figure 1 for CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion
Figure 2 for CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion
Figure 3 for CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion
Figure 4 for CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion
Viaarxiv icon

The NPU-HWC System for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge

Add code
Oct 31, 2024
Viaarxiv icon

Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy

Add code
Jun 14, 2024
Viaarxiv icon