Sheng Zhao

ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training

Jan 08, 2025

Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners

Dec 06, 2024

Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation

Sep 06, 2024

Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech

Jul 17, 2024

Autoregressive Speech Synthesis without Vector Quantization

Jul 11, 2024

E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

Jun 26, 2024

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

Jun 12, 2024

An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS

Jun 09, 2024

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

Jun 08, 2024

Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study

Jun 07, 2024