Ziyang Ma

OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup

Oct 28, 2024

Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap

Oct 22, 2024

SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs

Oct 12, 2024

DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning

Oct 12, 2024

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Oct 09, 2024

CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought

Sep 29, 2024

Progressive Residual Extraction based Pre-training for Speech Representation Learning

Aug 31, 2024

Foundation Models for Music: A Survey

Aug 27, 2024

Language Model Can Listen While Speaking

Aug 05, 2024

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Jul 09, 2024