Ziyang Ma

Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

Oct 14, 2025
Via arXiv

CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models

Aug 24, 2025
Via arXiv

NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025

Jun 16, 2025
Via arXiv

Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens

Jun 10, 2025
Via arXiv

Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling

May 26, 2025
Via arXiv

Towards Reliable Large Audio Language Model

May 25, 2025
Via arXiv

AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

May 22, 2025
Via arXiv

Towards Efficient Multi-Scale Deformable Attention on NPU

May 20, 2025
Via arXiv

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

May 19, 2025
Via arXiv

Towards Flow-Matching-based TTS without Classifier-Free Guidance

Apr 29, 2025
Via arXiv