Zejun Ma

Video Instruction Tuning With Synthetic Data

Oct 03, 2024
Via arXiv

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Jul 10, 2024
Via arXiv

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

Jun 22, 2024
Via arXiv

Can Large Language Models Understand Spatial Audio?

Jun 12, 2024
Via arXiv

SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR

Mar 04, 2024
Via arXiv

SLIT: Boosting Audio-Text Pre-Training via Multi-Stage Learning and Instruction Tuning

Feb 20, 2024
Via arXiv

Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis

Jan 20, 2024
Via arXiv

Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer

Nov 15, 2023
Via arXiv

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Oct 20, 2023
Via arXiv

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

Oct 10, 2023
Via arXiv