Zejun Ma

Improving LLM Video Understanding with 16 Frames Per Second

Mar 18, 2025

Video Instruction Tuning With Synthetic Data

Oct 03, 2024

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Jul 10, 2024

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

Jun 22, 2024

Can Large Language Models Understand Spatial Audio?

Jun 12, 2024

SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR

Mar 04, 2024

SLIT: Boosting Audio-Text Pre-Training via Multi-Stage Learning and Instruction Tuning

Feb 20, 2024

Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis

Jan 20, 2024

Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer

Nov 15, 2023

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Oct 20, 2023