Yali Wang

Shenzhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving

Jan 08, 2025

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

Dec 31, 2024

Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model

Dec 30, 2024

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Dec 26, 2024

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding

Dec 16, 2024

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

Dec 11, 2024

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning

Oct 25, 2024

TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration

Oct 16, 2024

MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

Aug 21, 2024

EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation

Jun 27, 2024