Ziang Yan

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Jun 10, 2026

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

Jun 04, 2026

ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

Jun 04, 2026

Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

Mar 26, 2026

Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning

Jan 30, 2026

DSI-Bench: A Benchmark for Dynamic Spatial Intelligence

Oct 21, 2025

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Apr 10, 2025

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

Jan 21, 2025

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Dec 26, 2024

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning

Oct 25, 2024