Xiang Bai

Huazhong University of Science and Technology

When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

Apr 09, 2026
Via arXiv

PointTPA: Dynamic Network Parameter Adaptation for 3D Scene Understanding

Apr 06, 2026
Via arXiv

MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios

Mar 30, 2026
Via arXiv

Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

Mar 26, 2026
Via arXiv

Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Mar 19, 2026
Via arXiv

Towards Generalizable Robotic Manipulation in Dynamic Environments

Mar 16, 2026
Via arXiv

Multimodal OCR: Parse Anything from Documents

Mar 13, 2026
Via arXiv

Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously

Mar 12, 2026
Via arXiv

TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering

Feb 26, 2026
Via arXiv

ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding

Feb 26, 2026
Via arXiv