Xiang Bai

Huazhong University of Science and Technology

HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation

Jan 24, 2025
Via arXiv

Training-free Ultra Small Model for Universal Sparse Reconstruction in Compressed Sensing

Jan 20, 2025
Via arXiv

VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control

Jan 07, 2025
Via arXiv

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Dec 31, 2024
Via arXiv

MINIMA: Modality Invariant Image Matching

Dec 27, 2024
Via arXiv

Liquid: Language Models are Scalable Multi-modal Generators

Dec 05, 2024
Via arXiv

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

Dec 03, 2024
Via arXiv

Partial Scene Text Retrieval

Nov 15, 2024
Via arXiv

R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models

Oct 23, 2024
Via arXiv

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

Oct 21, 2024
Via arXiv