Xiang Bai

Huazhong University of Science and Technology

Liquid: Language Models are Scalable Multi-modal Generators

Dec 05, 2024
Via arXiv

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

Dec 03, 2024
Via arXiv

Partial Scene Text Retrieval

Nov 15, 2024
Via arXiv

R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models

Oct 23, 2024
Via arXiv

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

Oct 21, 2024
Via arXiv

MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark

Oct 15, 2024
Via arXiv

Parameter-Efficient Fine-Tuning in Spectral Domain for Point Cloud Learning

Oct 10, 2024
Via arXiv

VIRT: Vision Instructed Transformer for Robotic Manipulation

Oct 09, 2024
Via arXiv

PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling

Oct 08, 2024
Via arXiv

Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression

Sep 01, 2024
Via arXiv