Chaoyou Fu

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

Feb 13, 2025

Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

Feb 07, 2025

LUCY: Linguistic Understanding and Control Yielding Early Stage of Her

Jan 27, 2025

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Jan 03, 2025

InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

Dec 12, 2024

T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs

Dec 02, 2024

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

Nov 22, 2024

MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning

Nov 05, 2024

Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

Nov 01, 2024

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

Aug 23, 2024