Picture for Zhe Gan

Zhe Gan

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

Add code
Oct 24, 2024
Viaarxiv icon

Improve Vision Language Model Chain-of-thought Reasoning

Add code
Oct 21, 2024
Figure 1 for Improve Vision Language Model Chain-of-thought Reasoning
Figure 2 for Improve Vision Language Model Chain-of-thought Reasoning
Figure 3 for Improve Vision Language Model Chain-of-thought Reasoning
Figure 4 for Improve Vision Language Model Chain-of-thought Reasoning
Viaarxiv icon

MM-Ego: Towards Building Egocentric Multimodal LLMs

Add code
Oct 09, 2024
Figure 1 for MM-Ego: Towards Building Egocentric Multimodal LLMs
Figure 2 for MM-Ego: Towards Building Egocentric Multimodal LLMs
Figure 3 for MM-Ego: Towards Building Egocentric Multimodal LLMs
Figure 4 for MM-Ego: Towards Building Egocentric Multimodal LLMs
Viaarxiv icon

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Add code
Oct 03, 2024
Viaarxiv icon

Contrastive Localized Language-Image Pre-Training

Add code
Oct 03, 2024
Viaarxiv icon

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Add code
Sep 30, 2024
Viaarxiv icon

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

Add code
Jul 22, 2024
Viaarxiv icon

Understanding Alignment in Multimodal LLMs: A Comprehensive Study

Add code
Jul 02, 2024
Viaarxiv icon

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

Add code
Jul 01, 2024
Viaarxiv icon

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

Add code
Apr 11, 2024
Viaarxiv icon