Wenhu Chen

MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

Dec 06, 2024

VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

Dec 01, 2024

OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

Nov 11, 2024

Harnessing Webpage UIs for Text-Rich Visual Understanding

Oct 17, 2024

MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

Oct 14, 2024

T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design

Oct 08, 2024

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Oct 07, 2024

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Sep 04, 2024

Foundation Models for Music: A Survey

Aug 27, 2024

LongIns: A Challenging Long-context Instruction-based Exam for LLMs

Jun 26, 2024