Picture for Shuhuai Ren

Shuhuai Ren

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Add code
May 31, 2024
Viaarxiv icon

DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models

Add code
May 31, 2024
Viaarxiv icon

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

Add code
Apr 16, 2024
Figure 1 for LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Figure 2 for LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Figure 3 for LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Figure 4 for LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Viaarxiv icon

Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality

Add code
Mar 28, 2024
Viaarxiv icon

TempCompass: Do Video LLMs Really Understand Videos?

Add code
Mar 01, 2024
Figure 1 for TempCompass: Do Video LLMs Really Understand Videos?
Figure 2 for TempCompass: Do Video LLMs Really Understand Videos?
Figure 3 for TempCompass: Do Video LLMs Really Understand Videos?
Figure 4 for TempCompass: Do Video LLMs Really Understand Videos?
Viaarxiv icon

PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain

Add code
Feb 21, 2024
Viaarxiv icon

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Add code
Dec 04, 2023
Viaarxiv icon

VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models

Add code
Nov 29, 2023
Viaarxiv icon

FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation

Add code
Nov 08, 2023
Viaarxiv icon

TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

Add code
Oct 29, 2023
Viaarxiv icon