Picture for Shuhuai Ren

Shuhuai Ren

Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

Add code
Mar 20, 2025
Viaarxiv icon

UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

Add code
Mar 13, 2025
Viaarxiv icon

Next Block Prediction: Video Generation via Semi-Autoregressive Modeling

Add code
Feb 12, 2025
Viaarxiv icon

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

Add code
Dec 30, 2024
Figure 1 for Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Figure 2 for Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Figure 3 for Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Figure 4 for Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Viaarxiv icon

Parallelized Autoregressive Visual Generation

Add code
Dec 19, 2024
Figure 1 for Parallelized Autoregressive Visual Generation
Figure 2 for Parallelized Autoregressive Visual Generation
Figure 3 for Parallelized Autoregressive Visual Generation
Figure 4 for Parallelized Autoregressive Visual Generation
Viaarxiv icon

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Add code
May 31, 2024
Figure 1 for Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Figure 2 for Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Figure 3 for Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Figure 4 for Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Viaarxiv icon

DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models

Add code
May 31, 2024
Figure 1 for DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
Figure 2 for DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
Figure 3 for DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
Figure 4 for DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
Viaarxiv icon

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

Add code
Apr 16, 2024
Figure 1 for LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Figure 2 for LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Figure 3 for LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Figure 4 for LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Viaarxiv icon

Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality

Add code
Mar 28, 2024
Figure 1 for Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality
Figure 2 for Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality
Figure 3 for Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality
Figure 4 for Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality
Viaarxiv icon

TempCompass: Do Video LLMs Really Understand Videos?

Add code
Mar 01, 2024
Figure 1 for TempCompass: Do Video LLMs Really Understand Videos?
Figure 2 for TempCompass: Do Video LLMs Really Understand Videos?
Figure 3 for TempCompass: Do Video LLMs Really Understand Videos?
Figure 4 for TempCompass: Do Video LLMs Really Understand Videos?
Viaarxiv icon