Wenhai Wang

VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Mar 13, 2025

MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning
Mar 10, 2025

OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Feb 25, 2025

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
Jan 21, 2025

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
Dec 20, 2024

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
Dec 12, 2024

DenseVLM: A Retrieval and Decoupled Alignment Framework for Open-Vocabulary Dense Prediction
Dec 09, 2024

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
Dec 06, 2024

MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost
Dec 02, 2024

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Nov 15, 2024