Wenhai Wang

RP-CATE: Recurrent Perceptron-based Channel Attention Transformer Encoder for Industrial Hybrid Modeling

Dec 22, 2025

ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution

Oct 14, 2025

ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

Sep 18, 2025

GenExam: A Multidisciplinary Text-to-Image Exam

Sep 17, 2025

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Aug 25, 2025

CoMemo: LVLMs Need Image Context with Image Memory

Jun 06, 2025

OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis

Jun 04, 2025

ZeroGUI: Automating Online GUI Learning at Zero Human Cost

May 29, 2025

Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings

May 29, 2025

EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models

May 28, 2025