Pei Fu

Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA

Apr 15, 2026

Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models

Mar 31, 2026

IMTBench: A Multi-Scenario Cross-Modal Collaborative Evaluation Benchmark for In-Image Machine Translation

Mar 11, 2026

EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models

Feb 27, 2026

PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration

Feb 22, 2026

GAIA: A Data Flywheel System for Training GUI Test-Time Scaling Critic Models

Jan 26, 2026

Xiaomi MiMo-VL-Miloco Technical Report

Dec 22, 2025

HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration

Oct 31, 2025

BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent

Sep 19, 2025

Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding

Mar 18, 2025