Sihan Yang

GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models

Dec 17, 2025

MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

Dec 11, 2025

The Unseen Bias: How Norm Discrepancy in Pre-Norm MLLMs Leads to Visual Information Loss

Dec 09, 2025

Improving Alignment in LVLMs with Debiased Self-Judgment

Aug 28, 2025

Small-Large Collaboration: Training-efficient Concept Personalization for Large VLM using a Meta Personalized Small VLM

Aug 10, 2025

VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization

Aug 07, 2025

VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks

Jun 10, 2025

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

May 29, 2025

RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction

May 28, 2025

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

May 27, 2025