Xiaohan Wang

Video Action Differencing

Mar 10, 2025

SurgiSAM2: Fine-tuning a foundational model for surgical video anatomy segmentation and detection

Mar 05, 2025

Temporal Preference Optimization for Long-Form Video Understanding

Jan 23, 2025

BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature

Jan 14, 2025

Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

Jan 06, 2025

DeepSeek-V3 Technical Report

Dec 27, 2024

Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

Dec 17, 2024

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Dec 13, 2024

Targeted Learning for Variable Importance

Nov 04, 2024

Zero-shot Action Localization via the Confidence of Large Vision-Language Models

Oct 18, 2024