Serena Yeung-Levy

MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

Mar 17, 2025
Via arXiv

Video Action Differencing

Mar 10, 2025
Via arXiv

SurgiSAM2: Fine-tuning a foundational model for surgical video anatomy segmentation and detection

Mar 05, 2025
Via arXiv

Temporal Preference Optimization for Long-Form Video Understanding

Jan 23, 2025
Via arXiv

BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature

Jan 14, 2025
Via arXiv

Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

Jan 06, 2025
Via arXiv

Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

Dec 17, 2024
Via arXiv

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Dec 13, 2024
Via arXiv

DeforHMR: Vision Transformer with Deformable Cross-Attention for 3D Human Mesh Recovery

Nov 18, 2024
Via arXiv

Motion Diffusion-Guided 3D Global HMR from a Dynamic Camera

Nov 15, 2024
Via arXiv