Gedas Bertasius

DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding

Nov 17, 2025
Via arXiv

Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

Jul 09, 2025
Via arXiv

ExAct: A Video-Language Benchmark for Expert Action Analysis

Jun 06, 2025
Via arXiv

SiLVR: A Simple Language-based Video Reasoning Framework

May 30, 2025
Via arXiv

BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation

Mar 26, 2025
Via arXiv

Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising

Mar 26, 2025
Via arXiv

ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis

Mar 15, 2025
Via arXiv

BIMBA: Selective-Scan Compression for Long-Range Video Question Answering

Mar 13, 2025
Via arXiv

BOSS: Benchmark for Observation Space Shift in Long-Horizon Task

Feb 21, 2025
Via arXiv

TimeRefine: Temporal Grounding with Time Refining Video LLM

Dec 12, 2024
Via arXiv