Picture for Michael S. Ryoo

Michael S. Ryoo

Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning

Add code
Nov 22, 2024
Figure 1 for Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning
Figure 2 for Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning
Figure 3 for Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning
Figure 4 for Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning
Viaarxiv icon

Adaptive Caching for Faster Video Generation with Diffusion Transformers

Add code
Nov 04, 2024
Viaarxiv icon

xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

Add code
Oct 21, 2024
Figure 1 for xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Figure 2 for xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Figure 3 for xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Figure 4 for xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Viaarxiv icon

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Add code
Jun 28, 2024
Figure 1 for LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Figure 2 for LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Figure 3 for LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Figure 4 for LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Viaarxiv icon

Too Many Frames, not all Useful:Efficient Strategies for Long-Form Video QA

Add code
Jun 17, 2024
Viaarxiv icon

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Add code
Apr 11, 2024
Viaarxiv icon

Understanding Long Videos in One Multimodal Language Model Pass

Add code
Mar 25, 2024
Viaarxiv icon

Language Repository for Long Video Understanding

Add code
Mar 21, 2024
Viaarxiv icon

Diffusion Illusions: Hiding Images in Plain Sight

Add code
Dec 06, 2023
Viaarxiv icon

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

Add code
Nov 13, 2023
Viaarxiv icon