
Michael S. Ryoo

What's in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning

Nov 22, 2024

Adaptive Caching for Faster Video Generation with Diffusion Transformers

Nov 04, 2024

xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

Oct 21, 2024

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Jun 28, 2024

Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA

Jun 17, 2024

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Apr 11, 2024

Understanding Long Videos in One Multimodal Language Model Pass

Mar 25, 2024

Language Repository for Long Video Understanding

Mar 21, 2024

Diffusion Illusions: Hiding Images in Plain Sight

Dec 06, 2023

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

Nov 13, 2023