Title: Understanding Long Videos in One Multimodal Language Model Pass

Mar 25, 2024

Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael S. Ryoo

Abstract: Large Language Models (LLMs), known to contain a strong awareness of world knowledge, have allowed recent approaches to achieve excellent performance on long-video understanding benchmarks, but at high inference costs. In this work, we first propose Likelihood Selection, a simple technique that unlocks faster inference in autoregressive LLMs for the multiple-choice tasks common in long-video benchmarks. In addition to faster inference, we find that the resulting models yield surprisingly good accuracy on long-video tasks, even with no video-specific information. Building on this, we inject video-specific, object-centric information extracted from off-the-shelf pre-trained models and use natural language as the medium for information fusion. Our resulting Multimodal Video Understanding (MVU) framework demonstrates state-of-the-art performance across long-video and fine-grained action recognition benchmarks. Code available at: https://github.com/kahnchana/mvu
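
The abstract does not spell out how Likelihood Selection works, so the following is only a minimal sketch of the general idea its name suggests: instead of generating an answer token by token, score each multiple-choice option by the log-likelihood the model assigns to it given the prompt, then pick the highest-scoring option. The names `option_log_likelihood`, `likelihood_select`, `toy_logprob`, and the bigram table are hypothetical and are not taken from the authors' code; the toy scorer stands in for a real autoregressive LLM, and the length normalization is an assumption rather than something the abstract states.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: returns log P(token | context) for one next token.
LogProbFn = Callable[[Sequence[str], str], float]


def option_log_likelihood(
    logprob: LogProbFn,
    prompt_tokens: Sequence[str],
    option_tokens: Sequence[str],
    normalize: bool = True,
) -> float:
    """Sum log P(option token | prompt + earlier option tokens).

    With a real LLM these terms can be read off a single teacher-forced
    forward pass over prompt + option, so no iterative decoding is needed.
    """
    context = list(prompt_tokens)
    total = 0.0
    for tok in option_tokens:
        total += logprob(context, tok)
        context.append(tok)
    # Assumption: average per token so longer options are not penalized.
    if normalize and option_tokens:
        return total / len(option_tokens)
    return total


def likelihood_select(
    logprob: LogProbFn,
    prompt_tokens: Sequence[str],
    options: Sequence[Sequence[str]],
) -> Tuple[int, List[float]]:
    """Return the index of the most likely option and all option scores."""
    scores = [option_log_likelihood(logprob, prompt_tokens, o) for o in options]
    best = max(range(len(options)), key=scores.__getitem__)
    return best, scores


# Toy stand-in for an LLM: a tiny bigram table (illustrative values only).
TOY_BIGRAMS = {
    ("is", "cooking"): 0.6,
    ("is", "running"): 0.3,
    ("is", "sleeping"): 0.1,
}


def toy_logprob(context: Sequence[str], token: str) -> float:
    prev = context[-1] if context else ""
    # Unseen pairs get a small floor probability.
    return math.log(TOY_BIGRAMS.get((prev, token), 0.01))


prompt = "the person is".split()
options = [["running"], ["cooking"], ["sleeping"]]
best, scores = likelihood_select(toy_logprob, prompt, options)
print(best, scores)
```

In this toy setup the second option ("cooking") is selected because the bigram table gives it the highest probability after "is". With a real model, `toy_logprob` would be replaced by log-probabilities taken from the model's output distribution at each option-token position.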

* 24 pages
