Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:LongVLM: Efficient Long Video Understanding via Large Language Models

Apr 10, 2024

Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, Bohan Zhuang

Figure 1 for LongVLM: Efficient Long Video Understanding via Large Language Models

Figure 2 for LongVLM: Efficient Long Video Understanding via Large Language Models

Figure 3 for LongVLM: Efficient Long Video Understanding via Large Language Models

Figure 4 for LongVLM: Efficient Long Video Understanding via Large Language Models

Share this with someone who'll enjoy it:

Abstract:Empowered by Large Language Models (LLMs), recent advancements in VideoLLMs have driven progress in various video understanding tasks. These models encode video representations through pooling or query aggregation over a vast number of visual tokens, making computational and memory costs affordable. Despite successfully providing an overall comprehension of video content, existing VideoLLMs still face challenges in achieving detailed understanding in videos due to overlooking local information in long-term videos. To tackle this challenge, we introduce LongVLM, a straightforward yet powerful VideoLLM for long video understanding, building upon the observation that long videos often consist of sequential key events, complex actions, and camera movements. Our approach proposes to decompose long videos into multiple short-term segments and encode local features for each local segment via a hierarchical token merging module. These features are concatenated in temporal order to maintain the storyline across sequential short-term segments. Additionally, we propose to integrate global semantics into each local feature to enhance context understanding. In this way, we encode video representations that incorporate both local and global information, enabling the LLM to generate comprehensive responses for long-term videos. Experimental results on the VideoChatGPT benchmark and zero-shot video question-answering datasets demonstrate the superior capabilities of our model over the previous state-of-the-art methods. Qualitative examples demonstrate that our model produces more precise responses for long videos understanding. Code will be available at https://github.com/ziplab/LongVLM.

View paper on

Share this with someone who'll enjoy it:

Title:LongVLM: Efficient Long Video Understanding via Large Language Models

Paper and Code