Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models

Oct 15, 2024

Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma

Figure 1 for VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models

Figure 2 for VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models

Figure 3 for VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models

Figure 4 for VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models

Share this with someone who'll enjoy it:

Abstract:Video-based multimodal large language models (Video-LLMs) possess significant potential for video understanding tasks. However, most Video-LLMs treat videos as a sequential set of individual frames, which results in insufficient temporal-spatial interaction that hinders fine-grained comprehension and difficulty in processing longer videos due to limited visual token capacity. To address these challenges, we propose VidCompress, a novel Video-LLM featuring memory-enhanced temporal compression. VidCompress employs a dual-compressor approach: a memory-enhanced compressor captures both short-term and long-term temporal relationships in videos and compresses the visual tokens using a multiscale transformer with a memory-cache mechanism, while a text-perceived compressor generates condensed visual tokens by utilizing Q-Former and integrating temporal contexts into query embeddings with cross attention. Experiments on several VideoQA datasets and comprehensive benchmarks demonstrate that VidCompress efficiently models complex temporal-spatial relations and significantly outperforms existing Video-LLMs.

* 9 pages, 4 figures

View paper on

Share this with someone who'll enjoy it:

Title:VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models

Paper and Code