Title: VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

Mar 18, 2024

Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li

Abstract: We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism can tackle the challenging problem of video understanding, especially capturing long-term temporal relations in lengthy videos. The proposed multimodal agent, VideoAgent, 1) constructs a structured memory that stores both generic temporal event descriptions and object-centric tracking states of the video; and 2) given an input task query, uses tools including video segment localization and object memory querying, along with other visual foundation models, to solve the task interactively, leveraging the zero-shot tool-use ability of LLMs. VideoAgent demonstrates strong performance on several long-horizon video understanding benchmarks, with an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-source models and private counterparts including Gemini 1.5 Pro.

* Project page: videoagent.github.io; First two authors contributed equally
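
The two-part design described in the abstract (a structured memory plus tools that query it) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names `SegmentEvent`, `ObjectTrack`, and `VideoMemory`, the method names, and the word-overlap ranking are all hypothetical stand-ins chosen for illustration. A real system would presumably rank segments with a learned similarity model and populate the memory with captioning and tracking models.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SegmentEvent:
    """Temporal memory entry: a caption for one video segment (hypothetical schema)."""
    segment_id: int
    start_s: float
    end_s: float
    caption: str


@dataclass
class ObjectTrack:
    """Object memory entry: one tracked object and the segments it appears in."""
    object_id: int
    category: str
    segment_ids: List[int] = field(default_factory=list)


@dataclass
class VideoMemory:
    """Holds both memory components and exposes two query tools."""
    events: List[SegmentEvent] = field(default_factory=list)
    objects: List[ObjectTrack] = field(default_factory=list)

    def localize_segments(self, query: str, top_k: int = 1) -> List[SegmentEvent]:
        # Assumption: plain word overlap stands in for a learned text/video similarity.
        words = set(query.lower().split())
        ranked = sorted(
            self.events,
            key=lambda e: len(words & set(e.caption.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]

    def query_objects(self, category: str) -> List[ObjectTrack]:
        # Return every tracked object of the requested category.
        return [o for o in self.objects if o.category == category]


# Toy memory with two captioned segments and two tracked objects.
memory = VideoMemory(
    events=[
        SegmentEvent(0, 0.0, 2.0, "a person opens the fridge"),
        SegmentEvent(1, 2.0, 4.0, "a person pours milk into a cup"),
    ],
    objects=[ObjectTrack(0, "cup", [1]), ObjectTrack(1, "fridge", [0])],
)

best = memory.localize_segments("who pours milk")[0]
cups = memory.query_objects("cup")
print(best.segment_id, [o.object_id for o in cups])
```

In an agent loop, an LLM would decide which of these tools to call for a given question and compose the returned segments and object records into an answer; that orchestration step is omitted here.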