Title: LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval

Nov 21, 2024

Weiheng Lu, Jian Li, An Yu, Ming-Ching Chang, Shengpeng Ji, Min Xia

Abstract: Multimodal Large Language Models (MLLMs) are widely used for visual perception, understanding, and reasoning. However, long video processing and precise moment retrieval remain challenging due to LLMs' limited context size and coarse frame extraction. We propose the Large Language-and-Vision Assistant for Moment Retrieval (LLaVA-MR), which enables accurate moment retrieval and contextual grounding in videos using MLLMs. LLaVA-MR combines Dense Frame and Time Encoding (DFTE) for spatial-temporal feature extraction, Informative Frame Selection (IFS) for capturing brief visual and motion patterns, and Dynamic Token Compression (DTC) to manage LLM context limitations. Evaluations on benchmarks such as Charades-STA and QVHighlights demonstrate that LLaVA-MR outperforms 11 state-of-the-art methods, achieving an improvement of 1.82% in R1@0.5 and 1.29% in mAP@0.5 on the QVHighlights dataset. Our implementation will be open-sourced upon acceptance.
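For readers unfamiliar with the R1@0.5 figure quoted in the abstract, the following is a minimal sketch of how Recall@1 at a temporal IoU threshold is commonly computed for moment retrieval. It is not the authors' evaluation code. The function names `temporal_iou` and `recall_at_1` are hypothetical, and the sketch assumes one ground-truth span and one top-ranked prediction per query, which is a simplification (some benchmarks annotate several spans per query).

```python
from typing import List, Tuple

# A moment is a (start_seconds, end_seconds) pair. This representation is an
# assumption made for illustration, not something specified by the paper.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time intervals."""
    # Overlap is clamped at zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: List[Span], gts: List[Span], threshold: float = 0.5) -> float:
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    # Three hypothetical queries: exact match, weak overlap, partial overlap.
    predictions = [(0.0, 10.0), (0.0, 10.0), (2.0, 8.0)]
    ground_truth = [(0.0, 10.0), (5.0, 15.0), (0.0, 10.0)]
    print(recall_at_1(predictions, ground_truth, threshold=0.5))
```

In this toy input the three IoU values are 1.0, 1/3, and 0.6, so two of the three queries count as hits at the 0.5 threshold.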
