Title: LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding

Feb 25, 2024

Yuxuan Wang, Yueqian Wang, Pengfei Wu, Jianxin Liang, Dongyan Zhao, Zilong Zheng

Abstract: Despite progress in video-language modeling, the computational challenge of interpreting long-form videos in response to task-specific linguistic queries persists, largely due to the complexity of high-dimensional video data and the misalignment between language and visual cues over space and time. To tackle this issue, we introduce a novel approach called Language-guided Spatial-Temporal Prompt Learning (LSTP). This approach features two key components: a Temporal Prompt Sampler (TPS) with an optical flow prior, which leverages temporal information to efficiently extract relevant video content, and a Spatial Prompt Solver (SPS), which captures the intricate spatial relationships between visual and textual elements. By harmonizing TPS and SPS with a cohesive training strategy, our framework significantly enhances computational efficiency, temporal understanding, and spatial-temporal alignment. Empirical evaluations across two challenging tasks (video question answering and temporal question grounding in videos), using a variety of video-language pretraining (VLP) models and large language models (LLMs), demonstrate the superior performance, speed, and versatility of our proposed LSTP paradigm.
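The abstract does not say how TPS scores or selects frames, so the following is only a toy sketch of the general idea of query-guided frame sampling with a motion prior. It is not the authors' implementation. The function names, the cosine-similarity relevance score, the per-frame `motion` values (standing in for an optical-flow magnitude), and the mixing weight `alpha` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sample_frames(frame_feats, query_feat, motion, k, alpha=0.5):
    """Pick k frame indices, returned in temporal order.

    Hypothetical scoring rule (an assumption, not taken from the paper):
    score = (1 - alpha) * relevance to the query + alpha * normalized motion.
    `motion` holds one non-negative number per frame, e.g. a mean
    optical-flow magnitude computed elsewhere.
    """
    if len(frame_feats) != len(motion):
        raise ValueError("frame_feats and motion must have the same length")
    if not frame_feats:
        return []
    max_motion = max(motion) or 1.0  # avoid dividing by zero on static clips
    scores = [
        (1 - alpha) * cosine(feat, query_feat) + alpha * (m / max_motion)
        for feat, m in zip(frame_feats, motion)
    ]
    # Take the k highest-scoring frames, then restore temporal order.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 1.0]]  # toy frame features
    query = [0.0, 1.0]                                          # toy query feature
    flow = [0.0, 0.1, 0.0, 1.0]                                 # toy motion magnitudes
    print(sample_frames(frames, query, flow, k=2))
```

Selecting a small subset of frames before any heavier spatial reasoning is one plausible way a sampler could reduce computation on long videos. In a real system the features and motion values would come from learned encoders and an optical-flow estimator rather than hand-written lists.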
