Video understanding has become increasingly important with the rise of multimodal applications. Understanding continuous video poses considerable challenges due to the rapid growth of streaming video, which contains multi-scale and untrimmed events. We introduce a novel system, C-VUE, to address these challenges through adaptive state modeling. C-VUE has three key designs. The first is a long-range history modeling technique that uses a video-aware approach to retain historical video information. The second is a spatial redundancy reduction technique, which improves the efficiency of history modeling by exploiting temporal relations. The third is a parallel training structure that incorporates a frame-weighted loss to capture multi-scale events in long videos. C-VUE achieves both high accuracy and high efficiency: it runs at >30 FPS on typical edge devices and outperforms all baselines in accuracy. Moreover, applying C-VUE as the video encoder of a video foundation model in our case study yielded a 0.46-point improvement (on a 5-point scale) on the in-distribution dataset and improvements ranging from 1.19\% to 4\% on zero-shot datasets.
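To make the frame-weighted loss concrete, the minimal sketch below shows one plausible form: per-frame prediction losses combined under per-frame weights (e.g., up-weighting frames belonging to short or boundary events so that multi-scale events contribute more evenly). The function name, weighting scheme, and tensor shapes are illustrative assumptions for exposition, not the exact formulation used in C-VUE.

\begin{verbatim}
import torch
import torch.nn.functional as F

def frame_weighted_loss(logits, labels, frame_weights):
    """Illustrative frame-weighted loss (assumed form, not C-VUE's exact one).

    logits:        (T, C) per-frame class scores for a clip of T frames
    labels:        (T,)   per-frame ground-truth class indices
    frame_weights: (T,)   non-negative per-frame weights, e.g. larger for
                          frames from short-scale events or near boundaries
    """
    per_frame = F.cross_entropy(logits, labels, reduction="none")  # (T,)
    weights = frame_weights / frame_weights.sum().clamp(min=1e-8)  # normalize
    return (weights * per_frame).sum()

# Toy usage: 8 frames, 5 classes, weights emphasizing the last two frames.
logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
weights = torch.tensor([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0])
loss = frame_weighted_loss(logits, labels, weights)
\end{verbatim}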