Title: RTQ: Rethinking Video-language Understanding Based on Image-text Model

Dec 01, 2023

Xiao Wang, Yaoyu Li, Tian Gan, Zheng Zhang, Jingjing Lv, Liqiang Nie

Abstract: Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos. However, video-language understanding presents unique challenges due to the inclusion of highly complex semantic details, which result in information redundancy, temporal dependency, and scene complexity. Current techniques have only partially tackled these issues, and our quantitative analysis indicates that some of these methods are complementary. In light of this, we propose a novel framework called RTQ (Refine, Temporal model, and Query), which addresses these challenges simultaneously. The approach involves refining redundant information within frames, modeling temporal relations among frames, and querying task-specific information from the videos. Remarkably, our model demonstrates outstanding performance even in the absence of video-language pre-training, and the results are comparable with or superior to those achieved by state-of-the-art pre-training methods.
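The three stages named in the abstract (refine, temporal model, query) can be pictured with a toy NumPy sketch. This is only an illustration of the data flow under stated assumptions, not the authors' implementation: the function names, the distance-from-mean token scoring, the channel-shift temporal mixing, and the single-head attention readout are all stand-ins chosen for brevity, and the paper's actual components may differ.

```python
import numpy as np

def refine(frames, keep):
    """Keep the `keep` patch tokens per frame that deviate most from the frame mean.

    Assumption: deviation from the mean is a crude proxy for "non-redundant".
    frames has shape (T, N, D): frames, tokens per frame, feature size.
    """
    mean = frames.mean(axis=1, keepdims=True)
    score = np.linalg.norm(frames - mean, axis=-1)           # (T, N)
    idx = np.sort(np.argsort(-score, axis=1)[:, :keep], 1)   # (T, keep), original order
    return frames[np.arange(frames.shape[0])[:, None], idx]  # (T, keep, D)

def temporal_model(tokens, shift_frac=0.25):
    """Mix information across frames by shifting a fraction of channels in time.

    Assumption: a simple channel shift stands in for temporal modeling.
    """
    T, N, D = tokens.shape
    k = int(D * shift_frac)
    out = tokens.copy()
    out[1:, :, :k] = tokens[:-1, :, :k]           # first k channels from previous frame
    out[0, :, :k] = 0.0
    out[:-1, :, k:2 * k] = tokens[1:, :, k:2 * k]  # next k channels from following frame
    out[-1, :, k:2 * k] = 0.0
    return out

def query(tokens, queries):
    """Read task-specific features from all video tokens with softmax attention."""
    T, N, D = tokens.shape
    flat = tokens.reshape(T * N, D)
    logits = queries @ flat.T / np.sqrt(D)         # (Q, T*N)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ flat                          # (Q, D)

rng = np.random.default_rng(0)
video = rng.normal(size=(4, 16, 8))   # 4 frames, 16 patch tokens, 8 features (made-up sizes)
task_queries = rng.normal(size=(3, 8))
out = query(temporal_model(refine(video, keep=6)), task_queries)
print(out.shape)  # (3, 8)
```

In this sketch each stage only changes what the next one sees: refinement shrinks the token set, temporal mixing lets each remaining token carry neighbouring-frame features, and the query step pools everything into a fixed number of task vectors.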

* In International Conference on Multimedia. ACM, 557-566 (2023)
* Accepted by ACM MM 2023 as an oral presentation
