Title: Collaborative Three-Stream Transformers for Video Captioning

Sep 18, 2023

Hao Wang, Libo Zhang, Heng Fan, Tiejian Luo

Abstract: As the most critical components of a sentence, the subject, predicate, and object require special attention in the video captioning task. To implement this idea, we design a novel framework, named COllaborative three-Stream Transformers (COST), that models the three parts separately and lets them complement each other for better representation. Specifically, COST is formed by three transformer branches that exploit visual-linguistic interactions of different granularities in the spatial-temporal domain: between videos and text, between detected objects and text, and between actions and text. Meanwhile, we propose a cross-granularity attention module to align the interactions modeled by the three branches, so that the branches can support each other in exploiting the most discriminative semantic information of different granularities for accurate caption prediction. The whole model is trained in an end-to-end fashion. Extensive experiments on three large-scale challenging datasets, i.e., YouCookII, ActivityNet Captions, and MSVD, demonstrate that the proposed method performs favorably against state-of-the-art methods.
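The idea of three feature streams that are aligned by attending to one another can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the module's internals, so the function names (`cross_attention`, `cross_granularity_fuse`), the feature dimensions, the residual fusion rule, and the omission of learned projection matrices are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    """Scaled dot-product attention: each query row attends over context rows.

    Learned query/key/value projections are omitted here (an assumption
    made to keep the sketch short).
    """
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)   # (n_query, n_context)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ context, weights

def cross_granularity_fuse(streams):
    """Hypothetical fusion: refine each stream by attending to the other two."""
    fused = {}
    for name, feats in streams.items():
        others = np.concatenate(
            [f for n, f in streams.items() if n != name], axis=0
        )
        attended, _ = cross_attention(feats, others)
        fused[name] = feats + attended        # residual connection (assumed)
    return fused

# Toy features standing in for the video, object, and action branches.
rng = np.random.default_rng(0)
streams = {
    "video": rng.normal(size=(8, 16)),    # 8 tokens, 16-dim (illustrative)
    "object": rng.normal(size=(5, 16)),
    "action": rng.normal(size=(3, 16)),
}
fused = cross_granularity_fuse(streams)
```

Each stream keeps its own token count after fusion, so the three branches could continue to be processed separately while sharing information across granularities.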

* Accepted by CVIU
