Title: Global2Local: A Joint-Hierarchical Attention for Video Captioning

Mar 13, 2022

Chengpeng Dai, Fuhai Chen, Xiaoshuai Sun, Rongrong Ji, Qixiang Ye, Yongjian Wu

Abstract: Automatic video captioning has recently attracted increasing attention. Its core challenge lies in capturing the key semantic items, such as objects and actions, along with their spatial-temporal correlations, from redundant frames and semantic content. To this end, existing works select either the key video clips at a global level (across multiple frames) or the key regions within each frame; both, however, neglect the hierarchical order, i.e., key frames first and key regions later. In this paper, we propose a novel joint-hierarchical attention model for video captioning, which embeds the key clips, the key frames, and the key regions jointly into the captioning model in a hierarchical manner. The model first conducts a global selection to identify key frames, followed by a Gumbel sampling operation to further identify key regions based on the key frames, achieving an accurate global-to-local feature representation to guide the captioning. Extensive quantitative evaluations on two public benchmark datasets, MSVD and MSR-VTT, demonstrate the superiority of the proposed method over state-of-the-art methods.
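
The global-to-local order described in the abstract (key frames first, then key regions within those frames) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the top-k rule for frame selection, and the use of the plain Gumbel-max trick are all assumptions made for illustration. A trainable model would more likely use learned attention scores and a differentiable relaxation of the sampling step.

```python
import math
import random

def select_key_frames(frame_scores, k):
    """Global step (assumed top-k rule): indices of the k highest-scoring frames, in temporal order."""
    ranked = sorted(range(len(frame_scores)), key=lambda i: frame_scores[i], reverse=True)
    return sorted(ranked[:k])

def gumbel_max_sample(logits, rng):
    """Local step: draw one index with probability softmax(logits) via the Gumbel-max trick."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append(logit + gumbel_noise)
    return max(range(len(noisy)), key=lambda i: noisy[i])

def global_to_local(frame_scores, region_logits, k, seed=0):
    """Pick k key frames, then sample one key region inside each selected frame."""
    rng = random.Random(seed)
    key_frames = select_key_frames(frame_scores, k)
    return [(f, gumbel_max_sample(region_logits[f], rng)) for f in key_frames]

# Hypothetical toy input: 4 frames, 3 candidate regions per frame.
frame_scores = [0.1, 2.0, 0.3, 1.5]
region_logits = [
    [0.0, 0.0, 0.0],
    [1.0, 0.2, -0.5],
    [0.0, 0.0, 0.0],
    [-1.0, 0.5, 2.0],
]
picks = global_to_local(frame_scores, region_logits, k=2)
print(picks)  # list of (frame_index, region_index) pairs
```

With these toy scores, frames 1 and 3 are kept; the region chosen inside each frame depends on the random seed, with higher-logit regions being more likely.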
