Title: COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Nov 01, 2020

Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox

Abstract: Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext
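The abstract names a cross-modal cycle-consistency loss but does not give its formula. The following is a minimal sketch of one plausible form, assuming a soft nearest-neighbor formulation: each source embedding (say, a sentence) is mapped to a softly weighted neighbor in the target sequence (say, clips), then mapped back to a soft position in the source sequence, and the squared distance between the starting index and that soft position is penalized. The function names, the use of squared Euclidean distance, and the plain-list embeddings are illustrative assumptions and are not taken from the paper or its repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_nearest_neighbor(query, candidates):
    """Weighted average of candidates, weighted by closeness to the query."""
    weights = softmax([-sq_dist(query, c) for c in candidates])
    dim = len(query)
    return [sum(w * c[d] for w, c in zip(weights, candidates)) for d in range(dim)]


def cycle_consistency_loss(source, target):
    """Hypothetical cycle loss: source -> soft neighbor in target -> soft index in source."""
    total = 0.0
    for i, s in enumerate(source):
        neighbor = soft_nearest_neighbor(s, target)
        # Cycle back: soft distribution over source positions.
        back = softmax([-sq_dist(neighbor, t) for t in source])
        mu = sum(w * k for k, w in enumerate(back))  # expected source index
        total += (i - mu) ** 2
    return total / len(source)


# Toy embeddings (assumed values): three well-separated, aligned items per modality.
sentences = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
clips = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
aligned_loss = cycle_consistency_loss(sentences, clips)

# If every clip embedding collapses to the same point, the cycle cannot recover the index.
collapsed_clips = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
collapsed_loss = cycle_consistency_loss(sentences, collapsed_clips)
```

In this toy setup the aligned pair yields a loss close to zero, while the collapsed target yields a clearly larger loss, which is the behavior such a term is meant to reward. A real implementation would operate on batched tensors in a deep learning framework; the repository linked in the abstract is the authoritative reference.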

* 27 pages, 5 figures, 19 tables. To be published in the 34th Conference on Neural Information Processing Systems (NeurIPS 2020). The first two authors contributed equally to this work.