Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Akhilesh Bharadwaj

Exploration of Visual Features and their weighted-additive fusion for Video Captioning

Jan 14, 2021

Praveen S V, Akhilesh Bharadwaj, Harsh Raj, Janhavi Dadhania, Ganesh Samarth C. A, Nikhil Pareek, S R M Prasanna

Figure 1 for Exploration of Visual Features and their weighted-additive fusion for Video Captioning

Figure 2 for Exploration of Visual Features and their weighted-additive fusion for Video Captioning

Figure 3 for Exploration of Visual Features and their weighted-additive fusion for Video Captioning

Figure 4 for Exploration of Visual Features and their weighted-additive fusion for Video Captioning

Abstract:Video captioning is a popular task that challenges models to describe events in videos using natural language. In this work, we investigate the ability of various visual feature representations derived from state-of-the-art convolutional neural networks to capture high-level semantic context. We introduce the Weighted Additive Fusion Transformer with Memory Augmented Encoders (WAFTM), a captioning model that incorporates memory in a transformer encoder and uses a novel method, to fuse features, that ensures due importance is given to more significant representations. We illustrate a gain in performance realized by applying Word-Piece Tokenization and a popular REINFORCE algorithm. Finally, we benchmark our model on two datasets and obtain a CIDEr of 92.4 on MSVD and a METEOR of 0.091 on the ActivityNet Captions Dataset.

* 6 pages

Via

Access Paper or Ask Questions