Title: A Novel Attention-based Aggregation Function to Combine Vision and Language

Apr 27, 2020

Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Figures 1-4 for A Novel Attention-based Aggregation Function to Combine Vision and Language

Abstract: The joint understanding of vision and language has recently gained considerable attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. Because both images and text can be encoded as sets or sequences of elements, such as regions and words, proper reduction functions are needed to transform a set of encoded elements into a single response, such as a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality using a novel variant of cross-attention, and performs a learnable, cross-modal reduction that can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices on the COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.
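To make the idea of an attention-based reduction concrete, the following is a minimal sketch of generic cross-modal attentive pooling. It is not the paper's method: the abstract does not give the formulation, so the scoring rule (mean scaled dot product against the other modality), the absence of learned parameters, the toy vectors, and all function names (`cross_scores`, `attentive_pool`, `cosine`) are assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_scores(elements, context):
    """Score each element by its mean scaled dot product with the other modality.

    Assumption: a parameter-free stand-in for a learned cross-attention scorer.
    """
    d = len(elements[0])
    scale = len(context) * math.sqrt(d)
    return [sum(dot(e, c) for c in context) / scale for e in elements]


def attentive_pool(elements, context):
    """Reduce a set of vectors to one vector, weighted by cross-modal scores."""
    weights = softmax(cross_scores(elements, context))
    d = len(elements[0])
    pooled = [sum(w * e[k] for w, e in zip(weights, elements)) for k in range(d)]
    return pooled, weights


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy inputs (hypothetical): three region vectors and two word vectors.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
words = [[1.0, 0.0], [0.9, 0.1]]

image_vec, region_w = attentive_pool(regions, words)
text_vec, word_w = attentive_pool(words, regions)
score = cosine(image_vec, text_vec)  # one similarity score, usable for ranking
```

In this sketch each modality is pooled using scores conditioned on the other, and the two pooled vectors are compared to obtain a single matching score. A learnable version would replace `cross_scores` with a parameterized scorer, and a classification setting would feed the pooled vectors to a classifier instead of `cosine`.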
