Abstract: We present Answer-Me, a task-aware multi-task framework that unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning. In contrast to previous works that rely on contrastive or generative captioning training, we propose a simple, novel recipe for pre-training a vision-language joint model that is itself multi-task. The pre-training uses only noisy image captioning data and is formulated to train the entire architecture end-to-end, with both a strong language encoder and decoder. Our results show state-of-the-art performance, zero-shot generalization, robustness to forgetting, and competitive single-task results across a variety of question answering tasks. Our multi-task mixture training learns from tasks of various question intents and thus generalizes better, including on zero-shot vision-language tasks. We conduct experiments in the challenging multi-task and open-vocabulary settings across a variety of datasets and tasks, such as VQA2.0, SNLI-VE, NLVR2, GQA, and VizWiz. We observe that the proposed approach generalizes to unseen tasks and that more diverse mixtures lead to higher accuracy on both known and novel tasks.
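To make the multi-task mixture training concrete, below is a minimal, self-contained Python sketch of weighted task sampling. The task names, mixture weights, and the sample_task/train_loop helpers are illustrative assumptions rather than the paper's actual implementation; in the real model, each sampled batch would be passed through the shared vision-language encoder-decoder and supervised with a single open-vocabulary text-generation loss.

```python
import random

# Illustrative task mixture; the paper's actual datasets and sampling
# ratios are not specified here.
MIXTURE = {"vqa": 0.5, "visual_entailment": 0.25, "visual_reasoning": 0.25}

def sample_task(rng: random.Random) -> str:
    """Draw a task name proportionally to its mixture weight."""
    names = list(MIXTURE)
    weights = [MIXTURE[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

def train_loop(num_steps: int = 8, seed: int = 0) -> None:
    rng = random.Random(seed)
    for step in range(num_steps):
        task = sample_task(rng)
        # In the real model, a batch from `task` would be encoded jointly with
        # its image and question, and the answer would be decoded as free-form
        # text, so every task shares one generation head and loss.
        print(f"step {step}: sampled task '{task}'")

if __name__ == "__main__":
    train_loop()
```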
Abstract: We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks, including referring expression comprehension, text-based localization, and object detection. Key to our architecture is an efficient multi-scale fusion module that unifies the disparate localization requirements across the tasks. In addition, we discover that a standard object detector is surprisingly effective in unifying these tasks, without the need for task-specific design, losses, or pre-computed detections. Our end-to-end trainable framework responds flexibly and accurately to a wide range of referring expression, localization, or detection queries for zero, one, or multiple objects. Jointly trained on these tasks, FindIt outperforms the state of the art on both referring expression comprehension and text-based localization, and shows competitive performance on object detection. Finally, FindIt generalizes better to out-of-distribution data and novel categories than strong single-task baselines. All of this is accomplished by a single, unified, and efficient model. The code will be released.
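As an illustration of the kind of image-text fusion the abstract refers to, here is a hedged PyTorch sketch of a multi-scale fusion module built from cross-attention. The class name MultiScaleFusion, the dimensions, and the residual design are assumptions made for this sketch; the actual FindIt fusion module may be structured quite differently.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Fuses text-query features into each level of an image feature pyramid."""

    def __init__(self, dim: int = 256, num_heads: int = 8, num_levels: int = 4):
        super().__init__()
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_levels)
        )

    def forward(self, image_feats, text_feats):
        """
        image_feats: list of (B, H_l*W_l, dim) tensors, one per pyramid level.
        text_feats:  (B, T, dim) token embeddings of the query text.
        Returns fused features with the same shapes as image_feats.
        """
        fused = []
        for feats, attn in zip(image_feats, self.attn):
            # Image locations attend to the text tokens, conditioning each
            # pyramid level on the language query before a detector head.
            out, _ = attn(query=feats, key=text_feats, value=text_feats)
            fused.append(feats + out)  # residual connection
        return fused

# Toy usage with random tensors.
if __name__ == "__main__":
    fusion = MultiScaleFusion()
    imgs = [torch.randn(2, n, 256) for n in (64, 16, 4, 1)]
    txt = torch.randn(2, 10, 256)
    outs = fusion(imgs, txt)
    print([o.shape for o in outs])
```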
Abstract: Self-attention has recently been adopted for a wide range of sequence modeling problems. Despite its effectiveness, self-attention suffers from quadratic compute and memory requirements with respect to sequence length. Successful approaches to reducing this complexity have focused on attending to local sliding windows or a small set of locations independent of content. Our work proposes learning dynamic sparse attention patterns that avoid allocating computation and memory to content unrelated to the query of interest. This work builds on two lines of research: it combines the modeling flexibility of prior work on content-based sparse attention with the efficiency gains of approaches based on local, temporal sparse attention. Our model, the Routing Transformer, endows self-attention with a sparse routing module based on online k-means, reducing the overall complexity of attention from $O\left(n^2d\right)$ to $O\left(n^{1.5}d\right)$ for sequence length $n$ and hidden dimension $d$. We show that our model outperforms comparable sparse attention models on language modeling on Wikitext-103 (15.8 vs. 18.3 perplexity) as well as on image generation on ImageNet-64 (3.43 vs. 3.44 bits/dim), while using fewer self-attention layers.
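The routing idea can be sketched as follows: queries and keys are assigned to k-means style clusters (updated online in the paper), and each query attends only to keys routed to the same cluster. The toy below builds an explicit n-by-n mask for clarity and omits causal masking, so it does not realize the $O\left(n^{1.5}d\right)$ cost; the actual Routing Transformer sorts positions into equal-size cluster buckets to achieve it. All names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def routing_attention(q, k, v, centroids):
    """
    q, k, v:    (B, n, d) query/key/value tensors.
    centroids:  (c, d) cluster centroids (maintained by online k-means in the paper).
    """
    # Assign each query and key to its nearest centroid by cosine similarity.
    cn = F.normalize(centroids, dim=-1)
    q_clusters = (F.normalize(q, dim=-1) @ cn.T).argmax(-1)  # (B, n)
    k_clusters = (F.normalize(k, dim=-1) @ cn.T).argmax(-1)  # (B, n)

    # A query may attend to a key only if both were routed to the same cluster.
    same_cluster = q_clusters.unsqueeze(-1) == k_clusters.unsqueeze(-2)  # (B, n, n)

    scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~same_cluster, float("-inf"))
    # Guard against queries whose cluster contains no keys in this toy setup.
    weights = torch.softmax(scores, dim=-1).nan_to_num(0.0)
    return weights @ v

if __name__ == "__main__":
    B, n, d, c = 2, 64, 32, 8  # c ~ sqrt(n) clusters
    q, k, v = (torch.randn(B, n, d) for _ in range(3))
    centroids = torch.randn(c, d)
    out = routing_attention(q, k, v, centroids)
    print(out.shape)  # torch.Size([2, 64, 32])
```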