Fei-Fei Li

DDRprog: A CLEVR Differentiable Dynamic Reasoning Programmer

Mar 30, 2018

Joseph Suarez, Justin Johnson, Fei-Fei Li

Abstract: We present a novel Dynamic Differentiable Reasoning (DDR) framework for jointly learning branching programs and the functions composing them; this resolves a significant nondifferentiability inhibiting recent dynamic architectures. We apply our framework to two settings in two highly compact and data-efficient architectures: DDRprog for CLEVR Visual Question Answering and DDRstack for reverse Polish notation (RPN) expression evaluation. DDRprog uses a recurrent controller to jointly predict and execute modular neural programs that directly correspond to the underlying question logic; it explicitly forks subprocesses to handle logical branching. By effectively leveraging additional structural supervision, we achieve a large improvement over previous approaches in subtask consistency and a small improvement in overall accuracy. We further demonstrate the benefits of structural supervision in the RPN setting: the inclusion of a stack assumption in DDRstack allows our approach to generalize to long expressions where an LSTM fails the task.
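For readers unfamiliar with the RPN task mentioned in the abstract, the following is a minimal sketch of classical stack-based RPN evaluation. It is not DDRstack or any part of the authors' model; it only illustrates the kind of stack structure the abstract's "stack assumption" refers to. The function name `eval_rpn` and the choice of four arithmetic operators are assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical operator table; the paper's actual operator set is not stated here.
OPS: Dict[str, Callable[[float, float], float]] = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def eval_rpn(tokens: List[str]) -> float:
    """Evaluate a reverse Polish notation expression with an explicit stack."""
    stack: List[float] = []
    for tok in tokens:
        if tok in OPS:
            # An operator consumes the two most recent operands.
            if len(stack) < 2:
                raise ValueError("operator %r is missing operands" % tok)
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            # Anything else is treated as a numeric operand.
            stack.append(float(tok))
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]


# "3 4 + 2 *" corresponds to (3 + 4) * 2.
result = eval_rpn("3 4 + 2 *".split())
```

The stack depth grows and shrinks with the expression, which is why a fixed-size recurrent state can struggle on long inputs while a model given an explicit stack need not.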

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Feb 23, 2016

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma (+2 more)

Abstract: Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?", computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) in order to answer correctly that "the person is riding a horse-drawn carriage". In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 100K images where each image has an average of 21 objects, 18 attributes, and 18 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and question-answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answers.

* 44 pages, 37 figures
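The relationship notation in the abstract, riding(man, carriage) and pulling(horse, carriage), can be read as predicate triples over objects. The sketch below encodes those two triples and answers the abstract's example question by chaining them. The `Relationship` class and the helper names are hypothetical and do not reflect the Visual Genome data format or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Relationship:
    """A pairwise relationship predicate(subject, obj); illustrative only."""
    predicate: str
    subject: str
    obj: str


# The two relationships named in the abstract.
RELATIONSHIPS = [
    Relationship("riding", "man", "carriage"),
    Relationship("pulling", "horse", "carriage"),
]


def objects_of(predicate: str, subject: str, rels: List[Relationship]) -> List[str]:
    """Return every obj such that predicate(subject, obj) holds."""
    return [r.obj for r in rels if r.predicate == predicate and r.subject == subject]


def subjects_of(predicate: str, obj: str, rels: List[Relationship]) -> List[str]:
    """Return every subject such that predicate(subject, obj) holds."""
    return [r.subject for r in rels if r.predicate == predicate and r.obj == obj]


# "What vehicle is the person riding?"
vehicle = objects_of("riding", "man", RELATIONSHIPS)[0]
# What pulls that vehicle? This second hop is what makes it "horse-drawn".
pullers = subjects_of("pulling", vehicle, RELATIONSHIPS)
answer = "%s-drawn %s" % (pullers[0], vehicle)
```

Answering correctly requires both triples: object detection alone would find a man, a horse, and a carriage without saying which one is being ridden.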

Love Thy Neighbors: Image Annotation by Exploiting Image Metadata

Sep 22, 2015

Justin Johnson, Lamberto Ballan, Fei-Fei Li

Abstract: Some images that are difficult to recognize on their own may become clearer in the context of a neighborhood of related images with similar social-network metadata. We build on this intuition to improve multilabel image annotation. Our model uses image metadata nonparametrically to generate neighborhoods of related images using Jaccard similarities, then uses a deep neural network to blend visual information from the image and its neighbors. Prior work typically models image metadata parametrically; in contrast, our nonparametric treatment allows our model to perform well even when the vocabulary of metadata changes between training and testing. We perform comprehensive experiments on the NUS-WIDE dataset, where we show that our model outperforms state-of-the-art methods for multilabel image annotation even when our model is forced to generalize to new types of metadata.

* Accepted to ICCV 2015
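The neighborhood step described in the abstract can be sketched as follows: score each candidate image by the Jaccard similarity between its metadata set and the query's, then keep the top k. This covers only the nonparametric neighbor lookup, not the neural network that blends visual features. The function names, the tag sets, and the value of k are assumptions for illustration.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def nearest_neighbors(query: Set[str], corpus: Dict[str, Set[str]], k: int) -> List[str]:
    """Return the ids of the k corpus images whose metadata best matches the query."""
    ranked = sorted(
        corpus.items(),
        key=lambda item: jaccard(query, item[1]),
        reverse=True,
    )
    return [image_id for image_id, _ in ranked[:k]]


# Hypothetical metadata: each image id maps to its set of user tags.
corpus = {
    "img_a": {"beach", "sun"},
    "img_b": {"city"},
    "img_c": {"beach", "sun", "sea"},
}
neighbors = nearest_neighbors({"beach", "sun"}, corpus, k=2)
```

Because the similarity is computed directly on sets, nothing in this step depends on a fixed metadata vocabulary, which is consistent with the abstract's point about vocabularies changing between training and testing.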
