Abstract: Training a reinforcement learning agent to carry out natural language instructions is limited by the available supervision, i.e., knowing when the instruction has been carried out. We adapt the CLEVR visual question answering dataset to generate complex natural language navigation instructions and accompanying scene graphs, yielding an environment-agnostic supervised dataset. To demonstrate the use of this dataset, we map the scenes to the VizDoom environment and use the architecture of \citet{gatedattention} to train an agent to carry out these more complex language instructions.
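The gated-attention architecture cited above is, at its core, an element-wise gating of convolutional image features by a sigmoid vector derived from the instruction embedding. The PyTorch sketch below illustrates such a fusion layer under assumed shapes and names; it is not the authors' code, and the toy feature tensors stand in for a CNN image encoder and a GRU instruction encoder.

```python
# Minimal sketch of gated-attention fusion (illustrative assumption, not the cited implementation).
import torch
import torch.nn as nn


class GatedAttentionFusion(nn.Module):
    """Gate each image-feature channel with a sigmoid attention vector
    computed from the instruction embedding."""

    def __init__(self, instr_dim: int, num_channels: int):
        super().__init__()
        self.gate = nn.Linear(instr_dim, num_channels)

    def forward(self, image_feats: torch.Tensor, instr_emb: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, channels, H, W); instr_emb: (batch, instr_dim)
        attention = torch.sigmoid(self.gate(instr_emb))       # (batch, channels)
        attention = attention.unsqueeze(-1).unsqueeze(-1)     # (batch, channels, 1, 1)
        return image_feats * attention                        # broadcast over H, W


if __name__ == "__main__":
    fusion = GatedAttentionFusion(instr_dim=64, num_channels=128)
    feats = torch.randn(4, 128, 8, 8)   # toy CNN feature maps
    instr = torch.randn(4, 64)          # toy instruction embedding
    print(fusion(feats, instr).shape)   # torch.Size([4, 128, 8, 8])
```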
Abstract: Recent years have witnessed exciting developments in generating images from scene-based text descriptions. These approaches have primarily focused on generating an image from a static text description in a single pass; they cannot build an image interactively from an incrementally additive text description, which is closer to the way we naturally describe a scene. We propose a method to generate an image incrementally based on a sequence of scene-description graphs (scene graphs). Our recurrent network architecture preserves the image content generated in previous steps and modifies the cumulative image according to the newly provided scene information. The model uses Graph Convolutional Networks (GCNs) to handle variable-sized scene graphs, together with generative adversarial image translation networks, to produce realistic multi-object images without requiring any intermediate supervision during training. We experiment with the COCO-Stuff dataset, which contains multi-object images with annotations describing the visual scene, and show that our model significantly outperforms other approaches on the same dataset in generating visually consistent images for incrementally growing scene graphs.
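As a rough illustration of the recurrent scene-graph idea, the PyTorch sketch below runs one graph-convolution step over the object nodes of each incoming scene graph and folds the pooled result into a cumulative latent via a GRU cell, which a downstream generator (e.g. an adversarially trained decoder) would turn into an image. All module names, dimensions, and the choice of a GRU cell are assumptions for illustration, not the paper's implementation.

```python
# Sketch of a recurrent scene-graph state (illustrative assumptions throughout).
import torch
import torch.nn as nn


class SimpleGCNLayer(nn.Module):
    """One round of message passing over a dense adjacency matrix."""

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (num_nodes, dim); adj: (num_nodes, num_nodes)
        messages = adj @ node_feats                # aggregate neighbour features
        return torch.relu(self.linear(messages))   # transform and activate


class RecurrentSceneState(nn.Module):
    """Maintain a cumulative latent that is updated as new scene-graph
    information arrives; a generator would decode it into an image."""

    def __init__(self, dim: int):
        super().__init__()
        self.gcn = SimpleGCNLayer(dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, state: torch.Tensor, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        graph_code = self.gcn(node_feats, adj).mean(dim=0, keepdim=True)  # pool nodes
        return self.update(graph_code, state)                             # new cumulative latent


if __name__ == "__main__":
    dim = 32
    model = RecurrentSceneState(dim)
    state = torch.zeros(1, dim)                # empty scene
    for num_objects in (2, 3, 4):              # incrementally growing scene graph
        nodes = torch.randn(num_objects, dim)
        adj = torch.ones(num_objects, num_objects)
        state = model(state, nodes, adj)
    print(state.shape)                         # torch.Size([1, 32])
```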
Abstract: The task of retrieving clips within videos based on a given natural language query requires cross-modal reasoning over multiple frames. Prior approaches such as sliding-window classifiers are inefficient, while text-clip similarity-driven ranking approaches such as segment proposal networks are far more complicated. To select the video clip most relevant to the given text description, we propose a novel extractive approach that predicts the start and end frames by leveraging cross-modal interactions between the text and video, removing the need to retrieve and re-rank multiple proposal segments. Using recurrent networks, we encode the two modalities into a joint representation, which is then used in different variants of start-end frame predictor networks. Through extensive experimentation and ablative analysis, we demonstrate that our simple and elegant approach significantly outperforms the state of the art on two datasets and performs comparably on a third.
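The extractive start/end formulation can be sketched as follows: recurrent encoders summarize the frame features and the query words, the two are fused into a joint per-frame representation, and two linear heads score every frame as a candidate start or end. The PyTorch sketch below uses assumed dimensions and a simple concatenation fusion; it is an illustration of the general idea, not the paper's exact predictor variants.

```python
# Sketch of a start-end frame predictor (shapes and fusion are illustrative assumptions).
import torch
import torch.nn as nn


class StartEndPredictor(nn.Module):
    def __init__(self, video_dim: int, text_dim: int, hidden: int):
        super().__init__()
        self.video_rnn = nn.GRU(video_dim, hidden, batch_first=True)
        self.text_rnn = nn.GRU(text_dim, hidden, batch_first=True)
        self.start_head = nn.Linear(2 * hidden, 1)
        self.end_head = nn.Linear(2 * hidden, 1)

    def forward(self, frames: torch.Tensor, words: torch.Tensor):
        # frames: (batch, T, video_dim); words: (batch, L, text_dim)
        video_states, _ = self.video_rnn(frames)            # (batch, T, hidden)
        _, text_state = self.text_rnn(words)                # (1, batch, hidden)
        text_state = text_state[-1].unsqueeze(1)            # (batch, 1, hidden)
        joint = torch.cat([video_states,
                           text_state.expand_as(video_states)], dim=-1)
        start_logits = self.start_head(joint).squeeze(-1)   # (batch, T)
        end_logits = self.end_head(joint).squeeze(-1)       # (batch, T)
        return start_logits, end_logits


if __name__ == "__main__":
    model = StartEndPredictor(video_dim=512, text_dim=300, hidden=256)
    frames = torch.randn(2, 40, 512)    # toy per-frame video features
    words = torch.randn(2, 12, 300)     # toy word embeddings for the query
    start, end = model(frames, words)
    print(start.shape, end.shape)       # torch.Size([2, 40]) twice
```

At inference, a softmax over each logit vector gives per-frame start and end probabilities, and the highest-scoring valid (start, end) pair selects the clip without generating or re-ranking proposal segments.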