Seong-Ho Choi

Multimodal Dual Attention Memory for Video Story Question Answering

Sep 21, 2018

Kyung-Min Kim, Seong-Ho Choi, Jin-Hwa Kim, Byoung-Tak Zhang

Abstract: We propose a video story question-answering (QA) architecture, Multimodal Dual Attention Memory (MDAM). The key idea is to use a dual attention mechanism with late fusion. MDAM uses self-attention to learn the latent concepts in scene frames and captions. Given a question, MDAM uses the second attention over these latent concepts. Multimodal fusion is performed after the dual attention processes (late fusion). Using this processing pipeline, MDAM learns to infer a high-level vision-language joint representation from an abstraction of the full video content. We evaluate MDAM on PororoQA and MovieQA datasets which have large-scale QA annotations on cartoon videos and movies, respectively. For both datasets, MDAM achieves new state-of-the-art results with significant margins compared to the runner-up models. We confirm the best performance of the dual attention mechanism combined with late fusion by ablation studies. We also perform qualitative analysis by visualizing the inference mechanisms of MDAM.

* Accepted for ECCV 2018
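
The processing order described in the MDAM abstract above (self-attention within each modality, then question-guided attention, then fusion at the very end) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the absence of learned weights, the plain dot-product scoring, and the element-wise product used for fusion are all assumptions made only to show the order of operations.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, items):
    """Combine item vectors using the given attention weights."""
    dim = len(items[0])
    return [sum(w * it[d] for w, it in zip(weights, items)) for d in range(dim)]

def self_attention(items):
    """First attention: every item attends over all items of its own modality.
    Scaled dot-product without learned projections (an assumption)."""
    scale = math.sqrt(len(items[0]))
    return [weighted_sum(softmax([dot(q, k) / scale for k in items]), items)
            for q in items]

def question_attention(question, items):
    """Second attention: pool the items into one vector, weighted by
    their similarity to the question vector."""
    return weighted_sum(softmax([dot(question, it) for it in items]), items)

def dual_attention_late_fusion(frames, captions, question):
    # Each modality goes through both attention steps independently.
    frame_summary = question_attention(question, self_attention(frames))
    caption_summary = question_attention(question, self_attention(captions))
    # Late fusion: modalities are combined only after both attention steps.
    # The element-wise product is a placeholder fusion operator (assumption).
    return [f * c for f, c in zip(frame_summary, caption_summary)]

# Hypothetical 2-dimensional feature vectors, for illustration only.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
captions = [[0.5, 0.5], [1.0, 0.0]]
question = [1.0, 0.0]
fused = dual_attention_late_fusion(frames, captions, question)
```

An early-fusion variant would instead combine frame and caption features before any attention is applied; the sketch keeps the two streams separate until the last line of `dual_attention_late_fusion`.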

DeepStory: Video Story QA by Deep Embedded Memory Networks

Jul 04, 2017

Kyung-Min Kim, Min-Oh Heo, Seong-Ho Choi, Byoung-Tak Zhang

Abstract: Question-answering (QA) on video contents is a significant challenge for achieving human-level intelligence as it involves both vision and language in real-world settings. Here we demonstrate the possibility of an AI agent performing video story QA by learning from a large amount of cartoon videos. We develop a video-story learning model, i.e. Deep Embedded Memory Networks (DEMN), to reconstruct stories from a joint scene-dialogue video stream using a latent embedding space of observed data. The video stories are stored in a long-term memory component. For a given question, an LSTM-based attention model uses the long-term memory to recall the best question-story-answer triplet by focusing on specific words containing key information. We trained the DEMN on a novel QA dataset of children's cartoon video series, Pororo. The dataset contains 16,066 scene-dialogue pairs of 20.5-hour videos, 27,328 fine-grained sentences for scene description, and 8,913 story-related QA pairs. Our experimental results show that the DEMN outperforms other QA models. This is mainly due to 1) the reconstruction of video stories in a scene-dialogue combined form that utilizes the latent embedding and 2) attention. DEMN also achieved state-of-the-art results on the MovieQA benchmark.

* 7 pages, accepted for IJCAI 2017
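
The idea in the DeepStory abstract above of recalling the best question-story-answer triplet from a long-term memory can be illustrated with a much simpler stand-in. The sketch below replaces the paper's LSTM-based attention model with plain cosine similarity over fixed vectors, so it shows only the retrieval structure, not DEMN itself. The function names, the additive question-plus-story context, and the example vectors are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def recall_best_triplet(question, memory, answers):
    """Return (story_index, answer_index) for the best-scoring triplet.
    Cosine scoring stands in for the learned attention model (assumption)."""
    # Step 1: recall the stored story embedding closest to the question.
    story_idx = max(range(len(memory)),
                    key=lambda i: cosine(question, memory[i]))
    # Step 2: score each candidate answer against question + story context.
    context = [q + s for q, s in zip(question, memory[story_idx])]
    answer_idx = max(range(len(answers)),
                     key=lambda j: cosine(context, answers[j]))
    return story_idx, answer_idx

# Hypothetical embeddings: three stored stories, two candidate answers.
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
answers = [[0.0, 0.0, 1.0], [0.9, 0.1, 0.0]]
question = [0.8, 0.2, 0.0]
best = recall_best_triplet(question, memory, answers)  # (0, 1)
```

Here the question is closest to the first stored story, and the second candidate answer aligns best with the combined question-story context.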
