Title: STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering

Jan 08, 2024

Yueqian Wang, Yuxuan Wang, Kai Chen, Dongyan Zhao

Abstract: Video question answering models have developed rapidly in recent years. However, most models handle temporal reasoning only on simple videos, and their performance tends to drop on temporal-reasoning questions about long, information-rich videos. To tackle this problem, we propose STAIR, a Spatial-Temporal Reasoning model with Auditable Intermediate Results for video question answering. STAIR is a neural module network that contains a program generator, which decomposes a given question into a hierarchical combination of sub-tasks, and a set of lightweight neural modules, each of which completes one of these sub-tasks. Although neural module networks have been widely studied on image-text tasks, applying them to videos is non-trivial, because reasoning over videos requires different abilities. In this paper, we define a set of basic video-text sub-tasks for video question answering and design a set of lightweight modules to complete them. Unlike most prior work, the modules of STAIR return intermediate outputs specific to their intentions instead of always returning attention maps, which makes these outputs easier to interpret and easier to combine with pre-trained models. We also introduce intermediate supervision to make these intermediate outputs more accurate. We conduct extensive experiments on several video question answering datasets under various settings to show STAIR's performance, explainability, compatibility with pre-trained models, and applicability when program annotations are not available. Code: https://github.com/yellow-binary-tree/STAIR

* To appear in AAAI 2024
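
The abstract describes a program that decomposes a question into nested sub-tasks, each handled by a module whose output can be inspected. The sketch below is a purely symbolic toy illustration of that idea, not the authors' implementation: the module names (`find`, `after`, `restrict`, `exists`), the tuple-based program format, and the label-set "video" are all assumptions made up for this example. In STAIR itself the modules are learned neural networks and the program comes from a program generator; here the program is written by hand.

```python
from typing import Any, Callable, Dict, List, Set, Tuple

# Toy "video": one set of symbolic labels per frame (a stand-in for visual features).
Video = List[Set[str]]


def find(video: Video, label: str) -> List[int]:
    """Return the indices of frames that contain `label`."""
    return [i for i, frame in enumerate(video) if label in frame]


def after(video: Video, frames: List[int]) -> List[int]:
    """Return the indices of all frames strictly after the last frame in `frames`."""
    if not frames:
        return []
    return list(range(max(frames) + 1, len(video)))


def restrict(video: Video, frames: List[int], label: str) -> List[int]:
    """Keep only those frames in `frames` that contain `label`."""
    return [i for i in frames if label in video[i]]


def exists(video: Video, frames: List[int]) -> bool:
    """Return True if at least one frame was selected."""
    return len(frames) > 0


# Hypothetical module registry; each module returns a typed, inspectable value.
MODULES: Dict[str, Callable[..., Any]] = {
    "find": find,
    "after": after,
    "restrict": restrict,
    "exists": exists,
}


def execute(program: Tuple[Any, ...], video: Video, trace: List[Tuple[str, Any]]) -> Any:
    """Run a nested program bottom-up, recording every intermediate result."""
    name, *args = program
    # Nested tuples are sub-programs; anything else is a literal argument.
    values = [execute(a, video, trace) if isinstance(a, tuple) else a for a in args]
    result = MODULES[name](video, *values)
    trace.append((name, result))
    return result


# Hand-written program for: "Does the person sit down after opening the door?"
program = ("exists", ("restrict", ("after", ("find", "open door")), "sit down"))

video: Video = [
    {"person"},
    {"person", "open door"},
    {"person"},
    {"person", "sit down"},
]

trace: List[Tuple[str, Any]] = []
answer = execute(program, video, trace)

print("answer:", answer)
for step, output in trace:
    print(step, "->", output)
```

Because each step's output is stored in `trace`, a wrong final answer can be traced to the sub-task that produced it, which is the kind of auditing the abstract refers to. In this toy the outputs are frame indices and a boolean rather than the module-specific outputs used in the paper.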
