Title: Pano-AVQA: Grounded Audio-Visual Question Answering on 360$^\circ$ Videos

Oct 11, 2021

Heeseung Yun, Youngjae Yu, Wonsuk Yang, Kangil Lee, Gunhee Kim

Abstract: 360$^\circ$ videos convey holistic views of the surroundings of a scene. They provide audio-visual cues beyond a predetermined normal field of view and display distinctive spatial relations on a sphere. However, previous benchmark tasks for panoramic videos are still limited in evaluating the semantic understanding of audio-visual relationships or of the spherical spatial properties of the surroundings. We propose a novel benchmark named Pano-AVQA, a large-scale grounded audio-visual question answering dataset on panoramic videos. Using 5.4K 360$^\circ$ video clips harvested online, we collect two types of novel question-answer pairs with bounding-box grounding: spherical spatial relation QAs and audio-visual relation QAs. We train several transformer-based models on Pano-AVQA; the results suggest that our proposed spherical spatial embeddings and multimodal training objectives fairly contribute to a better semantic understanding of the panoramic surroundings on the dataset.

* Published at ICCV 2021
