Title: Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations

Feb 21, 2022

Yoshihiro Yamazaki, Shota Orihashi, Ryo Masumura, Mihiro Uchida, Akihiko Takashima

Abstract: There have been many attempts to build multimodal dialog systems that can respond to a question about given audio-visual information, and the representative task for such systems is Audio Visual Scene-Aware Dialog (AVSD). Most conventional AVSD models adopt a Convolutional Neural Network (CNN)-based video feature extractor to understand visual information. While a CNN tends to capture information that is local in both time and space, global information is also crucial for video understanding because AVSD requires long-term temporal visual dependencies and information from the whole visual scene. In this study, we apply a Transformer-based video feature that can capture both temporally and spatially global representations more efficiently than a CNN-based feature. Our AVSD model with this Transformer-based feature attains higher objective performance scores for answer generation. In addition, our model achieves a subjective score close to that of human answers in DSTC10. We observed that the Transformer-based visual feature is beneficial for the AVSD task because our model tends to correctly answer questions that need a temporally and spatially broad range of visual information.

* Accepted at DSTC10 Workshop at AAAI 2022
