Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

Dec 14, 2020

Yuma Koizumi, Yasunori Ohishi, Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda

Figure 1 for Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

Figure 2 for Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

Figure 3 for Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

Figure 4 for Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

Share this with someone who'll enjoy it:

Abstract:The goal of audio captioning is to translate input audio into its description using natural language. One of the problems in audio captioning is the lack of training data due to the difficulty in collecting audio-caption pairs by crawling the web. In this study, to overcome this problem, we propose to use a pre-trained large-scale language model. Since an audio input cannot be directly inputted into such a language model, we utilize guidance captions retrieved from a training dataset based on similarities that may exist in different audio. Then, the caption of the audio input is generated by using a pre-trained language model while referring to the guidance captions. Experimental results show that (i) the proposed method has succeeded to use a pre-trained language model for audio captioning, and (ii) the oracle performance of the pre-trained model-based caption generator was clearly better than that of the conventional method trained from scratch.

* Submitted to ICASSP 2021

View paper on

Share this with someone who'll enjoy it:

Title:Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

Paper and Code