H Lilian Tang

An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning

Aug 05, 2021

Xinhao Mei, Qiushi Huang, Xubo Liu, Gengyun Chen, Jingqian Wu, Yusong Wu, Jinzheng Zhao, Shengchen Li, Tom Ko, H Lilian Tang(+3 more)

Abstract: Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features extracted by the encoder. To improve the proposed system, transfer learning from either an upstream audio-related task or a large in-domain dataset is introduced to mitigate the problem of data scarcity. In addition, evaluation metrics are incorporated into the optimization of the model with reinforcement learning, which helps address the "exposure bias" induced by the "teacher forcing" training strategy and the mismatch between the evaluation metrics and the loss function. The resulting system was ranked 3rd in DCASE 2021 Task 6. Ablation studies are carried out to investigate how much each element of the proposed system contributes to the final performance. The results show that the proposed techniques significantly improve the scores of the evaluation metrics; however, reinforcement learning may adversely affect the quality of the generated captions.

* 5 pages, 1 figure, submitted to DCASE 2021 workshop
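
The abstract does not say which reinforcement learning algorithm is used, so the following is only a minimal sketch of one common choice: REINFORCE with the greedy caption's score as a baseline (often called self-critical training). The reward function `unigram_precision` is a toy stand-in for a real captioning metric, and all names, captions, and probabilities here are hypothetical.

```python
import math


def unigram_precision(candidate, reference):
    """Toy reward (assumption): share of candidate tokens found in the reference."""
    if not candidate:
        return 0.0
    ref_tokens = set(reference)
    return sum(tok in ref_tokens for tok in candidate) / len(candidate)


def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """REINFORCE loss that uses the greedy caption's reward as the baseline.

    Minimizing it raises the log-probability of sampled captions that
    score better than the greedy caption, and lowers it otherwise.
    """
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_logprobs)


# Hypothetical captions for a single audio clip.
reference = "a dog barks while birds sing".split()
sampled = "a dog barks loudly".split()   # drawn from the model's distribution
greedy = "a bird is singing".split()     # argmax decoding, used as baseline

# Hypothetical per-token probabilities the model assigned to the sampled caption.
sample_logprobs = [math.log(p) for p in (0.5, 0.4, 0.3, 0.1)]

r_sample = unigram_precision(sampled, reference)
r_greedy = unigram_precision(greedy, reference)
loss = self_critical_loss(sample_logprobs, r_sample, r_greedy)
```

Because the reward is computed on captions the model generates itself, training no longer depends only on ground-truth prefixes, which is the link to the exposure-bias problem the abstract mentions.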


CL4AC: A Contrastive Loss for Audio Captioning

Jul 21, 2021

Xubo Liu, Qiushi Huang, Xinhao Mei, Tom Ko, H Lilian Tang, Mark D. Plumbley, Wenwu Wang

Abstract: Automated audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip. As shown by the submissions received for Task 6 of the DCASE 2021 Challenge, this problem has received increasing interest in the community. Existing AAC systems are usually based on an encoder-decoder architecture, where the audio signal is encoded into a latent representation and aligned with its corresponding text descriptions, and a decoder is then used to generate the captions. However, training an AAC system often encounters the problem of data scarcity, which may lead to inaccurate representation and audio-text alignment. To address this problem, we propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC). In CL4AC, self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and text by contrasting samples, which can improve the quality of the latent representation and the alignment between audio and text even when training with limited data. Experiments are performed on the Clotho dataset to show the effectiveness of our proposed approach.

* The first two authors contributed equally, 5 pages, 3 figures, submitted to DCASE2021 Workshop
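
The abstract does not give the exact form of the contrastive loss, so the sketch below substitutes a generic InfoNCE-style objective to illustrate the idea of contrasting matched and mismatched audio-text pairs. It should not be read as the CL4AC formulation; the function names, the temperature value, and the two-dimensional embeddings are all assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosines."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(audio_embs, text_embs, temperature=0.1):
    """Generic InfoNCE loss: audio i should match text i, not the other texts."""
    audio = [normalize(a) for a in audio_embs]
    text = [normalize(t) for t in text_embs]
    total = 0.0
    for i, a in enumerate(audio):
        logits = [dot(a, t) / temperature for t in text]
        # Numerically stable log-sum-exp over all candidate captions.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(audio)


# Hypothetical embeddings for two audio clips and their captions.
audio_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]    # caption i sits next to clip i
shuffled_text = [[0.0, 1.0], [1.0, 0.0]]   # captions swapped

loss_aligned = info_nce(audio_embs, aligned_text)
loss_shuffled = info_nce(audio_embs, shuffled_text)
```

In this toy setup the loss is close to zero when each caption embedding lies next to its own clip and large when the captions are swapped, which is the alignment pressure a contrastive term adds on top of the captioning loss.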


Token-Level Supervised Contrastive Learning for Punctuation Restoration

Jul 19, 2021

Qiushi Huang, Tom Ko, H Lilian Tang, Xubo Liu, Bo Wu

Abstract: Punctuation is critical in understanding natural language text. Currently, most automatic speech recognition (ASR) systems do not generate punctuation, which affects the performance of downstream tasks such as intent detection and slot filling. This gives rise to the need for punctuation restoration. Recent work in punctuation restoration relies heavily on pre-trained language models without considering data imbalance when predicting punctuation classes. In this work, we address this problem by proposing a token-level supervised contrastive learning method that aims to maximize the distance between the representations of different punctuation marks in the embedding space. The results show that training with token-level supervised contrastive learning obtains up to 3.2% absolute F1 improvement on the test set.

* 5 pages, 3 figures
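
As a rough illustration of token-level supervised contrastive learning, the sketch below applies the generic supervised contrastive (SupCon) loss to per-token embeddings, treating tokens that share a punctuation label as positives. The paper's exact loss may differ; the label names, temperature, and two-dimensional embeddings are assumptions made for the example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over token embeddings.

    Tokens with the same label are pulled together and tokens with
    different labels are pushed apart. Anchors without any positive
    in the batch are skipped.
    """
    z = [normalize(e) for e in embeddings]
    n = len(z)
    losses = []
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        logits = {j: dot(z[i], z[j]) / temperature for j in range(n) if j != i}
        # Numerically stable log-sum-exp over every token except the anchor.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits.values()))
        losses.append(-sum(logits[p] - log_denom for p in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical punctuation labels for four tokens.
labels = ["O", "O", "COMMA", "COMMA"]
clustered = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # classes separated
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]      # classes interleaved

loss_clustered = supcon_loss(clustered, labels)
loss_mixed = supcon_loss(mixed, labels)
```

The loss is small when tokens with the same label already cluster together and large when the classes are interleaved, so minimizing it drives the embeddings of different punctuation marks apart.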
