Jinjoo Lee

EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance

Sep 02, 2024

Jaeyeon Kim, Minjeong Jeon, Jaeyoon Jung, Sang Hoon Woo, Jinjoo Lee

Abstract: In this work, we aim to analyze and optimize the EnCLAP framework, a state-of-the-art model in automated audio captioning. We investigate the impact of modifying the acoustic encoder components, explore pretraining with different dataset scales, and study the effectiveness of a reranking scheme. Through extensive experimentation and quantitative analysis of generated captions, we develop EnCLAP++, an enhanced version that significantly surpasses the original.

* Accepted to DCASE2024 Workshop

Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning

Sep 02, 2024

Jaeyeon Kim, Jaeyoon Jung, Minjeong Jeon, Sang Hoon Woo, Jinjoo Lee

Abstract: In this technical report, we describe our submission to DCASE2024 Challenge Task6 (Automated Audio Captioning) and Task8 (Language-based Audio Retrieval). We build our approach on the EnCLAP audio captioning framework and optimize it for Task6 of the challenge. Notably, we outline the changes in the underlying components and the incorporation of the reranking process. Additionally, we submit a supplementary retriever model, a byproduct of our modified framework, to Task8. Our proposed systems achieve a FENSE score of 0.542 on Task6 and an mAP@10 score of 0.386 on Task8, significantly outperforming the baseline models.

* DCASE2024 Challenge Technical Report. Ranked 2nd in Task 6 Automated Audio Captioning
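
For readers unfamiliar with the mAP@10 retrieval metric quoted in the abstract above, the following is a minimal sketch of one common way to compute it. It is not taken from the report or from the challenge's evaluation code. The function names are hypothetical, and normalizing by `min(len(relevant), k)` is an assumed convention; other definitions exist.

```python
from typing import Hashable, Sequence, Set


def average_precision_at_k(ranked: Sequence[Hashable],
                           relevant: Set[Hashable],
                           k: int = 10) -> float:
    """AP@k for one query, assuming `ranked` contains no duplicate items."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    # Walk the top-k results; add precision@i at every rank i that holds a relevant item.
    for i, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / i
    # Assumed normalization: the best achievable number of hits within the top k.
    return precision_sum / min(len(relevant), k)


def mean_average_precision_at_k(rankings: Sequence[Sequence[Hashable]],
                                relevants: Sequence[Set[Hashable]],
                                k: int = 10) -> float:
    """Mean of AP@k over all queries (0.0 when there are no queries)."""
    scores = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: two text queries, each with a single relevant audio clip.
rankings = [["a", "b", "c"], ["x", "y", "z"]]
relevants = [{"a"}, {"z"}]
score = mean_average_precision_at_k(rankings, relevants, k=10)
```

When each query has exactly one relevant item, AP@10 reduces to the reciprocal of that item's rank, or zero if it falls outside the top ten. In the toy example the two queries score 1 and 1/3.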

EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning

Jan 31, 2024

Jaeyeon Kim, Jaeyoon Jung, Jinjoo Lee, Sang Hoon Woo

Abstract: We propose EnCLAP, a novel framework for automated audio captioning. EnCLAP employs two acoustic representation models, EnCodec and CLAP, along with a pretrained language model, BART. We also introduce a new training objective called masked codec modeling that improves acoustic awareness of the pretrained language model. Experimental results on AudioCaps and Clotho demonstrate that our model surpasses the performance of baseline models. Source code will be available at https://github.com/jaeyeonkim99/EnCLAP. An online demo is available at https://huggingface.co/spaces/enclap-team/enclap.

* Accepted to ICASSP 2024
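
The abstract above names masked codec modeling but does not spell out how it works, so the following is only a generic illustration of masked-token corruption over a sequence of discrete codec codes. It is not the authors' implementation. The function name, the `mask_id` value, the masking probability, and the `-100` ignore label are all assumptions made for the sketch.

```python
import random
from typing import List, Tuple

IGNORE_INDEX = -100  # assumed label value for positions excluded from the loss


def mask_codes(codes: List[int],
               mask_id: int,
               mask_prob: float = 0.15,
               seed: int = 0) -> Tuple[List[int], List[int]]:
    """Replace a random subset of codes with `mask_id`.

    Returns (inputs, labels): labels hold the original code at masked
    positions and IGNORE_INDEX everywhere else.
    """
    rng = random.Random(seed)  # local generator so the result is reproducible
    inputs: List[int] = []
    labels: List[int] = []
    for code in codes:
        if rng.random() < mask_prob:
            inputs.append(mask_id)      # hide the code from the model
            labels.append(code)         # ask the model to predict it
        else:
            inputs.append(code)         # keep the code visible
            labels.append(IGNORE_INDEX) # no prediction target here
    return inputs, labels


# Toy example: ten codes from a hypothetical 1024-entry codebook, with 1024 used as the mask ID.
codes = [5, 17, 980, 3, 44, 512, 7, 7, 1023, 0]
inputs, labels = mask_codes(codes, mask_id=1024, mask_prob=0.3, seed=42)
```

A model trained on such pairs would be scored only at the masked positions, which is the usual shape of a masked-prediction objective.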
