Title: Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer

Dec 16, 2021

Yanpeng Zhao, Jack Hessel, Youngjae Yu, Ximing Lu, Rowan Zellers, Yejin Choi

Abstract: Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning systems. Prevailing learning paradigms rely on parallel audio-text data, which is scarce on the web. We propose VIP-ANT, which induces Audio-Text alignment without using any parallel audio-text data. Our key idea is to share the image modality between bi-modal image-text representations and bi-modal image-audio representations; the image modality functions as a pivot and implicitly connects audio and text in a tri-modal embedding space. In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1. We further investigate cases of minimal audio-text supervision, finding that, e.g., just a few hundred supervised audio-text pairs increase the zero-shot audio classification accuracy by 8% on US8K. However, our empirical scaling experiments suggest that reaching human parity on some zero-shot tasks would require about 2^21 ≈ 2M supervised audio-caption pairs. Our work opens up new avenues for learning audio-text connections with little to no parallel audio-text data.

* Our code is available at https://github.com/zhaoyanpeng/vipant
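
To make the zero-shot setting in the abstract more concrete, here is a minimal sketch of how classification could work once audio and text live in one shared embedding space: embed the audio clip, embed each candidate label, and pick the label with the highest cosine similarity. This is an illustration only, not the authors' implementation. The function names `cosine` and `zero_shot_classify` are hypothetical, and the three-dimensional vectors are hand-made stand-ins for the outputs of real audio and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(audio_emb, label_embs):
    """Return the label whose text embedding is most similar to the audio embedding."""
    return max(label_embs, key=lambda label: cosine(audio_emb, label_embs[label]))


# Toy vectors standing in for text-encoder outputs (assumption, not real data).
label_embs = {
    "dog bark": [0.9, 0.1, 0.0],
    "siren": [0.0, 0.8, 0.6],
    "rain": [0.1, 0.2, 0.9],
}

# Toy vector standing in for an audio-encoder output (assumption, not real data).
audio_emb = [0.1, 0.7, 0.7]

prediction = zero_shot_classify(audio_emb, label_embs)
print(prediction)
```

No audio-text pairs are needed at inference time in this sketch; the only requirement is that both encoders map into the same space, which the abstract attributes to the shared image pivot.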
