Title: A-JEPA: Joint-Embedding Predictive Architecture Can Listen

Nov 28, 2023

Zhengcong Fei, Mingyuan Fan, Junshi Huang

Abstract: This paper shows that the masked-modeling principle driving the success of large foundational vision models can be effectively applied to audio by making predictions in a latent space. We introduce the Audio-based Joint-Embedding Predictive Architecture (A-JEPA), a simple extension for self-supervised learning from the audio spectrum. Following the design of I-JEPA, A-JEPA encodes visible audio spectrogram patches, selected by a curriculum masking strategy, with a context encoder, and predicts the representations of regions sampled at well-designed locations. The target representations of those regions are extracted from the whole spectrogram by a target encoder, i.e., the exponential moving average of the context encoder. Because audio spectrograms are highly correlated in local time and frequency, we find it beneficial to move from random block masking to time-frequency aware masking in a curriculum manner. To enhance contextual semantic understanding and robustness, we fine-tune the encoder on target datasets with a regularized masking, instead of dropping or zeroing the input. Empirically, when built on a Vision Transformer backbone, A-JEPA is highly scalable and sets new state-of-the-art performance on multiple audio and speech classification tasks, outperforming other recent models that use externally supervised pre-training.
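Two mechanisms named in the abstract, the exponential-moving-average target encoder and the curriculum from random block masking to time-frequency aware masking, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the momentum value 0.996, the linear schedule, the block and stripe sizes, and the choice of full time and frequency stripes as the "time-frequency aware" mask are all assumptions made for illustration, since the abstract does not specify them.

```python
import random


def ema_update(target, context, momentum=0.996):
    """Move each target weight toward the matching context weight.

    The momentum value is an assumed default, not taken from the paper.
    """
    return [momentum * t + (1.0 - momentum) * c for t, c in zip(target, context)]


def block_mask(n_time, n_freq, block_t, block_f, rng):
    """Random block masking: hide one rectangle of spectrogram patches."""
    t0 = rng.randrange(n_time - block_t + 1)
    f0 = rng.randrange(n_freq - block_f + 1)
    return {(t, f)
            for t in range(t0, t0 + block_t)
            for f in range(f0, f0 + block_f)}


def time_freq_mask(n_time, n_freq, width_t, width_f, rng):
    """Assumed time-frequency aware masking: hide one full time stripe
    and one full frequency stripe of patches."""
    t0 = rng.randrange(n_time - width_t + 1)
    f0 = rng.randrange(n_freq - width_f + 1)
    stripe_t = {(t, f) for t in range(t0, t0 + width_t) for f in range(n_freq)}
    stripe_f = {(t, f) for t in range(n_time) for f in range(f0, f0 + width_f)}
    return stripe_t | stripe_f


def curriculum_mask(step, total_steps, n_time, n_freq, rng):
    """Choose time-frequency masking with probability step / total_steps
    (a hypothetical linear schedule), otherwise random block masking."""
    p_tf = min(1.0, step / total_steps)
    if rng.random() < p_tf:
        return time_freq_mask(n_time, n_freq, 2, 2, rng)
    return block_mask(n_time, n_freq, 4, 4, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # Early in training the mask is a 4x4 block; late it is two stripes.
    early = curriculum_mask(0, 100, 8, 8, rng)
    late = curriculum_mask(100, 100, 8, 8, rng)
    print(len(early), len(late))
    # Target weights drift slowly toward the context weights.
    print(ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9))
```

In this sketch the mask is a set of (time, frequency) patch indices; the context encoder would see only the patches outside the set, while the target encoder, updated with `ema_update` after each optimizer step, would see the whole spectrogram.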
