Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Dmitry Bogdanov

Supervised contrastive learning from weakly-labeled audio segments for musical version matching

Feb 24, 2025

Joan Serrà, R. Oguz Araz, Dmitry Bogdanov, Yuki Mitsufuji

Abstract:Detecting musical versions (different renditions of the same piece) is a challenging task with important applications. Because of the ground truth nature, existing approaches match musical versions at the track level (e.g., whole song). However, most applications require to match them at the segment level (e.g., 20s chunks). In addition, existing approaches resort to classification and triplet losses, disregarding more recent losses that could bring meaningful improvements. In this paper, we propose a method to learn from weakly annotated segments, together with a contrastive loss variant that outperforms well-studied alternatives. The former is based on pairwise segment distance reductions, while the latter modifies an existing loss following decoupling, hyper-parameter, and geometric considerations. With these two elements, we do not only achieve state-of-the-art results in the standard track-level evaluation, but we also obtain a breakthrough performance in a segment-level evaluation. We believe that, due to the generality of the challenges addressed here, the proposed methods may find utility in domains beyond audio or musical version matching.

* 15 pages, 6 figures, 7 tables; includes Appendix

Via

Access Paper or Ask Questions

Discogs-VI: A Musical Version Identification Dataset Based on Public Editorial Metadata

Oct 22, 2024

R. Oguz Araz, Xavier Serra, Dmitry Bogdanov

Figure 1 for Discogs-VI: A Musical Version Identification Dataset Based on Public Editorial Metadata

Figure 2 for Discogs-VI: A Musical Version Identification Dataset Based on Public Editorial Metadata

Figure 3 for Discogs-VI: A Musical Version Identification Dataset Based on Public Editorial Metadata

Figure 4 for Discogs-VI: A Musical Version Identification Dataset Based on Public Editorial Metadata

Abstract:Current version identification (VI) datasets often lack sufficient size and musical diversity to train robust neural networks (NNs). Additionally, their non-representative clique size distributions prevent realistic system evaluations. To address these challenges, we explore the untapped potential of the rich editorial metadata in the Discogs music database and create a large dataset of musical versions containing about 1,900,000 versions across 348,000 cliques. Utilizing a high-precision search algorithm, we map this dataset to official music uploads on YouTube, resulting in a dataset of approximately 493,000 versions across 98,000 cliques. This dataset offers over nine times the number of cliques and over four times the number of versions than existing datasets. We demonstrate the utility of our dataset by training a baseline NN without extensive model complexities or data augmentations, which achieves competitive results on the SHS100K and Da-TACOS datasets. Our dataset, along with the tools used for its creation, the extracted audio features, and a trained model, are all publicly available online.

Via

Access Paper or Ask Questions

MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

Aug 02, 2024

Benno Weck, Ilaria Manco, Emmanouil Benetos, Elio Quinton, George Fazekas, Dmitry Bogdanov

Figure 1 for MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

Figure 2 for MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

Figure 3 for MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

Figure 4 for MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

Abstract:Multimodal models that jointly process audio and language hold great promise in audio understanding and are increasingly being adopted in the music domain. By allowing users to query via text and obtain information about a given audio input, these models have the potential to enable a variety of music understanding tasks via language-based interfaces. However, their evaluation poses considerable challenges, and it remains unclear how to effectively assess their ability to correctly interpret music-related inputs with current methods. Motivated by this, we introduce MuChoMusic, a benchmark for evaluating music understanding in multimodal language models focused on audio. MuChoMusic comprises 1,187 multiple-choice questions, all validated by human annotators, on 644 music tracks sourced from two publicly available music datasets, and covering a wide variety of genres. Questions in the benchmark are crafted to assess knowledge and reasoning abilities across several dimensions that cover fundamental musical concepts and their relation to cultural and functional contexts. Through the holistic analysis afforded by the benchmark, we evaluate five open-source models and identify several pitfalls, including an over-reliance on the language modality, pointing to a need for better multimodal integration. Data and code are open-sourced.

* Accepted at ISMIR 2024. Data: https://doi.org/10.5281/zenodo.12709974 Code: https://github.com/mulab-mir/muchomusic Supplementary material: https://mulab-mir.github.io/muchomusic

Via

Access Paper or Ask Questions

Leveraging Pre-Trained Autoencoders for Interpretable Prototype Learning of Music Audio

Feb 14, 2024

Pablo Alonso-Jiménez, Leonardo Pepino, Roser Batlle-Roca, Pablo Zinemanas, Dmitry Bogdanov, Xavier Serra, Martín Rocamora

Figure 1 for Leveraging Pre-Trained Autoencoders for Interpretable Prototype Learning of Music Audio

Figure 2 for Leveraging Pre-Trained Autoencoders for Interpretable Prototype Learning of Music Audio

Figure 3 for Leveraging Pre-Trained Autoencoders for Interpretable Prototype Learning of Music Audio

Abstract:We present PECMAE, an interpretable model for music audio classification based on prototype learning. Our model is based on a previous method, APNet, which jointly learns an autoencoder and a prototypical network. Instead, we propose to decouple both training processes. This enables us to leverage existing self-supervised autoencoders pre-trained on much larger data (EnCodecMAE), providing representations with better generalization. APNet allows prototypes' reconstruction to waveforms for interpretability relying on the nearest training data samples. In contrast, we explore using a diffusion decoder that allows reconstruction without such dependency. We evaluate our method on datasets for music instrument classification (Medley-Solos-DB) and genre recognition (GTZAN and a larger in-house dataset), the latter being a more challenging task not addressed with prototypical networks before. We find that the prototype-based models preserve most of the performance achieved with the autoencoder embeddings, while the sonification of prototypes benefits understanding the behavior of the classifier.

Via

Access Paper or Ask Questions

mir_ref: A Representation Evaluation Framework for Music Information Retrieval Tasks

Dec 12, 2023

Christos Plachouras, Pablo Alonso-Jiménez, Dmitry Bogdanov

Abstract:Music Information Retrieval (MIR) research is increasingly leveraging representation learning to obtain more compact, powerful music audio representations for various downstream MIR tasks. However, current representation evaluation methods are fragmented due to discrepancies in audio and label preprocessing, downstream model and metric implementations, data availability, and computational resources, often leading to inconsistent and limited results. In this work, we introduce mir_ref, an MIR Representation Evaluation Framework focused on seamless, transparent, local-first experiment orchestration to support representation development. It features implementations of a variety of components such as MIR datasets, tasks, embedding models, and tools for result analysis and visualization, while facilitating the implementation of custom components. To demonstrate its utility, we use it to conduct an extensive evaluation of several embedding models across various tasks and datasets, including evaluating their robustness to various audio perturbations and the ease of extracting relevant information from them.

* Machine Learning for Audio Workshop, Neural Information Processing Systems (NeurIPS) 2023, New Orleans, LA

Via

Access Paper or Ask Questions

The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation

Nov 22, 2023

Ilaria Manco, Benno Weck, SeungHeon Doh, Minz Won, Yixiao Zhang, Dmitry Bogdanov, Yusong Wu, Ke Chen, Philip Tovstogan, Emmanouil Benetos(+3 more)

Figure 1 for The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation

Figure 2 for The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation

Figure 3 for The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation

Figure 4 for The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation

Abstract:We introduce the Song Describer dataset (SDD), a new crowdsourced corpus of high-quality audio-caption pairs, designed for the evaluation of music-and-language models. The dataset consists of 1.1k human-written natural language descriptions of 706 music recordings, all publicly accessible and released under Creative Common licenses. To showcase the use of our dataset, we benchmark popular models on three key music-and-language tasks (music captioning, text-to-music generation and music-language retrieval). Our experiments highlight the importance of cross-dataset evaluation and offer insights into how researchers can use SDD to gain a broader understanding of model performance.

* Accepted to NeurIPS 2023 Workshop on Machine Learning for Audio

Via

Access Paper or Ask Questions

Efficient Supervised Training of Audio Transformers for Music Representation Learning

Sep 28, 2023

Pablo Alonso-Jiménez, Xavier Serra, Dmitry Bogdanov

Abstract:In this work, we address music representation learning using convolution-free transformers. We build on top of existing spectrogram-based audio transformers such as AST and train our models on a supervised task using patchout training similar to PaSST. In contrast to previous works, we study how specific design decisions affect downstream music tagging tasks instead of focusing on the training task. We assess the impact of initializing the models with different pre-trained weights, using various input audio segment lengths, using learned representations from different blocks and tokens of the transformer for downstream tasks, and applying patchout at inference to speed up feature extraction. We find that 1) initializing the model from ImageNet or AudioSet weights and using longer input segments are beneficial both for the training and downstream tasks, 2) the best representations for the considered downstream tasks are located in the middle blocks of the transformer, and 3) using patchout at inference allows faster processing than our convolutional baselines while maintaining superior performance. The resulting models, MAEST, are publicly available and obtain the best performance among open models in music tagging tasks.

* Accepted at the 2023 International Society for Music Information Retrieval Conference (ISMIR'23)

Via

Access Paper or Ask Questions

Pre-Training Strategies Using Contrastive Learning and Playlist Information for Music Classification and Similarity

Apr 24, 2023

Pablo Alonso-Jiménez, Xavier Favory, Hadrien Foroughmand, Grigoris Bourdalas, Xavier Serra, Thomas Lidy, Dmitry Bogdanov

Figure 1 for Pre-Training Strategies Using Contrastive Learning and Playlist Information for Music Classification and Similarity

Figure 2 for Pre-Training Strategies Using Contrastive Learning and Playlist Information for Music Classification and Similarity

Figure 3 for Pre-Training Strategies Using Contrastive Learning and Playlist Information for Music Classification and Similarity

Figure 4 for Pre-Training Strategies Using Contrastive Learning and Playlist Information for Music Classification and Similarity

Abstract:In this work, we investigate an approach that relies on contrastive learning and music metadata as a weak source of supervision to train music representation models. Recent studies show that contrastive learning can be used with editorial metadata (e.g., artist or album name) to learn audio representations that are useful for different classification tasks. In this paper, we extend this idea to using playlist data as a source of music similarity information and investigate three approaches to generate anchor and positive track pairs. We evaluate these approaches by fine-tuning the pre-trained models for music multi-label classification tasks (genre, mood, and instrument tagging) and music similarity. We find that creating anchor and positive track pairs by relying on co-occurrences in playlists provides better music similarity and competitive classification results compared to choosing tracks from the same artist as in previous works. Additionally, our best pre-training approach based on playlists provides superior classification performance for most datasets.

* Accepted at the 2023 International Conference on Acoustics, Speech, and Signal Processing (ICASSP'23)

Via

Access Paper or Ask Questions

Enriched Music Representations with Multiple Cross-modal Contrastive Learning

Apr 01, 2021

Andres Ferraro, Xavier Favory, Konstantinos Drossos, Yuntae Kim, Dmitry Bogdanov

Figure 1 for Enriched Music Representations with Multiple Cross-modal Contrastive Learning

Figure 2 for Enriched Music Representations with Multiple Cross-modal Contrastive Learning

Figure 3 for Enriched Music Representations with Multiple Cross-modal Contrastive Learning

Figure 4 for Enriched Music Representations with Multiple Cross-modal Contrastive Learning

Abstract:Modeling various aspects that make a music piece unique is a challenging task, requiring the combination of multiple sources of information. Deep learning is commonly used to obtain representations using various sources of information, such as the audio, interactions between users and songs, or associated genre metadata. Recently, contrastive learning has led to representations that generalize better compared to traditional supervised methods. In this paper, we present a novel approach that combines multiple types of information related to music using cross-modal contrastive learning, allowing us to learn an audio feature from heterogeneous data simultaneously. We align the latent representations obtained from playlists-track interactions, genre metadata, and the tracks' audio, by maximizing the agreement between these modality representations using a contrastive loss. We evaluate our approach in three tasks, namely, genre classification, playlist continuation and automatic tagging. We compare the performances with a baseline audio-based CNN trained to predict these modalities. We also study the importance of including multiple sources of information when training our embedding model. The results suggest that the proposed method outperforms the baseline in all the three downstream tasks and achieves comparable performance to the state-of-the-art.

* Accepted for publication to IEEE Signal Processing Letters

Via

Access Paper or Ask Questions

Melon Playlist Dataset: a public dataset for audio-based playlist generation and music tagging

Jan 30, 2021

Andres Ferraro, Yuntae Kim, Soohyeon Lee, Biho Kim, Namjun Jo, Semi Lim, Suyon Lim, Jungtaek Jang, Sehwan Kim, Xavier Serra(+1 more)

Figure 1 for Melon Playlist Dataset: a public dataset for audio-based playlist generation and music tagging

Figure 2 for Melon Playlist Dataset: a public dataset for audio-based playlist generation and music tagging

Figure 3 for Melon Playlist Dataset: a public dataset for audio-based playlist generation and music tagging

Figure 4 for Melon Playlist Dataset: a public dataset for audio-based playlist generation and music tagging

Abstract:One of the main limitations in the field of audio signal processing is the lack of large public datasets with audio representations and high-quality annotations due to restrictions of copyrighted commercial music. We present Melon Playlist Dataset, a public dataset of mel-spectrograms for 649,091tracks and 148,826 associated playlists annotated by 30,652 different tags. All the data is gathered from Melon, a popular Korean streaming service. The dataset is suitable for music information retrieval tasks, in particular, auto-tagging and automatic playlist continuation. Even though the latter can be addressed by collaborative filtering approaches, audio provides opportunities for research on track suggestions and building systems resistant to the cold-start problem, for which we provide a baseline. Moreover, the playlists and the annotations included in the Melon Playlist Dataset make it suitable for metric learning and representation learning.

* 2021 IEEE International Conference on Acoustics, Speech and Signal Processing

Via

Access Paper or Ask Questions