Benno Weck

CrossMuSim: A Cross-Modal Framework for Music Similarity Retrieval with LLM-Powered Text Description Sourcing and Mining

Mar 29, 2025

Tristan Tsoi, Jiajun Deng, Yaolong Ju, Benno Weck, Holger Kirchhoff, Simon Lui

Abstract: Music similarity retrieval is fundamental for managing and exploring relevant content from large collections in streaming platforms. This paper presents a novel cross-modal contrastive learning framework that leverages the open-ended nature of text descriptions to guide music similarity modeling, addressing the limitations of traditional uni-modal approaches in capturing complex musical relationships. To overcome the scarcity of high-quality text-music paired data, this paper introduces a dual-source data acquisition approach combining online scraping and LLM-based prompting, where carefully designed prompts leverage LLMs' comprehensive music knowledge to generate contextually rich descriptions. Extensive experiments demonstrate that the proposed framework achieves significant performance improvements over existing benchmarks through objective metrics, subjective evaluations, and real-world A/B testing on the Huawei Music streaming platform.

* Accepted by ICME2025
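
The cross-modal contrastive objective mentioned in the abstract can be illustrated with a generic symmetric InfoNCE-style loss. This is only a minimal sketch and not the paper's implementation: the function names, the temperature of 0.07, and the two-dimensional toy embeddings are assumptions made for illustration, and embeddings are assumed to be non-zero vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a non-zero vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    # Negative log-softmax of the target entry, computed in a numerically stable way.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    # Pair i of audio and text is the positive; all other pairs in the batch are negatives.
    audio = [normalize(a) for a in audio_embs]
    text = [normalize(t) for t in text_embs]
    n = len(audio)
    sim = [[dot(a, t) / temperature for t in text] for a in audio]
    audio_to_text = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    text_to_audio = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (audio_to_text + text_to_audio)

# Toy batch (hypothetical values): matching pairs should give a lower loss
# than the same embeddings with the text side shuffled.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the aligned batch yields a loss close to zero, while the shuffled batch is penalised heavily, which is the behaviour a contrastive objective relies on.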

The language of sound search: Examining User Queries in Audio Search Engines

Oct 10, 2024

Benno Weck, Frederic Font

Abstract: This study examines textual, user-written search queries within the context of sound search engines, encompassing various applications such as foley, sound effects, and general audio retrieval. Current research inadequately addresses real-world user needs and behaviours in designing text-based audio retrieval systems. To bridge this gap, we analysed search queries from two sources: a custom survey and Freesound website query logs. The survey was designed to collect queries for an unrestricted, hypothetical sound search engine, resulting in a dataset that captures user intentions without the constraints of existing systems. This dataset is also made available for sharing with the research community. In contrast, the Freesound query logs encompass approximately 9 million search requests, providing a comprehensive view of real-world usage patterns. Our findings indicate that survey queries are generally longer than Freesound queries, suggesting users prefer detailed queries when not limited by system constraints. Both datasets predominantly feature keyword-based queries, with few survey participants using full sentences. Key factors influencing survey queries include the primary sound source, intended usage, perceived location, and the number of sound sources. These insights are crucial for developing user-centred, effective text-based audio retrieval systems, enhancing our understanding of user behaviour in sound search contexts.

* Accepted at DCASE 2024. Supplementary materials at https://doi.org/10.5281/zenodo.13622537

The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines?

Sep 03, 2024

Pedro Ramoneda, Emilia Parada-Cabaleiro, Benno Weck, Xavier Serra

Abstract: In this work, we explore the use and reliability of Large Language Models (LLMs) in musicology. From a discussion with experts and students, we assess the current acceptance and concerns regarding this, nowadays ubiquitous, technology. We aim to go one step further, proposing a semi-automatic method to create an initial benchmark using retrieval-augmented generation models and multiple-choice question generation, validated by human experts. Our evaluation on 400 human-validated questions shows that current vanilla LLMs are less reliable than retrieval-augmented generation from music dictionaries. This paper suggests that the potential of LLMs in musicology requires musicology-driven research that can specialize LLMs by including accurate and reliable domain knowledge.

MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

Aug 02, 2024

Benno Weck, Ilaria Manco, Emmanouil Benetos, Elio Quinton, George Fazekas, Dmitry Bogdanov

Abstract: Multimodal models that jointly process audio and language hold great promise in audio understanding and are increasingly being adopted in the music domain. By allowing users to query via text and obtain information about a given audio input, these models have the potential to enable a variety of music understanding tasks via language-based interfaces. However, their evaluation poses considerable challenges, and it remains unclear how to effectively assess their ability to correctly interpret music-related inputs with current methods. Motivated by this, we introduce MuChoMusic, a benchmark for evaluating music understanding in multimodal language models focused on audio. MuChoMusic comprises 1,187 multiple-choice questions, all validated by human annotators, on 644 music tracks sourced from two publicly available music datasets, and covering a wide variety of genres. Questions in the benchmark are crafted to assess knowledge and reasoning abilities across several dimensions that cover fundamental musical concepts and their relation to cultural and functional contexts. Through the holistic analysis afforded by the benchmark, we evaluate five open-source models and identify several pitfalls, including an over-reliance on the language modality, pointing to a need for better multimodal integration. Data and code are open-sourced.

* Accepted at ISMIR 2024. Data: https://doi.org/10.5281/zenodo.12709974 Code: https://github.com/mulab-mir/muchomusic Supplementary material: https://mulab-mir.github.io/muchomusic

WikiMuTe: A web-sourced dataset of semantic descriptions for music audio

Dec 14, 2023

Benno Weck, Holger Kirchhoff, Peter Grosche, Xavier Serra

Abstract: Multi-modal deep learning techniques for matching free-form text with music have shown promising results in the field of Music Information Retrieval (MIR). Prior work is often based on large proprietary data while publicly available datasets are few and small in size. In this study, we present WikiMuTe, a new and open dataset containing rich semantic descriptions of music. The data is sourced from Wikipedia's rich catalogue of articles covering musical works. Using a dedicated text-mining pipeline, we extract both long and short-form descriptions covering a wide range of topics related to music content such as genre, style, mood, instrumentation, and tempo. To show the use of this data, we train a model that jointly learns text and audio representations and performs cross-modal retrieval. The model is evaluated on two tasks: tag-based music retrieval and music auto-tagging. The results show that our approach achieves state-of-the-art performance on multiple tasks, but we still observe a difference in performance depending on the data used for training.

* Submitted to 30th International Conference on MultiMedia Modeling (MMM2024). This preprint has not undergone peer review or any post-submission improvements or corrections

The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation

Nov 22, 2023

Ilaria Manco, Benno Weck, SeungHeon Doh, Minz Won, Yixiao Zhang, Dmitry Bogdanov, Yusong Wu, Ke Chen, Philip Tovstogan, Emmanouil Benetos (+3 more)

Abstract: We introduce the Song Describer dataset (SDD), a new crowdsourced corpus of high-quality audio-caption pairs, designed for the evaluation of music-and-language models. The dataset consists of 1.1k human-written natural language descriptions of 706 music recordings, all publicly accessible and released under Creative Commons licenses. To showcase the use of our dataset, we benchmark popular models on three key music-and-language tasks (music captioning, text-to-music generation and music-language retrieval). Our experiments highlight the importance of cross-dataset evaluation and offer insights into how researchers can use SDD to gain a broader understanding of model performance.

* Accepted to NeurIPS 2023 Workshop on Machine Learning for Audio

Data leakage in cross-modal retrieval training: A case study

Feb 23, 2023

Benno Weck, Xavier Serra

Abstract: The recent progress in text-based audio retrieval was largely propelled by the release of suitable datasets. Since the manual creation of such datasets is a laborious task, obtaining data from online resources can be a cheap solution to create large-scale datasets. We study the recently proposed SoundDesc benchmark dataset, which was automatically sourced from the BBC Sound Effects web page. In our analysis, we find that SoundDesc contains several duplicates that cause leakage of training data to the evaluation data. This data leakage ultimately leads to overly optimistic retrieval performance estimates in previous benchmarks. We propose new training, validation, and testing splits for the dataset that we make available online. To avoid weak contamination of the test data, we pool audio files that share similar recording setups. In our experiments, we find that the new splits serve as a more challenging benchmark.

* 5 pages. Accepted at ICASSP2023
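
The idea of pooling related recordings so that they cannot straddle the training and test data can be sketched as a group-aware split. This is a minimal illustration and not the authors' procedure: the `group_split` helper, the `session` labels standing in for "similar recording setups", and the 20% test fraction are all hypothetical.

```python
import random
from collections import defaultdict

def group_split(items, group_of, test_fraction=0.2, seed=0):
    # Collect items by group key so that a whole group is assigned to one side only.
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)

    # Shuffle the group keys deterministically, then fill the test set group by group.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    target = test_fraction * len(items)

    train, test = [], []
    for key in keys:
        bucket = test if len(test) < target else train
        bucket.extend(groups[key])
    return train, test

# Hypothetical data: 20 clips recorded in 5 sessions (the grouping key).
clips = [("clip%d" % i, "session%d" % (i % 5)) for i in range(20)]
train, test = group_split(clips, group_of=lambda clip: clip[1])

train_groups = {group for _, group in train}
test_groups = {group for _, group in test}
```

Because assignment happens per group, no session appears on both sides of the split, which is the property that guards against the kind of leakage described in the abstract. The test set size is only approximate, since whole groups are moved at once.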

Matching Text and Audio Embeddings: Exploring Transfer-learning Strategies for Language-based Audio Retrieval

Oct 06, 2022

Benno Weck, Miguel Pérez Fernández, Holger Kirchhoff, Xavier Serra

Abstract: We present an analysis of large-scale pretrained deep learning models used for cross-modal (text-to-audio) retrieval. We use embeddings extracted by these models in a metric learning framework to connect matching pairs of audio and text. Shallow neural networks map the embeddings to a common dimensionality. Our system, which is an extension of our submission to the Language-based Audio Retrieval Task of the DCASE Challenge 2022, employs the RoBERTa foundation model as the text embedding extractor. A pretrained PANNs model extracts the audio embeddings. To improve the generalisation of our model, we investigate how pretraining with audio and associated noisy text collected from the online platform Freesound improves the performance of our method. Furthermore, our ablation study reveals that the proper choice of the loss function and fine-tuning the pretrained models are essential in training a competitive retrieval system.

* 5 pages, 2 figures. Accepted at Detection and Classification of Acoustic Scenes and Events 2022 (DCASE2022)
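
The retrieval step described in the abstract, where text and audio embeddings of different sizes are mapped to a common dimensionality and then compared, can be sketched as follows. This is a simplified illustration under stated assumptions: a single linear map stands in for each shallow network, the weights are hand-picked rather than learned, cosine similarity is assumed as the comparison, and all names and values are hypothetical.

```python
import math

def linear(matrix, vec):
    # Apply a linear map (list of rows) to a vector.
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def rank_audio(text_emb, audio_embs, w_text, w_audio):
    # Project the query and every audio item into the shared space,
    # then return audio indices ordered from most to least similar.
    query = linear(w_text, text_emb)
    scores = [cosine(query, linear(w_audio, a)) for a in audio_embs]
    return sorted(range(len(audio_embs)), key=lambda i: scores[i], reverse=True)

# Toy setup: 3-dimensional text embeddings and 4-dimensional audio embeddings
# are both projected to a shared 2-dimensional space.
w_text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_audio = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

text_query = [1.0, 0.0, 5.0]
audio_items = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.1, 0.0, 0.0],
    [0.5, 0.5, 9.0, 9.0],
]
ranking = rank_audio(text_query, audio_items, w_text, w_audio)
```

In a trained system the projection weights would be learned with a metric learning loss so that matching audio and text pairs end up close together in the shared space.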

Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning

Oct 14, 2021

Benno Weck, Xavier Favory, Konstantinos Drossos, Xavier Serra

Abstract: Automated audio captioning (AAC) is the task of automatically generating textual descriptions for general audio signals. A captioning system has to identify various information from the input signal and express it with natural language. Existing works mainly focus on investigating new methods and try to improve their performance measured on existing datasets. Having attracted attention only recently, very few works on AAC study the performance of existing pre-trained audio and natural language processing resources. In this paper, we evaluate the performance of off-the-shelf models with a Transformer-based captioning approach. We utilize the freely available Clotho dataset to compare four different pre-trained machine listening models, four word embedding models, and their combinations in many different settings. Our evaluation suggests that YAMNet combined with BERT embeddings produces the best captions. Moreover, in general, fine-tuning pre-trained word embeddings can lead to better performance. Finally, we show that sequences of audio embeddings can be processed using a Transformer encoder to produce higher-quality captions.

* 5 pages, 4 figures. Accepted at Detection and Classification of Acoustic Scenes and Events 2021 (DCASE2021)
