Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Meinard Müller

AUDIO LABS

Pitch Contour Exploration Across Audio Domains: A Vision-Based Transfer Learning Approach

Mar 24, 2025

Jakob Abeßer, Simon Schwär, Meinard Müller

Abstract:This study examines pitch contours as a unifying semantic construct prevalent across various audio domains including music, speech, bioacoustics, and everyday sounds. Analyzing pitch contours offers insights into the universal role of pitch in the perceptual processing of audio signals and contributes to a deeper understanding of auditory mechanisms in both humans and animals. Conventional pitch-tracking methods, while optimized for music and speech, face challenges in handling much broader frequency ranges and more rapid pitch variations found in other audio domains. This study introduces a vision-based approach to pitch contour analysis that eliminates the need for explicit pitch-tracking. The approach uses a convolutional neural network, pre-trained for object detection in natural images and fine-tuned with a dataset of synthetically generated pitch contours, to extract key contour parameters from the time-frequency representation of short audio segments. A diverse set of eight downstream tasks from four audio domains were selected to provide a challenging evaluation scenario for cross-domain pitch contour analysis. The results show that the proposed method consistently surpasses traditional techniques based on pitch-tracking on a wide range of tasks. This suggests that the vision-based approach establishes a foundation for comparative studies of pitch contour characteristics across diverse audio domains.

Via

Access Paper or Ask Questions

Model-Based Deep Learning for Music Information Research

Jun 17, 2024

Gael Richard, Vincent Lostanlen, Yi-Hsuan Yang, Meinard Müller

Abstract:In this article, we investigate the notion of model-based deep learning in the realm of music information research (MIR). Loosely speaking, we refer to the term model-based deep learning for approaches that combine traditional knowledge-based methods with data-driven techniques, especially those based on deep learning, within a diff erentiable computing framework. In music, prior knowledge for instance related to sound production, music perception or music composition theory can be incorporated into the design of neural networks and associated loss functions. We outline three specifi c scenarios to illustrate the application of model-based deep learning in MIR, demonstrating the implementation of such concepts and their potential.

* IEEE Signal Processing Magazine, In press

Via

Access Paper or Ask Questions

Performance Conditioning for Diffusion-Based Multi-Instrument Music Synthesis

Sep 21, 2023

Ben Maman, Johannes Zeitler, Meinard Müller, Amit H. Bermano

Abstract:Generating multi-instrument music from symbolic music representations is an important task in Music Information Retrieval (MIR). A central but still largely unsolved problem in this context is musically and acoustically informed control in the generation process. As the main contribution of this work, we propose enhancing control of multi-instrument synthesis by conditioning a generative model on a specific performance and recording environment, thus allowing for better guidance of timbre and style. Building on state-of-the-art diffusion-based music generative models, we introduce performance conditioning - a simple tool indicating the generative model to synthesize music with style and timbre of specific instruments taken from specific performances. Our prototype is evaluated using uncurated performances with diverse instrumentation and achieves state-of-the-art FAD realism scores while allowing novel timbre and style control. Our project page, including samples and demonstrations, is available at benadar293.github.io/midipm

* 5 pages, project page available at benadar293.github.io/midipm

Via

Access Paper or Ask Questions

Local Periodicity-Based Beat Tracking for Expressive Classical Piano Music

Aug 20, 2023

Ching-Yu Chiu, Meinard Müller, Matthew E. P. Davies, Alvin Wen-Yu Su, Yi-Hsuan Yang

Abstract:To model the periodicity of beats, state-of-the-art beat tracking systems use "post-processing trackers" (PPTs) that rely on several empirically determined global assumptions for tempo transition, which work well for music with a steady tempo. For expressive classical music, however, these assumptions can be too rigid. With two large datasets of Western classical piano music, namely the Aligned Scores and Performances (ASAP) dataset and a dataset of Chopin's Mazurkas (Maz-5), we report on experiments showing the failure of existing PPTs to cope with local tempo changes, thus calling for new methods. In this paper, we propose a new local periodicity-based PPT, called predominant local pulse-based dynamic programming (PLPDP) tracking, that allows for more flexible tempo transitions. Specifically, the new PPT incorporates a method called "predominant local pulses" (PLP) in combination with a dynamic programming (DP) component to jointly consider the locally detected periodicity and beat activation strength at each time instant. Accordingly, PLPDP accounts for the local periodicity, rather than relying on a global tempo assumption. Compared to existing PPTs, PLPDP particularly enhances the recall values at the cost of a lower precision, resulting in an overall improvement of F1-score for beat tracking in ASAP (from 0.473 to 0.493) and Maz-5 (from 0.595 to 0.838).

* Accepted to IEEE/ACM Transactions on Audio, Speech, and Language Processing (July 2023)

Via

Access Paper or Ask Questions

Stabilizing Training with Soft Dynamic Time Warping: A Case Study for Pitch Class Estimation with Weakly Aligned Targets

Aug 10, 2023

Johannes Zeitler, Simon Deniffel, Michael Krause, Meinard Müller

Figure 1 for Stabilizing Training with Soft Dynamic Time Warping: A Case Study for Pitch Class Estimation with Weakly Aligned Targets

Figure 2 for Stabilizing Training with Soft Dynamic Time Warping: A Case Study for Pitch Class Estimation with Weakly Aligned Targets

Figure 3 for Stabilizing Training with Soft Dynamic Time Warping: A Case Study for Pitch Class Estimation with Weakly Aligned Targets

Figure 4 for Stabilizing Training with Soft Dynamic Time Warping: A Case Study for Pitch Class Estimation with Weakly Aligned Targets

Abstract:Soft dynamic time warping (SDTW) is a differentiable loss function that allows for training neural networks from weakly aligned data. Typically, SDTW is used to iteratively compute and refine soft alignments that compensate for temporal deviations between the training data and its weakly annotated targets. One major problem is that a mismatch between the estimated soft alignments and the reference alignments in the early training stage leads to incorrect parameter updates, making the overall training procedure unstable. In this paper, we investigate such stability issues by considering the task of pitch class estimation from music recordings as an illustrative case study. In particular, we introduce and discuss three conceptually different strategies (a hyperparameter scheduling, a diagonal prior, and a sequence unfolding strategy) with the objective of stabilizing intermediate soft alignment results. Finally, we report on experiments that demonstrate the effectiveness of the strategies and discuss efficiency and implementation issues.

* Accepted for ISMIR 2023, Milano, Italy

Via

Access Paper or Ask Questions

Soft Dynamic Time Warping for Multi-Pitch Estimation and Beyond

Apr 11, 2023

Michael Krause, Christof Weiß, Meinard Müller

Figure 1 for Soft Dynamic Time Warping for Multi-Pitch Estimation and Beyond

Figure 2 for Soft Dynamic Time Warping for Multi-Pitch Estimation and Beyond

Figure 3 for Soft Dynamic Time Warping for Multi-Pitch Estimation and Beyond

Figure 4 for Soft Dynamic Time Warping for Multi-Pitch Estimation and Beyond

Abstract:Many tasks in music information retrieval (MIR) involve weakly aligned data, where exact temporal correspondences are unknown. The connectionist temporal classification (CTC) loss is a standard technique to learn feature representations based on weakly aligned training data. However, CTC is limited to discrete-valued target sequences and can be difficult to extend to multi-label problems. In this article, we show how soft dynamic time warping (SoftDTW), a differentiable variant of classical DTW, can be used as an alternative to CTC. Using multi-pitch estimation as an example scenario, we show that SoftDTW yields results on par with a state-of-the-art multi-label extension of CTC. In addition to being more elegant in terms of its algorithmic formulation, SoftDTW naturally extends to real-valued target sequences.

* Accepted at ICASSP 2023

Via

Access Paper or Ask Questions

An Analysis Method for Metric-Level Switching in Beat Tracking

Oct 13, 2022

Ching-Yu Chiu, Meinard Müller, Matthew E. P. Davies, Alvin Wen-Yu Su, Yi-Hsuan Yang

Figure 1 for An Analysis Method for Metric-Level Switching in Beat Tracking

Figure 2 for An Analysis Method for Metric-Level Switching in Beat Tracking

Figure 3 for An Analysis Method for Metric-Level Switching in Beat Tracking

Figure 4 for An Analysis Method for Metric-Level Switching in Beat Tracking

Abstract:For expressive music, the tempo may change over time, posing challenges to tracking the beats by an automatic model. The model may first tap to the correct tempo, but then may fail to adapt to a tempo change, or switch between several incorrect but perceptually plausible ones (e.g., half- or double-tempo). Existing evaluation metrics for beat tracking do not reflect such behaviors, as they typically assume a fixed relationship between the reference beats and estimated beats. In this paper, we propose a new performance analysis method, called annotation coverage ratio (ACR), that accounts for a variety of possible metric-level switching behaviors of beat trackers. The idea is to derive sequences of modified reference beats of all metrical levels for every two consecutive reference beats, and compare every sequence of modified reference beats to the subsequences of estimated beats. We show via experiments on three datasets of different genres the usefulness of ACR when utilized alongside existing metrics, and discuss the new insights to be gained.

* Accepted to IEEE Signal Processing Letters (Oct. 2022)

Via

Access Paper or Ask Questions

Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer

Nov 07, 2021

Yi-Jen Shih, Shih-Lun Wu, Frank Zalkow, Meinard Müller, Yi-Hsuan Yang

Figure 1 for Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer

Figure 2 for Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer

Figure 3 for Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer

Figure 4 for Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer

Abstract:Attention-based Transformer models have been increasingly employed for automatic music generation. To condition the generation process of such a model with a user-specified sequence, a popular approach is to take that conditioning sequence as a priming sequence and ask a Transformer decoder to generate a continuation. However, this prompt-based conditioning cannot guarantee that the conditioning sequence would develop or even simply repeat itself in the generated continuation. In this paper, we propose an alternative conditioning approach, called theme-based conditioning, that explicitly trains the Transformer to treat the conditioning sequence as a thematic material that has to manifest itself multiple times in its generation result. This is achieved with two main technical contributions. First, we propose a deep learning-based approach that uses contrastive representation learning and clustering to automatically retrieve thematic materials from music pieces in the training data. Second, we propose a novel gated parallel attention module to be used in a sequence-to-sequence (seq2seq) encoder/decoder architecture to more effectively account for a given conditioning thematic material in the generation process of the Transformer decoder. We report on objective and subjective evaluations of variants of the proposed Theme Transformer and the conventional prompt-based baseline, showing that our best model can generate, to some extent, polyphonic pop piano music with repetition and plausible variations of a given condition.

Via

Access Paper or Ask Questions

Towards Audio Domain Adaptation for Acoustic Scene Classification using Disentanglement Learning

Oct 26, 2021

Jakob Abeßer, Meinard Müller

Figure 1 for Towards Audio Domain Adaptation for Acoustic Scene Classification using Disentanglement Learning

Figure 2 for Towards Audio Domain Adaptation for Acoustic Scene Classification using Disentanglement Learning

Figure 3 for Towards Audio Domain Adaptation for Acoustic Scene Classification using Disentanglement Learning

Abstract:The deployment of machine listening algorithms in real-life applications is often impeded by a domain shift caused for instance by different microphone characteristics. In this paper, we propose a novel domain adaptation strategy based on disentanglement learning. The goal is to disentangle task-specific and domain-specific characteristics in the analyzed audio recordings. In particular, we combine two strategies: First, we apply different binary masks to internal embedding representations and, second, we suggest a novel combination of categorical cross-entropy and variance-based losses. Our results confirm the disentanglement of both tasks on an embedding level but show only minor improvement in the acoustic scene classification performance, when training data from both domains can be used. As a second finding, we can confirm the effectiveness of a state-of-the-art unsupervised domain adaptation strategy, which performs across-domain adaptation on a feature-level instead.

* submitted to ICASSP 2022

Via

Access Paper or Ask Questions

Unsupervised Domain Adaptation for Acoustic Scene Classification Using Band-Wise Statistics Matching

Apr 30, 2020

Alessandro Ilic Mezza, Emanuël A. P. Habets, Meinard Müller, Augusto Sarti

Figure 1 for Unsupervised Domain Adaptation for Acoustic Scene Classification Using Band-Wise Statistics Matching

Figure 2 for Unsupervised Domain Adaptation for Acoustic Scene Classification Using Band-Wise Statistics Matching

Figure 3 for Unsupervised Domain Adaptation for Acoustic Scene Classification Using Band-Wise Statistics Matching

Abstract:The performance of machine learning algorithms is known to be negatively affected by possible mismatches between training (source) and test (target) data distributions. In fact, this problem emerges whenever an acoustic scene classification system which has been trained on data recorded by a given device is applied to samples acquired under different acoustic conditions or captured by mismatched recording devices. To address this issue, we propose an unsupervised domain adaptation method that consists of aligning the first- and second-order sample statistics of each frequency band of target-domain acoustic scenes to the ones of the source-domain training dataset. This model-agnostic approach is devised to adapt audio samples from unseen devices before they are fed to a pre-trained classifier, thus avoiding any further learning phase. Using the DCASE 2018 Task 1-B development dataset, we show that the proposed method outperforms the state-of-the-art unsupervised methods found in the literature in terms of both source- and target-domain classification accuracy.

* 5 pages, 1 figure, 3 tables, submitted to EUSIPCO 2020

Via

Access Paper or Ask Questions