Abstract:Audio embeddings enable large-scale comparisons of the similarity of audio files for applications such as search and recommendation. Due to the subjectivity of audio similarity, it can be desirable to design systems that answer not only whether audio is similar, but similar in what way (e.g., with respect to tempo, mood, or genre). Previous works have proposed disentangled embedding spaces where subspaces representing specific, yet possibly correlated, attributes can be weighted to emphasize those attributes in downstream tasks. However, no research has been conducted into the independence of these subspaces, nor their manipulation, in order to retrieve tracks that are similar but different in a specific way. Here, we explore the manipulation of tempo in embedding spaces as a case study towards this goal. We propose tempo translation functions that allow for efficient manipulation of tempo within a pre-existing embedding space whilst maintaining other properties such as genre. As this translation is specific to tempo, it enables retrieval of tracks that are similar but have specifically different tempi. We show that such a function can be used as an efficient data augmentation strategy, both for training downstream tempo predictors and for improved nearest neighbor retrieval of properties largely independent of tempo.
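As a rough illustration of the idea (not the paper's implementation), the following PyTorch sketch assumes a frozen, pre-existing embedding model and a small learned translation network, here hypothetically named TempoTranslator, that shifts an embedding conditioned on a log tempo ratio while leaving tempo-independent content largely unchanged:

```python
import torch
import torch.nn as nn

class TempoTranslator(nn.Module):
    """Hypothetical tempo translation function: maps an embedding to the
    embedding it would have had at a different tempo, conditioned on the
    time-stretch ratio (illustrative sketch, not the paper's model)."""

    def __init__(self, emb_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim + 1, hidden_dim),  # embedding + log tempo ratio
            nn.ReLU(),
            nn.Linear(hidden_dim, emb_dim),
        )

    def forward(self, emb: torch.Tensor, ratio: torch.Tensor) -> torch.Tensor:
        # Residual translation aims to keep tempo-independent content
        # (e.g., genre) intact while moving only the tempo-related part.
        cond = torch.cat([emb, torch.log(ratio).unsqueeze(-1)], dim=-1)
        return emb + self.net(cond)

# Usage sketch: retrieve tracks similar to a query but ~20% faster.
# emb_query: (1, 512) embedding of the query from a frozen encoder.
# translated = TempoTranslator()(emb_query, torch.tensor([1.2]))
# Nearest neighbours of `translated` are then searched in the catalogue index.
```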
Abstract:Audio embeddings are crucial tools in understanding large catalogs of music. Typically, embeddings are evaluated on the basis of the performance they provide in a wide range of downstream tasks; however, few studies have investigated the local properties of the embedding spaces themselves, which are important for nearest neighbor algorithms commonly used in music search and recommendation. In this work, we show that when learning audio representations on music datasets via contrastive learning, musical properties that are typically homogeneous within a track (e.g., key and tempo) are reflected in the locality of neighborhoods in the resulting embedding space. By applying appropriate data augmentation strategies, the localisation of such properties can not only be reduced, but the localisation of other attributes can be increased. For example, the locality of features such as pitch and tempo, which are less relevant to non-expert listeners, may be mitigated while improving the locality of more salient features such as genre and mood, achieving state-of-the-art performance in nearest neighbor retrieval accuracy. Similarly, we show that the optimal selection of data augmentation strategies for contrastive learning of music audio embeddings is dependent on the downstream task, highlighting this as an important embedding design decision.
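As an illustration of such a strategy (parameter ranges and augmentation choices here are assumptions, not the paper's exact configuration), the sketch below uses librosa to build contrastive positive pairs whose tempo and pitch have been perturbed, encouraging invariance to, and hence reduced locality of, those properties:

```python
import numpy as np
import librosa

def make_positive_pair(y: np.ndarray, sr: int, rng: np.random.Generator):
    """Create an (anchor, positive) pair for contrastive training.
    Applying time-stretch and pitch-shift to the positive encourages the
    encoder to become invariant to tempo and pitch, reducing their locality
    in the embedding space (parameter ranges are illustrative)."""
    anchor = y.copy()
    positive = librosa.effects.time_stretch(y, rate=rng.uniform(0.8, 1.25))
    positive = librosa.effects.pitch_shift(positive, sr=sr,
                                           n_steps=rng.integers(-4, 5))
    return anchor, positive

# Each (anchor, positive) pair is then fed to a contrastive objective such as
# NT-Xent, with the other tracks in the batch acting as negatives.
```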
Abstract:This paper addresses the problem of global tempo estimation in musical audio. Given that annotating tempo is time-consuming and requires certain musical expertise, few publicly available data sources exist to train machine learning models for this task. Towards alleviating this issue, we propose a fully self-supervised approach that does not rely on any human-labeled data. Our method builds on the fact that generic (music) audio embeddings already encode a variety of properties, including information about tempo, making them easily adaptable for downstream tasks. While recent work in self-supervised tempo estimation aimed to learn a tempo-specific representation that was subsequently used to train a supervised classifier, we reformulate the task into the binary classification problem of predicting whether a target track has the same or a different tempo compared to a reference. Whereas the former still requires labeled training data for the final classification model, our approach uses arbitrary unlabeled music data in combination with time-stretching for model training, as well as a small set of synthetically created reference samples for predicting the final tempo. Evaluation of our approach in comparison with the state of the art reveals highly competitive performance when the constraint of finding the precise tempo octave is relaxed.
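A minimal sketch of the pair-construction step, assuming unlabeled audio and librosa time-stretching (the stretch ranges and the inference procedure described in the comments are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np
import librosa

def sample_tempo_pair(y: np.ndarray, rng: np.random.Generator):
    """Build one training example for the binary same/different-tempo task
    from unlabeled audio, using time-stretching only (no tempo labels)."""
    ref = librosa.effects.time_stretch(y, rate=rng.uniform(0.7, 1.4))
    if rng.random() < 0.5:
        return ref, ref.copy(), 1              # same tempo
    ratio = rng.uniform(1.05, 1.3)
    if rng.random() < 0.5:
        ratio = 1.0 / ratio                    # allow slower as well as faster
    target = librosa.effects.time_stretch(ref, rate=ratio)
    return ref, target, 0                      # different tempo

# At inference, the trained comparator scores the target track against a small
# bank of synthetic reference clips of known tempi; the reference with the
# highest "same tempo" score yields the final tempo estimate.
```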
Abstract:To model the periodicity of beats, state-of-the-art beat tracking systems use "post-processing trackers" (PPTs) that rely on several empirically determined global assumptions for tempo transition, which work well for music with a steady tempo. For expressive classical music, however, these assumptions can be too rigid. With two large datasets of Western classical piano music, namely the Aligned Scores and Performances (ASAP) dataset and a dataset of Chopin's Mazurkas (Maz-5), we report on experiments showing the failure of existing PPTs to cope with local tempo changes, thus calling for new methods. In this paper, we propose a new local periodicity-based PPT, called predominant local pulse-based dynamic programming (PLPDP) tracking, that allows for more flexible tempo transitions. Specifically, the new PPT incorporates a method called "predominant local pulses" (PLP) in combination with a dynamic programming (DP) component to jointly consider the locally detected periodicity and beat activation strength at each time instant. Accordingly, PLPDP accounts for the local periodicity, rather than relying on a global tempo assumption. Compared to existing PPTs, PLPDP particularly enhances the recall values at the cost of a lower precision, resulting in an overall improvement of F1-score for beat tracking in ASAP (from 0.473 to 0.493) and Maz-5 (from 0.595 to 0.838).
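The following simplified dynamic-programming sketch conveys the spirit of PLPDP rather than its exact formulation: the transition cost at each frame is derived from a locally estimated period (e.g., obtained from PLP analysis) instead of a single global tempo; the cost function and all parameters are assumptions for illustration only:

```python
import numpy as np

def plpdp_like_tracking(activation: np.ndarray, local_period: np.ndarray,
                        penalty: float = 5.0) -> np.ndarray:
    """Simplified DP beat tracker in the spirit of PLPDP: at frame t the
    transition penalizes deviation from local_period[t] (in frames), jointly
    with the beat activation strength. Illustrative sketch only."""
    n = len(activation)
    score = activation.astype(float)
    backlink = np.full(n, -1, dtype=int)
    for t in range(1, n):
        p = max(int(local_period[t]), 1)
        lo, hi = max(0, t - 2 * p), max(0, t - p // 2)
        if lo >= hi:
            continue
        prev = np.arange(lo, hi)
        trans = -penalty * np.log((t - prev) / p) ** 2   # local periodicity cost
        cand = score[prev] + trans
        best = int(np.argmax(cand))
        score[t] = activation[t] + cand[best]
        backlink[t] = prev[best]
    # Backtrace from the best-scoring frame to obtain the beat sequence.
    beats, t = [], int(np.argmax(score))
    while t >= 0:
        beats.append(t)
        t = backlink[t]
    return np.array(beats[::-1])
```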
Abstract:Self-supervision methods learn representations by solving pretext tasks that do not require human-generated labels, alleviating the need for time-consuming annotations. These methods have been applied in computer vision, natural language processing, environmental sound analysis, and recently in music information retrieval, e.g. for pitch estimation. Particularly in the context of music, there are few insights into how fragile these models are with respect to different data distributions, and how such fragility could be mitigated. In this paper, we explore these questions by dissecting a self-supervised model for pitch estimation adapted for tempo estimation via rigorous experimentation with synthetic data. Specifically, we study the relationship between the input representation and data distribution for self-supervised tempo estimation.
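As an example of the kind of controlled synthetic data such a study can rely on (the exact generation protocol here is an assumption), a click track with a known ground-truth tempo can be produced as follows, and tempo, click timbre, or added noise can then be swept to probe the model under different data distributions:

```python
import numpy as np

def synthetic_click_track(tempo_bpm: float, sr: int = 22050,
                          duration: float = 10.0) -> np.ndarray:
    """Generate a click track with a known ground-truth tempo for probing a
    self-supervised tempo model on controlled synthetic distributions."""
    n = int(duration * sr)
    y = np.zeros(n, dtype=np.float32)
    period = int(round(60.0 / tempo_bpm * sr))          # samples per beat
    click = np.hanning(256).astype(np.float32)          # short, soft click
    for onset in range(0, n - len(click), period):
        y[onset:onset + len(click)] += click
    return y
```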
Abstract:For expressive music, the tempo may change over time, posing challenges for automatic beat tracking models. A model may first tap to the correct tempo, but then fail to adapt to a tempo change, or switch between several incorrect but perceptually plausible tempi (e.g., half- or double-tempo). Existing evaluation metrics for beat tracking do not reflect such behaviors, as they typically assume a fixed relationship between the reference beats and estimated beats. In this paper, we propose a new performance analysis method, called annotation coverage ratio (ACR), that accounts for a variety of possible metric-level switching behaviors of beat trackers. The idea is to derive sequences of modified reference beats of all metrical levels for every two consecutive reference beats, and compare every sequence of modified reference beats to the subsequences of estimated beats. Via experiments on three datasets of different genres, we show the usefulness of ACR when utilized alongside existing metrics, and discuss the new insights to be gained.
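A simplified sketch of the underlying idea, assuming beat times in seconds, a fixed matching tolerance, and only a small subset of metrical-level variants (the full method considers more levels and proper subsequence matching):

```python
import numpy as np

def interval_variants(b0: float, b1: float):
    """Modified reference beats between two consecutive annotated beats,
    covering an illustrative subset of metrical levels."""
    period = b1 - b0
    return [
        np.array([b0, b1]),                        # annotated level
        np.array([b0, b0 + period / 2, b1]),       # double tempo
        np.array([b0 + period / 2]),               # offbeat
    ]

def annotation_coverage_ratio(reference: np.ndarray, estimated: np.ndarray,
                              tol: float = 0.07) -> float:
    """Simplified ACR-style score: the fraction of consecutive reference-beat
    intervals for which at least one metrical-level variant is fully matched
    by estimated beats (within +/- tol seconds)."""
    covered = 0
    for b0, b1 in zip(reference[:-1], reference[1:]):
        for variant in interval_variants(b0, b1):
            if all(np.any(np.abs(estimated - v) <= tol) for v in variant):
                covered += 1
                break
    return covered / max(len(reference) - 1, 1)
```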
Abstract:In this paper we present a new approach for the generation of multi-instrument symbolic music driven by musical emotion. The principal novelty of our approach centres on conditioning a state-of-the-art transformer on continuous-valued valence and arousal labels. In addition, we provide a new large-scale dataset of symbolic music paired with emotion labels in terms of valence and arousal. We evaluate our approach quantitatively in two ways: first by measuring its note prediction accuracy, and second via a regression task in the valence-arousal plane. Our results demonstrate that our proposed approaches outperform conditioning using control tokens, which is representative of the current state of the art.
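A minimal sketch of continuous-valued conditioning, assuming a standard transformer input pipeline (module and dimension names are hypothetical, not the paper's architecture):

```python
import torch
import torch.nn as nn

class EmotionConditioning(nn.Module):
    """Sketch of continuous-valued conditioning: project (valence, arousal)
    into the model dimension and prepend it to the token embeddings, instead
    of quantizing emotions into discrete control tokens."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.proj = nn.Linear(2, d_model)

    def forward(self, token_emb: torch.Tensor, va: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, d_model); va: (batch, 2) in [-1, 1]
        cond = self.proj(va).unsqueeze(1)           # (batch, 1, d_model)
        return torch.cat([cond, token_emb], dim=1)  # conditioned input sequence
```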
Abstract:In this paper, we address a sub-topic of the broad domain of audio enhancement, namely musical audio bandwidth extension. We formulate the bandwidth extension problem using deep neural networks, where a band-limited signal is provided as input to the network, with the goal of reconstructing a full-bandwidth output. Our main contribution centers on the impact of the choice of low-pass filter when training and subsequently testing the network. For two different state-of-the-art deep architectures, ResNet and U-Net, we demonstrate that when the training and testing filters are matched, improvements in signal-to-noise ratio (SNR) of up to 7 dB can be obtained. However, when these filters differ, the improvement falls considerably and under some training conditions results in a lower SNR than the band-limited input. To circumvent this apparent overfitting to filter shape, we propose a data augmentation strategy that utilizes multiple low-pass filters during training and leads to improved generalization to unseen filtering conditions at test time.
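A sketch of such an augmentation, assuming scipy.signal filters and illustrative parameter ranges (the specific filter families, orders, and cutoffs used in the paper may differ):

```python
import numpy as np
from scipy.signal import butter, cheby1, ellip, sosfiltfilt

def random_lowpass(y: np.ndarray, sr: int, rng: np.random.Generator) -> np.ndarray:
    """Band-limit a training example with a randomly chosen low-pass filter
    (family, order, and cutoff), so the network does not overfit to a single
    filter shape. Parameter ranges are illustrative."""
    cutoff = rng.uniform(0.2, 0.8) * (sr / 2)          # Hz, below Nyquist
    order = int(rng.integers(4, 9))
    family = rng.choice(["butter", "cheby1", "ellip"])
    if family == "butter":
        sos = butter(order, cutoff, btype="low", fs=sr, output="sos")
    elif family == "cheby1":
        sos = cheby1(order, 1.0, cutoff, btype="low", fs=sr, output="sos")
    else:
        sos = ellip(order, 1.0, 60.0, cutoff, btype="low", fs=sr, output="sos")
    return sosfiltfilt(sos, y)
```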