Abstract:We introduce PANAMA, an active learning framework for the training of end-to-end parametric guitar amp models using a WaveNet-like architecture. With \model, one can create a virtual amp by recording samples that are determined by an active learning strategy to use a minimum amount of datapoints (i.e., amp knob settings). We show that gradient-based optimization algorithms can be used to determine the optimal datapoints to sample, and that the approach helps under a constrained number of samples.
Abstract:Recent advancements have brought generated music closer to human-created compositions, yet evaluating these models remains challenging. While human preference is the gold standard for assessing quality, translating these subjective judgments into objective metrics, particularly for text-audio alignment and music quality, has proven difficult. In this work, we generate 6k songs using 12 state-of-the-art models and conduct a survey of 15k pairwise audio comparisons with 2.5k human participants to evaluate the correlation between human preferences and widely used metrics. To the best of our knowledge, this work is the first to rank current state-of-the-art music generation models and metrics based on human preference. To further the field of subjective metric evaluation, we provide open access to our dataset of generated music and human evaluations.
Abstract:While neural vocoders have made significant progress in high-fidelity speech synthesis, their application on polyphonic music has remained underexplored. In this work, we propose DisCoder, a neural vocoder that leverages a generative adversarial encoder-decoder architecture informed by a neural audio codec to reconstruct high-fidelity 44.1 kHz audio from mel spectrograms. Our approach first transforms the mel spectrogram into a lower-dimensional representation aligned with the Descript Audio Codec (DAC) latent space before reconstructing it to an audio signal using a fine-tuned DAC decoder. DisCoder achieves state-of-the-art performance in music synthesis on several objective metrics and in a MUSHRA listening study. Our approach also shows competitive performance in speech synthesis, highlighting its potential as a universal vocoder.
Abstract:We introduce Audio Atlas, an interactive web application for visualizing audio data using text-audio embeddings. Audio Atlas is designed to facilitate the exploration and analysis of audio datasets using a contrastive embedding model and a vector database for efficient data management and semantic search. The system maps audio embeddings into a two-dimensional space and leverages DeepScatter for dynamic visualization. Designed for extensibility, Audio Atlas allows easy integration of new datasets, enabling users to better understand their audio data and identify both patterns and outliers. We open-source the codebase of Audio Atlas, and provide an initial implementation containing various audio and music datasets.
Abstract:Neural audio codecs have recently gained popularity because they can represent audio signals with high fidelity at very low bitrates, making it feasible to use language modeling approaches for audio generation and understanding. Residual Vector Quantization (RVQ) has become the standard technique for neural audio compression using a cascade of VQ codebooks. This paper proposes the Multi-Scale Neural Audio Codec, a simple extension of RVQ where the quantizers can operate at different temporal resolutions. By applying a hierarchy of quantizers at variable frame rates, the codec adapts to the audio structure across multiple timescales. This leads to more efficient compression, as demonstrated by extensive objective and subjective evaluations. The code and model weights are open-sourced at https://github.com/hubertsiuzdak/snac.
Abstract:Music recommender systems frequently utilize network-based models to capture relationships between music pieces, artists, and users. Although these relationships provide valuable insights for predictions, new music pieces or artists often face the cold-start problem due to insufficient initial information. To address this, one can extract content-based information directly from the music to enhance collaborative-filtering-based methods. While previous approaches have relied on hand-crafted audio features for this purpose, we explore the use of contrastively pretrained neural audio embedding models, which offer a richer and more nuanced representation of music. Our experiments demonstrate that neural embeddings, particularly those generated with the Contrastive Language-Audio Pretraining (CLAP) model, present a promising approach to enhancing music recommendation tasks within graph-based frameworks.
Abstract:Image datasets serve as the foundation for machine learning models in computer vision, significantly influencing model capabilities, performance, and biases alongside architectural considerations. Therefore, understanding the composition and distribution of these datasets has become increasingly crucial. To address the need for intuitive exploration of these datasets, we propose AEye, an extensible and scalable visualization tool tailored to image datasets. AEye utilizes a contrastively trained model to embed images into semantically meaningful high-dimensional representations, facilitating data clustering and organization. To visualize the high-dimensional representations, we project them onto a two-dimensional plane and arrange images in layers so users can seamlessly navigate and explore them interactively. AEye facilitates semantic search functionalities for both text and image queries, enabling users to search for content. We open-source the codebase for AEye, and provide a simple configuration to add datasets.
Abstract:Cue points indicate possible temporal boundaries in a transition between two pieces of music in DJ mixing and constitute a crucial element in autonomous DJ systems as well as for live mixing. In this work, we present a novel method for automatic cue point estimation, interpreted as a computer vision object detection task. Our proposed system is based on a pre-trained object detection transformer which we fine-tune on our novel cue point dataset. Our provided dataset contains 21k manually annotated cue points from human experts as well as metronome information for nearly 5k individual tracks, making this dataset 35x larger than the previously available cue point dataset. Unlike previous methods, our approach does not require low-level musical information analysis, while demonstrating increased precision in retrieving cue point positions. Moreover, our proposed method demonstrates high adherence to phrasing, a type of high-level music structure commonly emphasized in electronic dance music. The code, model checkpoints, and dataset are made publicly available.
Abstract:Algorithmic reasoning is a fundamental cognitive ability that plays a pivotal role in problem-solving and decision-making processes. Reinforcement Learning (RL) has demonstrated remarkable proficiency in tasks such as motor control, handling perceptual input, and managing stochastic environments. These advancements have been enabled in part by the availability of benchmarks. In this work we introduce PUZZLES, a benchmark based on Simon Tatham's Portable Puzzle Collection, aimed at fostering progress in algorithmic and logical reasoning in RL. PUZZLES contains 40 diverse logic puzzles of adjustable sizes and varying levels of complexity; many puzzles also feature a diverse set of additional configuration parameters. The 40 puzzles provide detailed information on the strengths and generalization capabilities of RL agents. Furthermore, we evaluate various RL algorithms on PUZZLES, providing baseline comparisons and demonstrating the potential for future research. All the software, including the environment, is available at https://github.com/ETH-DISCO/rlp.
Abstract:Reinforcement learning (RL) has gained popularity in the realm of recommender systems due to its ability to optimize long-term rewards and guide users in discovering relevant content. However, the successful implementation of RL in recommender systems is challenging because of several factors, including the limited availability of online data for training on-policy methods. This scarcity requires expensive human interaction for online model training. Furthermore, the development of effective evaluation frameworks that accurately reflect the quality of models remains a fundamental challenge in recommender systems. To address these challenges, we propose a comprehensive framework for synthetic environments that simulate human behavior by harnessing the capabilities of large language models (LLMs). We complement our framework with in-depth ablation studies and demonstrate its effectiveness with experiments on movie and book recommendations. By utilizing LLMs as synthetic users, this work introduces a modular and novel framework for training RL-based recommender systems. The software, including the RL environment, is publicly available.