Paul Primus

TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining

May 12, 2025

Paul Primus, Florian Schmid, Gerhard Widmer

Abstract: Learning to associate audio with textual descriptions is valuable for a range of tasks, including pretraining, zero-shot classification, audio retrieval, audio captioning, and text-conditioned audio generation. Existing contrastive language-audio pretrained models are typically trained using global, clip-level descriptions, which provide only weak temporal supervision. We hypothesize that CLAP-like language-audio models, particularly if they are expected to produce frame-level embeddings, can benefit from stronger temporal supervision. To confirm our hypothesis, we curate a novel dataset of approximately 12,000 audio recordings from Freesound, each annotated with single-sentence free-text descriptions linked to a specific temporal segment in an audio recording. We use large language models to clean these annotations by removing references to non-audible events, transcribed speech, typos, and annotator language bias. We further propose a frame-wise contrastive training strategy that learns to align text descriptions with temporal regions in an audio recording and demonstrate that our model has better temporal text-audio alignment abilities compared to models trained only on global captions when evaluated on the AudioSet Strong benchmark. The dataset and our source code are available on Zenodo and GitHub, respectively.

* submitted to the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2025. Dataset (Zenodo): https://zenodo.org/records/15379789, Implementation (GitHub): https://github.com/OptimusPrimus/tacos

Effective Pre-Training of Audio Transformers for Sound Event Detection

Sep 14, 2024

Florian Schmid, Tobias Morocutti, Francesco Foscarin, Jan Schlüter, Paul Primus, Gerhard Widmer

Abstract: We propose a pre-training pipeline for audio spectrogram transformers for frame-level sound event detection tasks. On top of common pre-training steps, we add a meticulously designed training routine on AudioSet frame-level annotations. This includes a balanced sampler, aggressive data augmentation, and ensemble knowledge distillation. For five transformers, we obtain a substantial performance improvement over previously available checkpoints both on AudioSet frame-level predictions and on frame-level sound event detection downstream tasks, confirming our pipeline's effectiveness. We publish the resulting checkpoints that researchers can directly fine-tune to build high-performance models for sound event detection tasks.

* Submitted to ICASSP'25. Source code available: https://github.com/fschmid56/PretrainedSED

Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval

Aug 21, 2024

Paul Primus, Florian Schmid, Gerhard Widmer

Abstract: Dual-encoder-based audio retrieval systems are commonly optimized with contrastive learning on a set of matching and mismatching audio-caption pairs. This leads to a shared embedding space in which corresponding items from the two modalities end up close together. Since audio-caption datasets typically only contain matching pairs of recordings and descriptions, it has become common practice to create mismatching pairs by pairing the audio with a caption randomly drawn from the dataset. This is not ideal because the randomly sampled caption could, just by chance, partly or entirely describe the audio recording. However, correspondence information for all possible pairs is costly to annotate and thus typically unavailable; we, therefore, suggest substituting it with estimated correspondences. To this end, we propose a two-stage training procedure in which multiple retrieval models are first trained as usual, i.e., without estimated correspondences. In the second stage, the audio-caption correspondences predicted by these models then serve as prediction targets. We evaluate our method on the ClothoV2 and AudioCaps benchmarks and show that it improves retrieval performance, even in a restrictive self-distillation setting where a single model generates and then learns from the estimated correspondences. We further show that our method outperforms the current state of the art by 1.6 pp. mAP@10 on the ClothoV2 benchmark.

* In Proceedings of the 9th Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE, Tokyo, Japan, 2024. Implementation available on GitHub: https://github.com/OptimusPrimus/salsa
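
The second training stage described in the abstract above can be illustrated with a small sketch. It is not the authors' implementation: it assumes that the estimated correspondences are obtained by applying a temperature-scaled softmax to a teacher model's audio-caption similarity scores, and that the student is trained with a cross-entropy loss against these soft targets. The function names and the temperature value are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of similarity scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_sims, teacher_sims, temperature=0.1):
    """Mean cross-entropy between teacher-estimated correspondences
    (soft targets) and the student's predicted distribution.

    Both arguments are matrices (lists of rows): row i holds the
    similarities between audio i and every caption in the batch.
    The softmax-based targets are an assumption of this sketch.
    """
    loss = 0.0
    for s_row, t_row in zip(student_sims, teacher_sims):
        targets = softmax(t_row, temperature)
        preds = softmax(s_row, temperature)
        loss -= sum(t * math.log(p) for t, p in zip(targets, preds))
    return loss / len(student_sims)
```

In this sketch a caption that the teacher scores as partly matching a recording receives a non-zero target instead of being treated as a pure negative, which is the issue with random mismatching pairs that the abstract points out.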

Improving Query-by-Vocal Imitation with Contrastive Learning and Audio Pretraining

Aug 21, 2024

Jonathan Greif, Florian Schmid, Paul Primus, Gerhard Widmer

Abstract: Query-by-Vocal Imitation (QBV) is about searching audio files within databases using vocal imitations created by the user's voice. Since most humans can effectively communicate sound concepts through voice, QBV offers a more intuitive and convenient approach than text-based search. To fully leverage QBV, developing robust audio feature representations for both the vocal imitation and the original sound is crucial. In this paper, we present a new system for QBV that utilizes the feature extraction capabilities of Convolutional Neural Networks pre-trained with large-scale general-purpose audio datasets. We integrate these pre-trained models into a dual encoder architecture and fine-tune them end-to-end using contrastive learning. A distinctive aspect of our proposed method is the fine-tuning strategy of pre-trained models using an adapted NT-Xent loss for contrastive learning, creating a shared embedding space for reference recordings and vocal imitations. The proposed system significantly enhances audio retrieval performance, establishing a new state of the art on both coarse- and fine-grained QBV tasks.

* Accepted to the DCASE Workshop 2024. Source code available: https://github.com/Jonathan-Greif/QBV
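
As a rough illustration of the contrastive objective mentioned in the abstract above, the following sketch computes a simplified, symmetric NT-Xent-style loss (normalized, temperature-scaled cross-entropy) between a batch of imitation embeddings and a batch of reference embeddings. It is not the adapted loss used in the paper: the pairing convention (row i of both inputs belongs together), the symmetric averaging, and the temperature value are assumptions of this sketch.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def nt_xent_style_loss(imitations, references, temperature=0.1):
    """Symmetric cross-entropy over temperature-scaled cosine similarities.

    imitations[i] and references[i] form the positive pair; every other
    item in the batch acts as a negative.
    """
    im = [normalize(v) for v in imitations]
    ref = [normalize(v) for v in references]
    n = len(im)
    sims = [
        [sum(a * b for a, b in zip(im[i], ref[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def mean_cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]  # negative log-probability of the positive
        return total / len(rows)

    columns = [list(col) for col in zip(*sims)]
    # Average the imitation-to-reference and reference-to-imitation directions.
    return 0.5 * (mean_cross_entropy(sims) + mean_cross_entropy(columns))
```

The loss is small when each imitation is most similar to its own reference and grows when the positives are less similar than the in-batch negatives.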

Multi-Iteration Multi-Stage Fine-Tuning of Transformers for Sound Event Detection with Heterogeneous Datasets

Jul 17, 2024

Florian Schmid, Paul Primus, Tobias Morocutti, Jonathan Greif, Gerhard Widmer

Abstract: A central problem in building effective sound event detection systems is the lack of high-quality, strongly annotated sound event datasets. For this reason, Task 4 of the DCASE 2024 challenge proposes learning from two heterogeneous datasets, including audio clips labeled with varying annotation granularity and with different sets of possible events. We propose a multi-iteration, multi-stage procedure for fine-tuning Audio Spectrogram Transformers on the joint DESED and MAESTRO Real datasets. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the pre-trained transformer model frozen. In the second stage, both CRNN and transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of fine-tuned transformers. Then, in a second iteration, we repeat the two-stage training process and include a distillation loss based on the pseudo-labels, achieving a new single-model, state-of-the-art performance on the public evaluation set of DESED with a PSDS1 of 0.692. A single model and an ensemble, both based on our proposed training procedure, ranked first in Task 4 of the DCASE Challenge 2024.

* Code: https://github.com/CPJKU/cpjku_dcase24

Fusing Audio and Metadata Embeddings Improves Language-based Audio Retrieval

Jun 22, 2024

Paul Primus, Gerhard Widmer

Abstract: Matching raw audio signals with textual descriptions requires understanding the audio's content and the description's semantics and then drawing connections between the two modalities. This paper investigates a hybrid retrieval system that utilizes audio metadata as an additional clue to understand the content of audio signals before matching them with textual queries. We experimented with metadata often attached to audio recordings, such as keywords and natural-language descriptions, and we investigated late and mid-level fusion strategies to merge audio and metadata. Our hybrid approach with keyword metadata and late fusion improved the retrieval performance over a content-based baseline by 2.36 and 3.69 pp. mAP@10 on the ClothoV2 and AudioCaps benchmarks, respectively.

* EUSIPCO 2024
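
One common way to realize late fusion, sketched here for illustration, is to score each recording separately from its audio embedding and from its metadata embedding and then combine the two scores. The abstract above does not specify the fusion rule, so the weighted sum of cosine similarities, the `weight` parameter, and the function names below are assumptions, not the paper's method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def late_fusion_rank(query_emb, audio_embs, meta_embs, weight=0.5):
    """Rank recordings by a weighted sum of two similarity scores.

    audio_embs[i] and meta_embs[i] describe the same recording;
    weight controls how much the metadata score contributes
    (hypothetical parameter for this sketch).
    """
    scores = [
        (1.0 - weight) * cosine(query_emb, a) + weight * cosine(query_emb, m)
        for a, m in zip(audio_embs, meta_embs)
    ]
    # Indices of recordings, best match first.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

With `weight=0.0` the ranking depends on the audio content alone, which corresponds to a content-based baseline in this sketch.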

Data-Efficient Low-Complexity Acoustic Scene Classification in the DCASE 2024 Challenge

May 16, 2024

Florian Schmid, Paul Primus, Toni Heittola, Annamaria Mesaros, Irene Martín-Morató, Khaled Koutini, Gerhard Widmer

Abstract: This article describes the Data-Efficient Low-Complexity Acoustic Scene Classification Task in the DCASE 2024 Challenge and the corresponding baseline system. The task setup is a continuation of previous editions (2022 and 2023), which focused on recording device mismatches and low-complexity constraints. This year's edition introduces an additional real-world problem: participants must develop data-efficient systems for five scenarios, which progressively limit the available training data. The provided baseline system is based on an efficient, factorized CNN architecture constructed from inverted residual blocks and uses Freq-MixStyle to tackle the device mismatch problem. The baseline system's accuracy ranges from 42.40% on the smallest to 56.99% on the largest training set.

* Task Description Page: https://dcase.community/challenge2024/task-data-efficient-low-complexity-acoustic-scene-classification

Advancing Natural-Language Based Audio Retrieval with PaSST and Large Audio-Caption Data Sets

Aug 08, 2023

Paul Primus, Khaled Koutini, Gerhard Widmer

Abstract: This work presents a text-to-audio-retrieval system based on pre-trained text and spectrogram transformers. Our method projects recordings and textual descriptions into a shared audio-caption space in which related examples from different modalities are close. Through a systematic analysis, we examine how each component of the system influences retrieval performance. As a result, we identify two key components that play a crucial role in driving performance: the self-attention-based audio encoder for audio embedding and the utilization of additional human-generated and synthetic data sets during pre-training. We further experimented with augmenting ClothoV2 captions with available keywords to increase their variety; however, this only led to marginal improvements. Our system ranked first in the 2023 DCASE Challenge, and it outperforms the current state of the art on the ClothoV2 benchmark by 5.6 pp. mAP@10.

* submitted to DCASE Workshop 2023
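
Several entries on this page report retrieval quality as mAP@10. The sketch below shows one common way to compute it; the normalization by `min(number of relevant items, k)` is an assumption of this sketch (definitions vary), and the function names are hypothetical. When each query has exactly one relevant recording, the average precision reduces to 1/rank if the recording appears in the top 10 and 0 otherwise.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Average precision over the top-k results of a single query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(len(relevant), k)


def mean_average_precision_at_k(rankings, relevants, k=10):
    """Mean of the per-query average precision values."""
    aps = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps)
```

For example, a query whose single relevant recording is ranked second contributes 0.5, so an improvement in pp. mAP@10 reflects relevant recordings moving closer to the top of the result list.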

Improving Natural-Language-based Audio Retrieval with Transfer Learning and Audio & Text Augmentations

Aug 24, 2022

Paul Primus, Gerhard Widmer

Abstract: The absence of large labeled datasets remains a significant challenge in many application areas of deep learning. Researchers and practitioners typically resort to transfer learning and data augmentation to alleviate this issue. We study these strategies in the context of audio retrieval with natural language queries (Task 6b of the DCASE 2022 Challenge). Our proposed system uses pre-trained embedding models to project recordings and textual descriptions into a shared audio-caption space in which related examples from different modalities are close. We employ various data augmentation techniques on audio and text inputs and systematically tune their corresponding hyperparameters with sequential model-based optimization. Our results show that the augmentation strategies used reduce overfitting and improve retrieval performance. We further show that pre-training the system on the AudioCaps dataset leads to additional improvements.

* submitted to DCASE Workshop 2022

Improved Zero-Shot Audio Tagging & Classification with Patchout Spectrogram Transformers

Aug 24, 2022

Paul Primus, Gerhard Widmer

Abstract: Standard machine learning models for tagging and classifying acoustic signals cannot handle classes that were not seen during training. Zero-Shot (ZS) learning overcomes this restriction by predicting classes based on adaptable class descriptions. This study sets out to investigate the effectiveness of self-attention-based audio embedding architectures for ZS learning. To this end, we compare the very recent patchout spectrogram transformer with two classic convolutional architectures. We evaluate these three architectures on three tasks and on three different benchmark datasets: general-purpose tagging on AudioSet, environmental sound classification on ESC-50, and instrument tagging on OpenMIC. Our results show that the self-attention-based embedding methods outperform both compared convolutional architectures in all of these settings. By designing training and test data accordingly, we observe that prediction performance suffers significantly when the 'semantic distance' between training and new test classes is large, an effect that deserves more detailed investigation.

* published in EUSIPCO 2022
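
The idea of predicting classes from class descriptions can be sketched as a nearest-neighbor lookup in a shared embedding space. This is a generic illustration, not the paper's model: it assumes that audio clips and class descriptions have already been embedded into the same space by some (hypothetical) encoders, and it represents those embeddings as plain lists of floats.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(audio_emb, class_embs):
    """Return the class whose description embedding is closest to the audio.

    class_embs maps a class name to the embedding of its description,
    so new classes can be added without retraining anything.
    """
    return max(class_embs, key=lambda name: cosine(audio_emb, class_embs[name]))
```

Adding an unseen class only requires adding its description embedding to `class_embs`, which is what makes the class set adaptable in this sketch.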
