Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Dan Stowell

Clustering and novel class recognition: evaluating bioacoustic deep learning feature extractors

Apr 09, 2025

Vincent S. Kather, Burooj Ghani, Dan Stowell

Abstract:In computational bioacoustics, deep learning models are composed of feature extractors and classifiers. The feature extractors generate vector representations of the input sound segments, called embeddings, which can be input to a classifier. While benchmarking of classification scores provides insights into specific performance statistics, it is limited to species that are included in the models' training data. Furthermore, it makes it impossible to compare models trained on very different taxonomic groups. This paper aims to address this gap by analyzing the embeddings generated by the feature extractors of 15 bioacoustic models spanning a wide range of setups (model architectures, training data, training paradigms). We evaluated and compared different ways in which models structure embedding spaces through clustering and kNN classification, which allows us to focus our comparison on feature extractors independent of their classifiers. We believe that this approach lets us evaluate the adaptability and generalization potential of models going beyond the classes they were trained on.

* conference

Via

Access Paper or Ask Questions

InsectSet459: an open dataset of insect sounds for bioacoustic machine learning

Mar 19, 2025

Marius Faiß, Burooj Ghani, Dan Stowell

Abstract:Automatic recognition of insect sound could help us understand changing biodiversity trends around the world -- but insect sounds are challenging to recognize even for deep learning. We present a new dataset comprised of 26399 audio files, from 459 species of Orthoptera and Cicadidae. It is the first large-scale dataset of insect sound that is easily applicable for developing novel deep-learning methods. Its recordings were made with a variety of audio recorders using varying sample rates to capture the extremely broad range of frequencies that insects produce. We benchmark performance with two state-of-the-art deep learning classifiers, demonstrating good performance but also significant room for improvement in acoustic insect classification. This dataset can serve as a realistic test case for implementing insect monitoring workflows, and as a challenging basis for the development of audio representation methods that can handle highly variable frequencies and/or sample rates.

Via

Access Paper or Ask Questions

Enhanced Load Forecasting with GAT-LSTM: Leveraging Grid and Temporal Features

Feb 12, 2025

Ugochukwu Orji, Çiçek Güven, Dan Stowell

Abstract:Accurate power load forecasting is essential for the efficient operation and planning of electrical grids, particularly given the increased variability and complexity introduced by renewable energy sources. This paper introduces GAT-LSTM, a hybrid model that combines Graph Attention Networks (GAT) and Long Short-Term Memory (LSTM) networks. A key innovation of the model is the incorporation of edge attributes, such as line capacities and efficiencies, into the attention mechanism, enabling it to dynamically capture spatial relationships grounded in grid-specific physical and operational constraints. Additionally, by employing an early fusion of spatial graph embeddings and temporal sequence features, the model effectively learns and predicts complex interactions between spatial dependencies and temporal patterns, providing a realistic representation of the dynamics of power grids. Experimental evaluations on the Brazilian Electricity System dataset demonstrate that the GAT-LSTM model significantly outperforms state-of-the-art models, achieving reductions of 21. 8% in MAE, 15. 9% in RMSE and 20. 2% in MAPE. These results underscore the robustness and adaptability of the GAT-LSTM model, establishing it as a powerful tool for applications in grid management and energy planning.

Via

Access Paper or Ask Questions

LHGNN: Local-Higher Order Graph Neural Networks For Audio Classification and Tagging

Jan 07, 2025

Shubhr Singh, Emmanouil Benetos, Huy Phan, Dan Stowell

Figure 1 for LHGNN: Local-Higher Order Graph Neural Networks For Audio Classification and Tagging

Figure 2 for LHGNN: Local-Higher Order Graph Neural Networks For Audio Classification and Tagging

Figure 3 for LHGNN: Local-Higher Order Graph Neural Networks For Audio Classification and Tagging

Figure 4 for LHGNN: Local-Higher Order Graph Neural Networks For Audio Classification and Tagging

Abstract:Transformers have set new benchmarks in audio processing tasks, leveraging self-attention mechanisms to capture complex patterns and dependencies within audio data. However, their focus on pairwise interactions limits their ability to process the higher-order relations essential for identifying distinct audio objects. To address this limitation, this work introduces the Local- Higher Order Graph Neural Network (LHGNN), a graph based model that enhances feature understanding by integrating local neighbourhood information with higher-order data from Fuzzy C-Means clusters, thereby capturing a broader spectrum of audio relationships. Evaluation of the model on three publicly available audio datasets shows that it outperforms Transformer-based models across all benchmarks while operating with substantially fewer parameters. Moreover, LHGNN demonstrates a distinct advantage in scenarios lacking ImageNet pretraining, establishing its effectiveness and efficiency in environments where extensive pretraining data is unavailable.

Via

Access Paper or Ask Questions

Generalization in birdsong classification: impact of transfer learning methods and dataset characteristics

Sep 21, 2024

Burooj Ghani, Vincent J. Kalkman, Bob Planqué, Willem-Pier Vellinga, Lisa Gill, Dan Stowell

Figure 1 for Generalization in birdsong classification: impact of transfer learning methods and dataset characteristics

Figure 2 for Generalization in birdsong classification: impact of transfer learning methods and dataset characteristics

Figure 3 for Generalization in birdsong classification: impact of transfer learning methods and dataset characteristics

Figure 4 for Generalization in birdsong classification: impact of transfer learning methods and dataset characteristics

Abstract:Animal sounds can be recognised automatically by machine learning, and this has an important role to play in biodiversity monitoring. Yet despite increasingly impressive capabilities, bioacoustic species classifiers still exhibit imbalanced performance across species and habitats, especially in complex soundscapes. In this study, we explore the effectiveness of transfer learning in large-scale bird sound classification across various conditions, including single- and multi-label scenarios, and across different model architectures such as CNNs and Transformers. Our experiments demonstrate that both fine-tuning and knowledge distillation yield strong performance, with cross-distillation proving particularly effective in improving in-domain performance on Xeno-canto data. However, when generalizing to soundscapes, shallow fine-tuning exhibits superior performance compared to knowledge distillation, highlighting its robustness and constrained nature. Our study further investigates how to use multi-species labels, in cases where these are present but incomplete. We advocate for more comprehensive labeling practices within the animal sound community, including annotating background species and providing temporal details, to enhance the training of robust bird sound classifiers. These findings provide insights into the optimal reuse of pretrained models for advancing automatic bioacoustic recognition.

* 25 pages

Via

Access Paper or Ask Questions

Acoustic identification of individual animals with hierarchical contrastive learning

Sep 13, 2024

Ines Nolasco, Ilyass Moummad, Dan Stowell, Emmanouil Benetos

Figure 1 for Acoustic identification of individual animals with hierarchical contrastive learning

Figure 2 for Acoustic identification of individual animals with hierarchical contrastive learning

Figure 3 for Acoustic identification of individual animals with hierarchical contrastive learning

Figure 4 for Acoustic identification of individual animals with hierarchical contrastive learning

Abstract:Acoustic identification of individual animals (AIID) is closely related to audio-based species classification but requires a finer level of detail to distinguish between individual animals within the same species. In this work, we frame AIID as a hierarchical multi-label classification task and propose the use of hierarchy-aware loss functions to learn robust representations of individual identities that maintain the hierarchical relationships among species and taxa. Our results demonstrate that hierarchical embeddings not only enhance identification accuracy at the individual level but also at higher taxonomic levels, effectively preserving the hierarchical structure in the learned representations. By comparing our approach with non-hierarchical models, we highlight the advantage of enforcing this structure in the embedding space. Additionally, we extend the evaluation to the classification of novel individual classes, demonstrating the potential of our method in open-set classification scenarios.

* Under review; Submitted to ICASSP 2025

Via

Access Paper or Ask Questions

animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics

Jun 03, 2024

Julian C. Schäfer-Zimmermann, Vlad Demartsev, Baptiste Averly, Kiran Dhanjal-Adams, Mathieu Duteil, Gabriella Gall, Marius Faiß, Lily Johnson-Ulrich, Dan Stowell, Marta B. Manser(+2 more)

Figure 1 for animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics

Figure 2 for animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics

Figure 3 for animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics

Figure 4 for animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics

Abstract:Bioacoustic research provides invaluable insights into the behavior, ecology, and conservation of animals. Most bioacoustic datasets consist of long recordings where events of interest, such as vocalizations, are exceedingly rare. Analyzing these datasets poses a monumental challenge to researchers, where deep learning techniques have emerged as a standard method. Their adaptation remains challenging, focusing on models conceived for computer vision, where the audio waveforms are engineered into spectrographic representations for training and inference. We improve the current state of deep learning in bioacoustics in two ways: First, we present the animal2vec framework: a fully interpretable transformer model and self-supervised training scheme tailored for sparse and unbalanced bioacoustic data. Second, we openly publish MeerKAT: Meerkat Kalahari Audio Transcripts, a large-scale dataset containing audio collected via biologgers deployed on free-ranging meerkats with a length of over 1068h, of which 184h have twelve time-resolved vocalization-type classes, each with ms-resolution, making it the largest publicly-available labeled dataset on terrestrial mammals. Further, we benchmark animal2vec against the NIPS4Bplus birdsong dataset. We report new state-of-the-art results on both datasets and evaluate the few-shot capabilities of animal2vec of labeled training data. Finally, we perform ablation studies to highlight the differences between our architecture and a vanilla transformer baseline for human-produced sounds. animal2vec allows researchers to classify massive amounts of sparse bioacoustic data even with little ground truth information available. In addition, the MeerKAT dataset is the first large-scale, millisecond-resolution corpus for benchmarking bioacoustic models in the pretrain/finetune paradigm. We believe this sets the stage for a new reference point for bioacoustics.

* Code available at: https://github.com/livingingroups/animal2vec | Dataset available at: https://doi.org/10.17617/3.0J0DYB

Via

Access Paper or Ask Questions

Performance of computer vision algorithms for fine-grained classification using crowdsourced insect images

Apr 04, 2024

Rita Pucci, Vincent J. Kalkman, Dan Stowell

Figure 1 for Performance of computer vision algorithms for fine-grained classification using crowdsourced insect images

Figure 2 for Performance of computer vision algorithms for fine-grained classification using crowdsourced insect images

Figure 3 for Performance of computer vision algorithms for fine-grained classification using crowdsourced insect images

Figure 4 for Performance of computer vision algorithms for fine-grained classification using crowdsourced insect images

Abstract:With fine-grained classification, we identify unique characteristics to distinguish among classes of the same super-class. We are focusing on species recognition in Insecta, as they are critical for biodiversity monitoring and at the base of many ecosystems. With citizen science campaigns, billions of images are collected in the wild. Once these are labelled, experts can use them to create distribution maps. However, the labelling process is time-consuming, which is where computer vision comes in. The field of computer vision offers a wide range of algorithms, each with its strengths and weaknesses; how do we identify the algorithm that is in line with our application? To answer this question, we provide a full and detailed evaluation of nine algorithms among deep convolutional networks (CNN), vision transformers (ViT), and locality-based vision transformers (LBVT) on 4 different aspects: classification performance, embedding quality, computational cost, and gradient activity. We offer insights that we haven't yet had in this domain proving to which extent these algorithms solve the fine-grained tasks in Insecta. We found that the ViT performs the best on inference speed and computational cost while the LBVT outperforms the others on performance and embedding quality; the CNN provide a trade-off among the metrics.

Via

Access Paper or Ask Questions

Mind the Domain Gap: a Systematic Analysis on Bioacoustic Sound Event Detection

Mar 27, 2024

Jinhua Liang, Ines Nolasco, Burooj Ghani, Huy Phan, Emmanouil Benetos, Dan Stowell

Figure 1 for Mind the Domain Gap: a Systematic Analysis on Bioacoustic Sound Event Detection

Figure 2 for Mind the Domain Gap: a Systematic Analysis on Bioacoustic Sound Event Detection

Figure 3 for Mind the Domain Gap: a Systematic Analysis on Bioacoustic Sound Event Detection

Figure 4 for Mind the Domain Gap: a Systematic Analysis on Bioacoustic Sound Event Detection

Abstract:Detecting the presence of animal vocalisations in nature is essential to study animal populations and their behaviors. A recent development in the field is the introduction of the task known as few-shot bioacoustic sound event detection, which aims to train a versatile animal sound detector using only a small set of audio samples. Previous efforts in this area have utilized different architectures and data augmentation techniques to enhance model performance. However, these approaches have not fully bridged the domain gap between source and target distributions, limiting their applicability in real-world scenarios. In this work, we introduce an new dataset designed to augment the diversity and breadth of classes available for few-shot bioacoustic event detection, building on the foundations of our previous datasets. To establish a robust baseline system tailored for the DCASE 2024 Task 5 challenge, we delve into an array of acoustic features and adopt negative hard sampling as our primary domain adaptation strategy. This approach, chosen in alignment with the challenge's guidelines that necessitate the independent treatment of each audio file, sidesteps the use of transductive learning to ensure compliance while aiming to enhance the system's adaptability to domain shifts. Our experiments show that the proposed baseline system achieves a better performance compared with the vanilla prototypical network. The findings also confirm the effectiveness of each domain adaptation method by ablating different components within the networks. This highlights the potential to improve few-shot bioacoustic sound event detection by further reducing the impact of domain shift.

Via

Access Paper or Ask Questions

Efficient speech detection in environmental audio using acoustic recognition and knowledge distillation

Dec 14, 2023

Drew Priebe, Burooj Ghani, Dan Stowell

Figure 1 for Efficient speech detection in environmental audio using acoustic recognition and knowledge distillation

Figure 2 for Efficient speech detection in environmental audio using acoustic recognition and knowledge distillation

Figure 3 for Efficient speech detection in environmental audio using acoustic recognition and knowledge distillation

Abstract:The ongoing biodiversity crisis, driven by factors such as land-use change and global warming, emphasizes the need for effective ecological monitoring methods. Acoustic monitoring of biodiversity has emerged as an important monitoring tool. Detecting human voices in soundscape monitoring projects is useful both for analysing human disturbance and for privacy filtering. Despite significant strides in deep learning in recent years, the deployment of large neural networks on compact devices poses challenges due to memory and latency constraints. Our approach focuses on leveraging knowledge distillation techniques to design efficient, lightweight student models for speech detection in bioacoustics. In particular, we employed the MobileNetV3-Small-Pi model to create compact yet effective student architectures to compare against the larger EcoVAD teacher model, a well-regarded voice detection architecture in eco-acoustic monitoring. The comparative analysis included examining various configurations of the MobileNetV3-Small-Pi derived student models to identify optimal performance. Additionally, a thorough evaluation of different distillation techniques was conducted to ascertain the most effective method for model selection. Our findings revealed that the distilled models exhibited comparable performance to the EcoVAD teacher model, indicating a promising approach to overcoming computational barriers for real-time ecological monitoring.

Via

Access Paper or Ask Questions