Eklavya Sarkar

On feature representations for marmoset vocal communication analysis

Apr 21, 2025

Eklavya Sarkar, Kaja Wierucka, Alexandra B. Bosshard, Judith Burkart, Mathew Magimai.-Doss

Abstract: The acoustic analysis of marmoset (Callithrix jacchus) vocalizations is often used to understand the evolutionary origins of human language. Currently, the analysis is largely carried out in a manual or semi-manual manner. Thus, there is a need to develop automatic call analysis methods. In that direction, research has been limited to the development of analysis methods with small amounts of data or for specific scenarios. Furthermore, there is a lack of prior knowledge about what type of information is relevant for different call analysis tasks. To address these issues, as a first step, this paper explores different feature representation methods, namely, HCTSA-based hand-crafted Catch22 features, pre-trained self-supervised learning (SSL) based features extracted from neural networks trained on human speech, and end-to-end acoustic modeling, for call-type classification, caller identification, and caller sex identification. Through an investigation on three different marmoset call datasets, we demonstrate that SSL-based feature representations and end-to-end acoustic modeling tend to lead to better systems than Catch22 features for call-type and caller classification. Furthermore, we also highlight the impact of signal bandwidth on the obtained task performances.

* Bioacoustics Journal (2025) 1-15

Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing

Jan 10, 2025

Eklavya Sarkar, Mathew Magimai.-Doss

Abstract: Self-supervised learning (SSL) foundation models have emerged as powerful, domain-agnostic, general-purpose feature extractors applicable to a wide range of tasks. Such models pre-trained on human speech have demonstrated high transferability for bioacoustic processing. This paper investigates (i) whether SSL models pre-trained directly on animal vocalizations offer a significant advantage over those pre-trained on speech, and (ii) whether fine-tuning speech-pretrained models on automatic speech recognition (ASR) tasks can enhance bioacoustic classification. We conduct a comparative analysis using three diverse bioacoustic datasets and two different bioacoustic tasks. Results indicate that pre-training on bioacoustic data provides only marginal improvements over speech-pretrained models, with comparable performance in most scenarios. Fine-tuning on ASR tasks yields mixed outcomes, suggesting that the general-purpose representations learned during SSL pre-training are already well-suited for bioacoustic tasks. These findings highlight the robustness of speech-pretrained SSL models for bioacoustics and imply that extensive fine-tuning may not be necessary for optimal performance.

* Accepted at ICASSP 2025

Feature Representations for Automatic Meerkat Vocalization Classification

Aug 27, 2024

Imen Ben Mahmoud, Eklavya Sarkar, Marta Manser, Mathew Magimai.-Doss

Abstract: Understanding the evolution of vocal communication in social animals is an important research problem. In that context, beyond humans, there is an interest in analyzing vocalizations of other social animals such as meerkats, marmosets, and apes. While existing approaches address vocalizations of certain species, a reliable method tailored for meerkat calls is lacking. To that end, this paper investigates feature representations for automatic meerkat vocalization analysis. Both traditional signal processing-based representations and data-driven representations facilitated by advances in deep learning are explored. Call-type classification studies conducted on two datasets reveal that feature extraction methods developed for human speech processing can be effectively employed for automatic meerkat call analysis.

* Accepted at Interspeech 2024 satellite event (VIHAR 2024)

On the Utility of Speech and Audio Foundation Models for Marmoset Call Analysis

Jul 24, 2024

Eklavya Sarkar, Mathew Magimai.-Doss

Abstract: Marmoset monkeys encode vital information in their calls and serve as a surrogate model for neuro-biologists to understand the evolutionary origins of human vocal communication. While their calls have traditionally been analyzed with signal processing-based features, recent approaches have utilized self-supervised models pre-trained on human speech for feature extraction, capitalizing on their ability to learn a signal's intrinsic structure independently of its acoustic domain. However, the utility of such foundation models remains unclear for marmoset call analysis in terms of multi-class classification, bandwidth, and pre-training domain. This study assesses feature representations derived from speech and general audio domains, across pre-training bandwidths of 4, 8, and 16 kHz, for marmoset call-type and caller classification tasks. Results show that models with higher bandwidth improve performance, and that pre-training on speech or general audio yields comparable results, improving over a spectral baseline.

* Accepted at Interspeech 2024 satellite event (VIHAR 2024)

Can Self-Supervised Neural Networks Pre-Trained on Human Speech distinguish Animal Callers?

May 23, 2023

Eklavya Sarkar, Mathew Magimai.-Doss

Abstract: Self-supervised learning (SSL) models use only the intrinsic structure of a given signal, independent of its acoustic domain, to extract essential information from the input to an embedding space. This implies that the utility of such representations is not limited to modeling human speech alone. Building on this understanding, this paper explores the cross-transferability of SSL neural representations learned from human speech to analyze bio-acoustic signals. We conduct a caller discrimination analysis and a caller detection study on Marmoset vocalizations using eleven SSL models pre-trained with various pretext tasks. The results show that the embedding spaces carry meaningful caller information and can successfully distinguish the individual identities of Marmoset callers without fine-tuning. This demonstrates that representations pre-trained on human speech can be effectively applied to the bio-acoustics domain, providing valuable insights for future investigations in this field.

* Accepted at Interspeech 2023

Unsupervised Voice Activity Detection by Modeling Source and System Information using Zero Frequency Filtering

Jun 27, 2022

Eklavya Sarkar, RaviShankar Prasad, Mathew Magimai.-Doss

Abstract: Voice activity detection (VAD) is an important pre-processing step for speech technology applications. The task consists of deriving segment boundaries of audio signals which contain voicing information. In recent years, it has been shown that voice source and vocal tract system information can be extracted using zero-frequency filtering (ZFF) without making any explicit model assumptions about the speech signal. This paper investigates the potential of zero-frequency filtering for jointly modeling voice source and vocal tract system information, and proposes two approaches for VAD. The first approach demarcates voiced regions using a composite signal composed of different zero-frequency filtered signals. The second approach feeds the composite signal as input to the rVAD algorithm. These approaches are compared with other supervised and unsupervised VAD methods in the literature, and are evaluated on the Aurora-2 database across a range of SNRs (20 to -5 dB). Our studies show that the proposed ZFF-based methods perform comparably to state-of-the-art VAD methods and are more invariant to added degradation and different channel characteristics.

* Accepted at Interspeech 2022

Are GAN-based Morphs Threatening Face Recognition?

May 05, 2022

Eklavya Sarkar, Pavel Korshunov, Laurent Colbois, Sébastien Marcel

Abstract: Morphing attacks are a threat to biometric systems where the biometric reference in an identity document can be altered. This form of attack presents an important issue in applications relying on identity documents, such as border security or access control. Research in the generation of face morphs and their detection is developing rapidly; however, very few datasets with morphing attacks and open-source detection toolkits are publicly available. This paper bridges this gap by providing two datasets and the corresponding code for four types of morphing attacks: two that rely on facial landmarks based on OpenCV and FaceMorpher, and two that use StyleGAN 2 to generate synthetic morphs. We also conduct extensive experiments to assess the vulnerability of four state-of-the-art face recognition systems, including FaceNet, VGG-Face, ArcFace, and ISV. Surprisingly, the experiments demonstrate that, although visually more appealing, morphs based on StyleGAN 2 do not pose a significant threat to state-of-the-art face recognition systems, as these morphs were outmatched by the simple morphs that are based on facial landmarks.

* 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
* arXiv admin note: substantial text overlap with arXiv:2012.05344

Vulnerability Analysis of Face Morphing Attacks from Landmarks and Generative Adversarial Networks

Dec 09, 2020

Eklavya Sarkar, Pavel Korshunov, Laurent Colbois, Sébastien Marcel

Abstract: Morphing attacks are a threat to biometric systems where the biometric reference in an identity document can be altered. This form of attack presents an important issue in applications relying on identity documents, such as border security or access control. Research in face morphing attack detection is developing rapidly; however, very few datasets with several forms of attacks are publicly available. This paper bridges this gap by providing a new dataset with four different types of morphing attacks, based on OpenCV, FaceMorpher, WebMorph and a generative adversarial network (StyleGAN), generated with original face images from three public face datasets. We also conduct extensive experiments to assess the vulnerability of state-of-the-art face recognition systems, notably FaceNet, VGG-Face, and ArcFace. The experiments demonstrate that VGG-Face, while being a less accurate face recognition system than FaceNet, is also less vulnerable to morphing attacks. We also observed that naïve morphs generated with a StyleGAN do not pose a significant threat.

* Submitted to ICASSP 2021
