Dongxing Xu

Exploring the Potential of SSL Models for Sound Event Detection

May 17, 2025

Hanfang Cui, Longfei Song, Li Li, Dongxing Xu, Yanhua Long

Abstract: Self-supervised learning (SSL) models offer powerful representations for sound event detection (SED), yet their synergistic potential remains underexplored. This study systematically evaluates state-of-the-art SSL models to guide optimal model selection and integration for SED. We propose a framework that combines heterogeneous SSL representations (e.g., BEATs, HuBERT, WavLM) through three fusion strategies: individual SSL embedding integration, dual-modal fusion, and full aggregation. Experiments on the DCASE 2023 Task 4 Challenge reveal that dual-modal fusion (e.g., CRNN+BEATs+WavLM) achieves complementary performance gains, while CRNN+BEATs alone delivers the best results among individual SSL models. We further introduce normalized sound event bounding boxes (nSEBBs), an adaptive post-processing method that dynamically adjusts event boundary predictions, improving PSDS1 by up to 4% for standalone SSL models. These findings highlight the compatibility and complementarity of SSL architectures, providing guidance for task-specific fusion and robust SED system design.

* 27 pages, 5 figures, submitted to the Journal of King Saud University - Computer and Information Sciences (under review)

ICSD: An Open-source Dataset for Infant Cry and Snoring Detection

Aug 20, 2024

Qingyu Liu, Longfei Song, Dongxing Xu, Yanhua Long

Abstract: The detection and analysis of infant cry and snoring events are crucial tasks within the field of audio signal processing. While existing datasets for general sound event detection are plentiful, they often fall short in providing sufficient, strongly labeled data specific to infant cries and snoring. To provide a benchmark dataset and thus foster the research of infant cry and snoring detection, this paper introduces the Infant Cry and Snoring Detection (ICSD) dataset, a novel, publicly available dataset specially designed for ICSD tasks. The ICSD comprises three types of subsets: a real strongly labeled subset with event-based labels annotated manually, a weakly labeled subset with only clip-level event annotations, and a synthetic subset generated and labeled with strong annotations. This paper provides a detailed description of the ICSD creation process, including the challenges encountered and the solutions adopted. We offer a comprehensive characterization of the dataset, discussing its limitations and key factors for ICSD usage. Additionally, we conduct extensive experiments on the ICSD dataset to establish baseline systems and offer insights into the main factors when using this dataset for ICSD research. Our goal is to develop a dataset that will be widely adopted by the community as a new open benchmark for future ICSD research.

* 11 pages, 6 figures

Autoencoder with Group-based Decoder and Multi-task Optimization for Anomalous Sound Detection

Nov 15, 2023

Yifan Zhou, Dongxing Xu, Haoran Wei, Yanhua Long

Abstract: In industry, machine anomalous sound detection (ASD) is in great demand. However, collecting enough abnormal samples is difficult due to the high cost, which has spurred the rapid development of unsupervised ASD algorithms. Autoencoder (AE) based methods have been widely used for unsupervised ASD, but suffer from problems including 'shortcut', poor anti-noise ability, and sub-optimal quality of features. To address these challenges, we propose a new AE-based framework termed AEGM. Specifically, we first insert an auxiliary classifier into the AE to enhance ASD in a multi-task learning manner. Then, we design a group-based decoder structure, accompanied by an adaptive loss function, to endow the model with domain-specific knowledge. Results on the DCASE 2021 Task 2 development set show that our methods achieve relative improvements of 13.11% and 15.20%, respectively, in average AUC over the official AE and MobileNetV2 across test sets of seven machines.

* Submitted to the 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)

UNISOUND System for VoxCeleb Speaker Recognition Challenge 2023

Aug 24, 2023

Yu Zheng, Yajun Zhang, Chuanying Niu, Yibin Zhan, Yanhua Long, Dongxing Xu

Abstract: This report describes the UNISOUND submission for Track 1 and Track 2 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC 2023). We submit the same system on Track 1 and Track 2, which is trained with only VoxCeleb2-dev. Large-scale ResNet and RepVGG architectures are developed for the challenge. We propose a consistency-aware score calibration method, which leverages the stability of audio voiceprints in similarity score by a Consistency Measure Factor (CMF). CMF brings a huge performance boost in this challenge. Our final system is a fusion of six models and achieves first place in Track 1 and second place in Track 2 of VoxSRC 2023. The minDCF of our submission is 0.0855 and the EER is 1.5880%.
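
For readers unfamiliar with the EER figure quoted in the abstract above, the following is a minimal sketch of how an equal error rate might be approximated from verification scores. It is not the UNISOUND scoring code: the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy score lists are all assumptions made for illustration, and official challenge tooling may interpolate differently.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Hypothetical helper for illustration only; not the challenge's scorer.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap = float("inf")
    eer = None
    for t in thresholds:
        # False rejection rate: target trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Toy scores (made up): one target trial scores below one non-target trial.
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
print(f"EER = {100 * equal_error_rate(targets, nontargets):.2f}%")
```

With these toy scores the false acceptance and false rejection rates meet at a threshold of 0.6, where both equal 25%.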

Phonetic-assisted Multi-Target Units Modeling for Improving Conformer-Transducer ASR system

Nov 03, 2022

Li Li, Dongxing Xu, Haoran Wei, Yanhua Long

Abstract: Exploiting effective target modeling units is very important and has always been a concern in end-to-end automatic speech recognition (ASR). In this work, we propose a phonetic-assisted multi-target units (PMU) modeling approach to enhance the Conformer-Transducer ASR system in a progressive representation learning manner. Specifically, PMU first uses pronunciation-assisted subword modeling (PASM) and byte pair encoding (BPE) to produce phonetic-induced and text-induced target units separately. Then, three new frameworks are investigated to enhance the acoustic encoder, including a basic PMU, a paraCTC, and a pcaCTC, which integrate the PASM and BPE units at different levels for CTC and transducer multi-task training. Experiments on both LibriSpeech and accented ASR tasks show that the proposed PMU significantly outperforms conventional BPE, reducing the WER on the LibriSpeech clean, other, and six accented ASR test sets by a relative 12.7%, 6.0%, and 7.7%, respectively.

* 5 pages, 1 figure, submitted to ICASSP 2023
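
The abstract above reports WER gains as relative reductions. As a reading aid, here is a minimal sketch of that arithmetic; the function name `relative_reduction` and the baseline and improved WER values are hypothetical, since the abstract gives only the relative percentages and not the underlying error rates.

```python
def relative_reduction(baseline, improved):
    """Return the relative reduction of `improved` versus `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


# Hypothetical numbers: a WER falling from 10.00% to 8.73% is a 12.7% relative
# reduction, even though the absolute drop is only 1.27 points.
print(f"{relative_reduction(10.0, 8.73):.1f}% relative WER reduction")
```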
