Abstract:Sound Event Detection (SED) is challenging in noisy environments where overlapping sounds obscure target events. Language-queried audio source separation (LASS) aims to isolate the target sound events from a noisy clip. However, this approach can fail when the exact target sound is unknown, particularly in noisy test sets, leading to reduced performance. To address this issue, we leverage the capabilities of large language models (LLMs) to analyze and summarize acoustic data. By using LLMs to identify and select specific noise types, we implement a noise augmentation method for noise-robust fine-tuning. The fine-tuned model then produces clip-wise event predictions, which serve as text queries for the LASS model. Our studies demonstrate that the proposed method improves SED performance in noisy environments. This work represents an early application of LLMs in noise-robust SED and suggests a promising direction for handling overlapping events in SED. Code and pretrained models are available at https://github.com/apple-yinhan/Noise-robust-SED.
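Below is a minimal sketch of the kind of noise augmentation described above, where noise clips of the LLM-selected types are mixed into training audio at random signal-to-noise ratios for noise-robust fine-tuning. It is not the released code; `noise_bank`, the SNR range, and the function names are illustrative assumptions.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise clip into a clean clip at a target signal-to-noise ratio (dB)."""
    # Loop or trim the noise so it matches the clean clip's length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    # Scale the noise to reach the requested SNR, then mix.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def augment(clean: np.ndarray, noise_bank: list, snr_range=(0.0, 20.0),
            rng: np.random.Generator | None = None) -> np.ndarray:
    """Draw one clip from the bank of LLM-selected noise types and mix it in."""
    rng = rng if rng is not None else np.random.default_rng()
    noise = noise_bank[rng.integers(len(noise_bank))]
    snr_db = rng.uniform(*snr_range)
    return mix_at_snr(clean, noise, snr_db)
```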
Abstract:This paper focuses on detecting clickbait posts on the Web. These posts often use eye-catching disinformation in mixed modalities to mislead users into clicking for profit. This degrades the user experience and is therefore blocked by content providers. To escape detection, malicious creators add irrelevant non-bait content to bait posts, dressing them up as legitimate to fool the detector. This content often has biased relations with non-bait labels, yet traditional detectors tend to make predictions based on simple co-occurrence rather than grasping the inherent factors that lead to malicious behavior. Such spurious bias easily causes misjudgments. To address this problem, we propose a new debiased method based on causal inference. We first employ a set of features in multiple modalities to characterize the posts. Since these features are often entangled with unknown biases, we then disentangle three kinds of latent factors from them: the invariant factor that indicates intrinsic bait intention, the causal factor that reflects deceptive patterns in a specific scenario, and non-causal noise. By eliminating the noise that causes bias, we can use the invariant and causal factors to build a robust model with good generalization ability. Experiments on three popular datasets show the effectiveness of our approach.
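The factor disentanglement can be pictured with a small module like the following. This is a hedged sketch under assumed shapes and names (three parallel encoders and a classifier over the invariant and causal factors only), not the paper's implementation; the auxiliary causal and independence losses are omitted.

```python
import torch
import torch.nn as nn

class DisentangledDetector(nn.Module):
    """Split fused multimodal post features into invariant, causal, and noise factors."""
    def __init__(self, feat_dim: int, factor_dim: int = 64, num_classes: int = 2):
        super().__init__()
        self.inv_enc = nn.Sequential(nn.Linear(feat_dim, factor_dim), nn.ReLU())
        self.cau_enc = nn.Sequential(nn.Linear(feat_dim, factor_dim), nn.ReLU())
        self.noise_enc = nn.Sequential(nn.Linear(feat_dim, factor_dim), nn.ReLU())
        # Prediction uses only the invariant and causal factors; the noise factor is dropped.
        self.classifier = nn.Linear(2 * factor_dim, num_classes)

    def forward(self, x: torch.Tensor):
        inv, cau, noise = self.inv_enc(x), self.cau_enc(x), self.noise_enc(x)
        logits = self.classifier(torch.cat([inv, cau], dim=-1))
        return logits, (inv, cau, noise)  # factors exposed for auxiliary debiasing losses
```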
Abstract:This work explores domain generalization (DG) for sound event detection (SED), advancing adaptability towards real-world scenarios. Our approach employs a mean-teacher framework with domain generalization to integrate heterogeneous training data while preserving the SED model's performance across the datasets. Specifically, we first apply MixStyle to the frequency dimension to adapt the mel-spectrograms from different domains. Next, we use the adaptive residual normalization method to generalize features across multiple domains by applying instance normalization in the frequency dimension. Lastly, we use the sound event bounding boxes method for post-processing. Our approach integrates features from bidirectional encoder representations from audio transformers and a convolutional recurrent neural network. We evaluate the proposed approach on the DCASE 2024 Challenge Task 4 dataset, measuring the polyphonic sound detection score (PSDS) on the DESED dataset and the macro-average pAUC on the MAESTRO dataset. The results indicate that the proposed DG-based method improves both PSDS and macro-average pAUC compared to the challenge baseline.
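As an illustration of the first step, here is a minimal PyTorch sketch of MixStyle applied along the frequency dimension of a mel-spectrogram batch. The tensor layout (batch, channels, freq, time), the Beta parameter, and the application probability are assumptions, not the exact configuration used in the paper.

```python
import torch

def freq_mixstyle(x: torch.Tensor, alpha: float = 0.3, p: float = 0.5,
                  eps: float = 1e-6) -> torch.Tensor:
    """Mix frequency-wise feature statistics between random pairs of clips (training only).

    x: mel-spectrogram batch of shape (batch, channels, freq, time).
    """
    if torch.rand(1).item() > p:
        return x
    b = x.size(0)
    # Per-clip, per-frequency-bin statistics, pooled over channels and time.
    mu = x.mean(dim=(1, 3), keepdim=True)                     # (b, 1, freq, 1)
    sig = (x.var(dim=(1, 3), keepdim=True) + eps).sqrt()
    x_norm = (x - mu) / sig
    # Interpolate each clip's statistics with those of a randomly permuted clip.
    lam = torch.distributions.Beta(alpha, alpha).sample((b, 1, 1, 1)).to(x.device)
    perm = torch.randperm(b, device=x.device)
    mu_mix = lam * mu + (1 - lam) * mu[perm]
    sig_mix = lam * sig + (1 - lam) * sig[perm]
    return x_norm * sig_mix + mu_mix
```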
Abstract:This report presents the systems developed and submitted by Fortemedia Singapore (FMSG) and the Joint Laboratory of Environmental Sound Sensing (JLESS) for DCASE 2024 Task 4. The task focuses on recognizing event classes and their time boundaries, given that multiple events can be present and may overlap in an audio recording. The novelty this year is a dataset with two sources, making it challenging to achieve good performance without knowing the source of the audio clips during evaluation. To address this, we propose a sound event detection method using domain generalization. Our approach integrates features from bidirectional encoder representations from audio transformers and a convolutional recurrent neural network. We focus on three main strategies to improve our method. First, we apply MixStyle to the frequency dimension to adapt the mel-spectrograms from different domains. Second, we compute the training loss of our model separately for each dataset over its corresponding classes. This independent learning framework helps the model extract domain-specific features effectively. Lastly, we use the sound event bounding boxes method for post-processing. Our proposed method shows superior macro-average pAUC and polyphonic sound detection score (PSDS) performance on the DCASE 2024 Challenge Task 4 validation and public evaluation datasets.
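The second strategy (dataset-specific training loss) can be sketched as a masked loss in which each clip contributes only to the classes of its source dataset. This is an assumed formulation for illustration, with hypothetical tensor names, not the submitted system's code.

```python
import torch
import torch.nn.functional as F

def per_dataset_bce(logits: torch.Tensor, targets: torch.Tensor,
                    domain_ids: torch.Tensor, class_masks: torch.Tensor) -> torch.Tensor:
    """
    logits, targets: (batch, frames, num_classes) frame-level predictions and labels.
    domain_ids:      (batch,) index of each clip's source dataset (e.g. DESED or MAESTRO).
    class_masks:     (num_domains, num_classes) binary mask of the classes owned by each dataset.
    """
    mask = class_masks[domain_ids]                 # (batch, num_classes)
    mask = mask.unsqueeze(1).expand_as(logits)     # broadcast over frames
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```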
Abstract:Acoustic scene classification (ASC) is a crucial research problem in computational auditory scene analysis, and it aims to recognize the unique acoustic characteristics of an environment. One of the challenges of the ASC task is domain shift caused by a distribution gap between training and testing data. Since 2018, ASC challenges have focused on the generalization of ASC models across different recording devices. Although substantial progress has been achieved in device generalization in recent years, the challenge of domain shift between different regions, involving characteristics such as time, space, culture, and language, remains insufficiently explored. In addition, considering the abundance of unlabeled acoustic scene data in the real world, it is important to study possible ways to utilize these unlabeled data. Therefore, we introduce the task Semi-supervised Acoustic Scene Classification under Domain Shift in the ICME 2024 Grand Challenge. We encourage participants to innovate with semi-supervised learning techniques, aiming to develop more robust ASC models under domain shift.
Abstract:This paper presents a detailed description of our proposed methods for the ICASSP 2024 Cadenza Challenge. Experimental results show that the proposed system achieves better performance than the official baselines.
Abstract:Traditional binary hard labels for sound event detection (SED) lack details about the complexity and variability of sound event distributions. Recently, a novel annotation workflow was proposed to generate fine-grained non-binary soft labels, resulting in a new real-life dataset named MAESTRO Real for SED. In this paper, we first propose an interactive dual-conformer (IDC) module, in which a cross-interaction mechanism is applied to effectively exploit the information from soft labels. In addition, a novel scene-inspired mask (SIM) based on soft labels is incorporated for more precise SED predictions. The SIM is initially generated through a statistical approach, referred to as SIM-V1. However, the fixed artificial mask may mismatch the SED model, resulting in limited effectiveness. Therefore, we further propose SIM-V2, which employs a word embedding model for adaptive SIM estimation. Experimental results show that the proposed IDC module can effectively utilize the information from soft labels, and the integration of SIM-V1 can further improve the accuracy. In addition, the impact of different word embedding dimensions on SIM-V2 is explored, and the results show that an appropriate dimension enables SIM-V2 to achieve superior performance to SIM-V1. In DCASE 2023 Challenge Task 4B, the proposed system achieved the top-ranking performance on the evaluation dataset of MAESTRO Real.
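To make the SIM idea concrete, the following is a minimal sketch of how a statistically estimated scene-event prior (in the spirit of SIM-V1) could mask frame-level event predictions. The shapes, the multiplicative form, and the name `prior` are assumptions rather than the paper's exact formulation.

```python
import torch

def apply_scene_mask(event_probs: torch.Tensor, scene_probs: torch.Tensor,
                     prior: torch.Tensor) -> torch.Tensor:
    """
    event_probs: (batch, frames, num_events) raw SED predictions.
    scene_probs: (batch, num_scenes) clip-level scene estimates.
    prior:       (num_scenes, num_events) scene-event statistics derived from soft labels.
    """
    mask = scene_probs @ prior                 # (batch, num_events) expected event presence
    return event_probs * mask.unsqueeze(1)     # down-weight events unlikely in the scene
```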
Abstract:Previous studies in automated audio captioning have faced difficulties in accurately capturing the complete temporal details of acoustic scenes and events within long audio sequences. This paper presents AudioLog, an audio logging system powered by large language models (LLMs) with multi-task learning of acoustic tasks. Specifically, we propose a joint training network, achieved by fine-tuning a large audio model based on the pre-trained hierarchical token-semantic audio Transformer. We then leverage LLMs to craft audio logs that summarize textual descriptions of the acoustic environment. Experiments show that the proposed system attains exceptional performance in acoustic scene classification and sound event detection, surpassing existing methods in the field. Further analyses demonstrate AudioLog's power in effectively summarizing long audio sequences.
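A hedged sketch of the joint training structure is given below: a shared pre-trained audio Transformer encoder feeding a clip-level scene head and a frame-level event head. The module and parameter names are hypothetical, and the encoder's output layout is assumed to be (batch, frames, embed_dim).

```python
import torch
import torch.nn as nn

class MultiTaskAudioHead(nn.Module):
    """Joint acoustic scene classification and sound event detection heads."""
    def __init__(self, encoder: nn.Module, embed_dim: int,
                 num_scenes: int, num_events: int):
        super().__init__()
        self.encoder = encoder                                 # pre-trained, fine-tuned jointly
        self.scene_head = nn.Linear(embed_dim, num_scenes)     # clip-level ASC
        self.event_head = nn.Linear(embed_dim, num_events)     # frame-level SED

    def forward(self, x: torch.Tensor):
        feats = self.encoder(x)                                # (batch, frames, embed_dim)
        scene_logits = self.scene_head(feats.mean(dim=1))      # pool frames for the scene
        event_logits = self.event_head(feats)                  # per-frame event logits
        return scene_logits, event_logits
```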
Abstract:3D speech enhancement has attracted much attention in recent years with the development of augmented reality technology. Traditional denoising convolutional autoencoders have limitations in extracting dynamic voice information. In this paper, we propose a two-stage autoencoder neural network for 3D speech enhancement. We incorporate a dual-path recurrent neural network block into the convolutional autoencoder to iteratively apply time-domain and frequency-domain modeling in an alternating fashion. We also propose an attention mechanism for fusing high-dimensional features, and introduce a loss function to simultaneously optimize the network in the time-frequency and time domains. Experimental results show that our system outperforms state-of-the-art systems on the dataset of the ICASSP L3DAS23 Challenge.
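The alternating time/frequency modeling can be illustrated with a dual-path block like the one below. This is a minimal sketch with an assumed feature-map layout (batch, channels, freq, time) and hidden sizes, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """Alternately model the frequency axis and the time axis with bidirectional LSTMs."""
    def __init__(self, channels: int, hidden: int = 128):
        super().__init__()
        self.freq_rnn = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.freq_proj = nn.Linear(2 * hidden, channels)
        self.time_rnn = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.time_proj = nn.Linear(2 * hidden, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, f, t = x.shape
        # Intra-frequency pass: run the RNN along the frequency axis for each frame.
        y = x.permute(0, 3, 2, 1).reshape(b * t, f, c)
        y = self.freq_proj(self.freq_rnn(y)[0]).reshape(b, t, f, c).permute(0, 3, 2, 1)
        x = x + y                                    # residual connection
        # Intra-time pass: run the RNN along the time axis for each frequency bin.
        y = x.permute(0, 2, 3, 1).reshape(b * f, t, c)
        y = self.time_proj(self.time_rnn(y)[0]).reshape(b, f, t, c).permute(0, 3, 1, 2)
        return x + y
```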