Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yasunori Ohishi

Description and Discussion on DCASE 2025 Challenge Task 4: Spatial Semantic Segmentation of Sound Scenes

Jun 12, 2025

Masahiro Yasuda, Binh Thien Nguyen, Noboru Harada, Romain Serizel, Mayank Mishra, Marc Delcroix, Shoko Araki, Daiki Takeuchi, Daisuke Niizumi, Yasunori Ohishi(+3 more)

Figure 1 for Description and Discussion on DCASE 2025 Challenge Task 4: Spatial Semantic Segmentation of Sound Scenes

Figure 2 for Description and Discussion on DCASE 2025 Challenge Task 4: Spatial Semantic Segmentation of Sound Scenes

Figure 3 for Description and Discussion on DCASE 2025 Challenge Task 4: Spatial Semantic Segmentation of Sound Scenes

Abstract:Spatial Semantic Segmentation of Sound Scenes (S5) aims to enhance technologies for sound event detection and separation from multi-channel input signals that mix multiple sound events with spatial information. This is a fundamental basis of immersive communication. The ultimate goal is to separate sound event signals with 6 Degrees of Freedom (6DoF) information into dry sound object signals and metadata about the object type (sound event class) and representing spatial information, including direction. However, because several existing challenge tasks already provide some of the subset functions, this task for this year focuses on detecting and separating sound events from multi-channel spatial input signals. This paper outlines the S5 task setting of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge Task 4 and the DCASE2025 Task 4 Dataset, newly recorded and curated for this task. We also report experimental results for an S5 system trained and evaluated on this dataset. The full version of this paper will be published after the challenge results are made public.

Via

Access Paper or Ask Questions

Towards Pre-training an Effective Respiratory Audio Foundation Model

May 21, 2025

Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda, Binh Thien Nguyen, Yasunori Ohishi, Noboru Harada

Abstract:Recent advancements in foundation models have sparked interest in respiratory audio foundation models. However, the effectiveness of applying conventional pre-training schemes to datasets that are small-sized and lack diversity has not been sufficiently verified. This study aims to explore better pre-training practices for respiratory sounds by comparing numerous pre-trained audio models. Our investigation reveals that models pre-trained on AudioSet, a general audio dataset, are more effective than the models specifically pre-trained on respiratory sounds. Moreover, combining AudioSet and respiratory sound datasets for further pre-training enhances performance, and preserving the frequency-wise information when aggregating features is vital. Along with more insights found in the experiments, we establish a new state-of-the-art for the OPERA benchmark, contributing to advancing respiratory audio foundation models. Our code is available online at https://github.com/nttcslab/eval-audio-repr/tree/main/plugin/OPERA.

* 5 pages, 2 figures, 4 tables, Accepted by Interspeech 2025

Via

Access Paper or Ask Questions

Assessing the Utility of Audio Foundation Models for Heart and Respiratory Sound Analysis

Apr 25, 2025

Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda, Binh Thien Nguyen, Yasunori Ohishi, Noboru Harada

Figure 1 for Assessing the Utility of Audio Foundation Models for Heart and Respiratory Sound Analysis

Figure 2 for Assessing the Utility of Audio Foundation Models for Heart and Respiratory Sound Analysis

Figure 3 for Assessing the Utility of Audio Foundation Models for Heart and Respiratory Sound Analysis

Figure 4 for Assessing the Utility of Audio Foundation Models for Heart and Respiratory Sound Analysis

Abstract:Pre-trained deep learning models, known as foundation models, have become essential building blocks in machine learning domains such as natural language processing and image domains. This trend has extended to respiratory and heart sound models, which have demonstrated effectiveness as off-the-shelf feature extractors. However, their evaluation benchmarking has been limited, resulting in incompatibility with state-of-the-art (SOTA) performance, thus hindering proof of their effectiveness. This study investigates the practical effectiveness of off-the-shelf audio foundation models by comparing their performance across four respiratory and heart sound tasks with SOTA fine-tuning results. Experiments show that models struggled on two tasks with noisy data but achieved SOTA performance on the other tasks with clean data. Moreover, general-purpose audio models outperformed a respiratory sound model, highlighting their broader applicability. With gained insights and the released code, we contribute to future research on developing and leveraging foundation models for respiratory and heart sounds.

* 4 pages, 1 figure, and 4 tables. Accepted by IEEE EMBC 2025

Via

Access Paper or Ask Questions

Baseline Systems and Evaluation Metrics for Spatial Semantic Segmentation of Sound Scenes

Mar 28, 2025

Binh Thien Nguyen, Masahiro Yasuda, Daiki Takeuchi, Daisuke Niizumi, Yasunori Ohishi, Noboru Harada

Abstract:Immersive communication has made significant advancements, especially with the release of the codec for Immersive Voice and Audio Services. Aiming at its further realization, the DCASE 2025 Challenge has recently introduced a task for spatial semantic segmentation of sound scenes (S5), which focuses on detecting and separating sound events in spatial sound scenes. In this paper, we explore methods for addressing the S5 task. Specifically, we present baseline S5 systems that combine audio tagging (AT) and label-queried source separation (LSS) models. We investigate two LSS approaches based on the ResUNet architecture: a) extracting a single source for each detected event and b) querying multiple sources concurrently. Since each separated source in S5 is identified by its sound event class label, we propose new class-aware metrics to evaluate both the sound sources and labels simultaneously. Experimental results on first-order ambisonics spatial audio demonstrate the effectiveness of the proposed systems and confirm the efficacy of the metrics.

* 5 pages

Via

Access Paper or Ask Questions

M2D2: Exploring General-purpose Audio-Language Representations Beyond CLAP

Mar 28, 2025

Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda, Binh Thien Nguyen, Yasunori Ohishi, Noboru Harada

Abstract:Contrastive language-audio pre-training (CLAP) has addressed audio-language tasks such as audio-text retrieval by aligning audio and text in a common feature space. While CLAP addresses general audio-language tasks, its audio features do not generalize well in audio tasks. In contrast, self-supervised learning (SSL) models learn general-purpose audio features that perform well in diverse audio tasks. We pursue representation learning that can be widely used in audio applications and hypothesize that a method that learns both general audio features and CLAP features should achieve our goal, which we call a general-purpose audio-language representation. To implement our hypothesis, we propose M2D2, a second-generation masked modeling duo (M2D) that combines an SSL M2D and CLAP. M2D2 learns two types of features using two modalities (audio and text) in a two-stage training process. It also utilizes advanced LLM-based sentence embeddings in CLAP training for powerful semantic supervision. In the first stage, M2D2 learns generalizable audio features from M2D and CLAP, where CLAP aligns the features with the fine LLM-based semantic embeddings. In the second stage, it learns CLAP features using the audio features learned from the LLM-based embeddings. Through these pre-training stages, M2D2 should enhance generalizability and performance in its audio and CLAP features. Experiments validated that M2D2 achieves effective general-purpose audio-language representation, highlighted with SOTA fine-tuning mAP of 49.0 for AudioSet, SOTA performance in music tasks, and top-level performance in audio-language tasks.

* 15 pages, 7 figures, 13 tables, under review at an IEEE journal

Via

Access Paper or Ask Questions

M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation

Jun 04, 2024

Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Masahiro Yasuda, Shunsuke Tsubaki, Keisuke Imoto

Figure 1 for M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation

Figure 2 for M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation

Figure 3 for M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation

Figure 4 for M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation

Abstract:Contrastive language-audio pre-training (CLAP) enables zero-shot (ZS) inference of audio and exhibits promising performance in several classification tasks. However, conventional audio representations are still crucial for many tasks where ZS is not applicable (e.g., regression problems). Here, we explore a new representation, a general-purpose audio-language representation, that performs well in both ZS and transfer learning. To do so, we propose a new method, M2D-CLAP, which combines self-supervised learning Masked Modeling Duo (M2D) and CLAP. M2D learns an effective representation to model audio signals, and CLAP aligns the representation with text embedding. As a result, M2D-CLAP learns a versatile representation that allows for both ZS and transfer learning. Experiments show that M2D-CLAP performs well on linear evaluation, fine-tuning, and ZS classification with a GTZAN state-of-the-art of 75.17%, thus achieving a general-purpose audio-language representation.

* 5 pages, 1 figure, 5 tables. Accepted by Interspeech 2024

Via

Access Paper or Ask Questions

Exploring Pre-trained General-purpose Audio Representations for Heart Murmur Detection

Apr 26, 2024

Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

Figure 1 for Exploring Pre-trained General-purpose Audio Representations for Heart Murmur Detection

Figure 2 for Exploring Pre-trained General-purpose Audio Representations for Heart Murmur Detection

Figure 3 for Exploring Pre-trained General-purpose Audio Representations for Heart Murmur Detection

Figure 4 for Exploring Pre-trained General-purpose Audio Representations for Heart Murmur Detection

Abstract:To reduce the need for skilled clinicians in heart sound interpretation, recent studies on automating cardiac auscultation have explored deep learning approaches. However, despite the demands for large data for deep learning, the size of the heart sound datasets is limited, and no pre-trained model is available. On the contrary, many pre-trained models for general audio tasks are available as general-purpose audio representations. This study explores the potential of general-purpose audio representations pre-trained on large-scale datasets for transfer learning in heart murmur detection. Experiments on the CirCor DigiScope heart sound dataset show that the recent self-supervised learning Masked Modeling Duo (M2D) outperforms previous methods with the results of a weighted accuracy of 0.832 and an unweighted average recall of 0.713. Experiments further confirm improved performance by ensembling M2D with other models. These results demonstrate the effectiveness of general-purpose audio representation in processing heart sounds and open the way for further applications. Our code is available online which runs on a 24 GB consumer GPU at https://github.com/nttcslab/m2d/tree/master/app/circor

* 4 pages, 1 figure, and 4 tables. Accepted by IEEE EMBC 2024

Via

Access Paper or Ask Questions

Guided Masked Self-Distillation Modeling for Distributed Multimedia Sensor Event Analysis

Apr 12, 2024

Masahiro Yasuda, Noboru Harada, Yasunori Ohishi, Shoichiro Saito, Akira Nakayama, Nobutaka Ono

Figure 1 for Guided Masked Self-Distillation Modeling for Distributed Multimedia Sensor Event Analysis

Figure 2 for Guided Masked Self-Distillation Modeling for Distributed Multimedia Sensor Event Analysis

Figure 3 for Guided Masked Self-Distillation Modeling for Distributed Multimedia Sensor Event Analysis

Figure 4 for Guided Masked Self-Distillation Modeling for Distributed Multimedia Sensor Event Analysis

Abstract:Observations with distributed sensors are essential in analyzing a series of human and machine activities (referred to as 'events' in this paper) in complex and extensive real-world environments. This is because the information obtained from a single sensor is often missing or fragmented in such an environment; observations from multiple locations and modalities should be integrated to analyze events comprehensively. However, a learning method has yet to be established to extract joint representations that effectively combine such distributed observations. Therefore, we propose Guided Masked sELf-Distillation modeling (Guided-MELD) for inter-sensor relationship modeling. The basic idea of Guided-MELD is to learn to supplement the information from the masked sensor with information from other sensors needed to detect the event. Guided-MELD is expected to enable the system to effectively distill the fragmented or redundant target event information obtained by the sensors without being overly dependent on any specific sensors. To validate the effectiveness of the proposed method in novel tasks of distributed multimedia sensor event analysis, we recorded two new datasets that fit the problem setting: MM-Store and MM-Office. These datasets consist of human activities in a convenience store and an office, recorded using distributed cameras and microphones. Experimental results on these datasets show that the proposed Guided-MELD improves event tagging and detection performance and outperforms conventional inter-sensor relationship modeling methods. Furthermore, the proposed method performed robustly even when sensors were reduced.

* 13page, 7figure, under review

Via

Access Paper or Ask Questions

Masked Modeling Duo: Towards a Universal Audio Pre-training Framework

Apr 09, 2024

Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

Figure 1 for Masked Modeling Duo: Towards a Universal Audio Pre-training Framework

Figure 2 for Masked Modeling Duo: Towards a Universal Audio Pre-training Framework

Figure 3 for Masked Modeling Duo: Towards a Universal Audio Pre-training Framework

Figure 4 for Masked Modeling Duo: Towards a Universal Audio Pre-training Framework

Abstract:Self-supervised learning (SSL) using masked prediction has made great strides in general-purpose audio representation. This study proposes Masked Modeling Duo (M2D), an improved masked prediction SSL, which learns by predicting representations of masked input signals that serve as training signals. Unlike conventional methods, M2D obtains a training signal by encoding only the masked part, encouraging the two networks in M2D to model the input. While M2D improves general-purpose audio representations, a specialized representation is essential for real-world applications, such as in industrial and medical domains. The often confidential and proprietary data in such domains is typically limited in size and has a different distribution from that in pre-training datasets. Therefore, we propose M2D for X (M2D-X), which extends M2D to enable the pre-training of specialized representations for an application X. M2D-X learns from M2D and an additional task and inputs background noise. We make the additional task configurable to serve diverse applications, while the background noise helps learn on small data and forms a denoising task that makes representation robust. With these design choices, M2D-X should learn a representation specialized to serve various application needs. Our experiments confirmed that the representations for general-purpose audio, specialized for the highly competitive AudioSet and speech domain, and a small-data medical task achieve top-level performance, demonstrating the potential of using our models as a universal audio pre-training framework. Our code is available online for future studies at https://github.com/nttcslab/m2d

* 15 pages, 6 figures, 15 tables. Accepted by TASLP

Via

Access Paper or Ask Questions

Refining Knowledge Transfer on Audio-Image Temporal Agreement for Audio-Text Cross Retrieval

Mar 16, 2024

Shunsuke Tsubaki, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Keisuke Imoto

Figure 1 for Refining Knowledge Transfer on Audio-Image Temporal Agreement for Audio-Text Cross Retrieval

Figure 2 for Refining Knowledge Transfer on Audio-Image Temporal Agreement for Audio-Text Cross Retrieval

Figure 3 for Refining Knowledge Transfer on Audio-Image Temporal Agreement for Audio-Text Cross Retrieval

Figure 4 for Refining Knowledge Transfer on Audio-Image Temporal Agreement for Audio-Text Cross Retrieval

Abstract:The aim of this research is to refine knowledge transfer on audio-image temporal agreement for audio-text cross retrieval. To address the limited availability of paired non-speech audio-text data, learning methods for transferring the knowledge acquired from a large amount of paired audio-image data to shared audio-text representation have been investigated, suggesting the importance of how audio-image co-occurrence is learned. Conventional approaches in audio-image learning assign a single image randomly selected from the corresponding video stream to the entire audio clip, assuming their co-occurrence. However, this method may not accurately capture the temporal agreement between the target audio and image because a single image can only represent a snapshot of a scene, though the target audio changes from moment to moment. To address this problem, we propose two methods for audio and image matching that effectively capture the temporal information: (i) Nearest Match wherein an image is selected from multiple time frames based on similarity with audio, and (ii) Multiframe Match wherein audio and image pairs of multiple time frames are used. Experimental results show that method (i) improves the audio-text retrieval performance by selecting the nearest image that aligns with the audio information and transferring the learned knowledge. Conversely, method (ii) improves the performance of audio-image retrieval while not showing significant improvements in audio-text retrieval performance. These results indicate that refining audio-image temporal agreement may contribute to better knowledge transfer to audio-text retrieval.

* Submitted to EUSIPCO2024

Via

Access Paper or Ask Questions