Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Tatsuya Komatsu

Self-supervised learning method using multiple sampling strategies for general-purpose audio representation

May 25, 2025

Ibuki Kuroyanagi, Tatsuya Komatsu

Abstract:We propose a self-supervised learning method using multiple sampling strategies to obtain general-purpose audio representation. Multiple sampling strategies are used in the proposed method to construct contrastive losses from different perspectives and learn representations based on them. In this study, in addition to the widely used clip-level sampling strategy, we introduce two new strategies, a frame-level strategy and a task-specific strategy. The proposed multiple strategies improve the performance of frame-level classification and other tasks like pitch detection, which are not the focus of the conventional single clip-level sampling strategy. We pre-trained the method on a subset of Audioset and applied it to a downstream task with frozen weights. The proposed method improved clip classification, sound event detection, and pitch detection performance by 25%, 20%, and 3.6%.

* 5 pages, 1 figure, 2 tables, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Via

Access Paper or Ask Questions

Music Tagging with Classifier Group Chains

Jan 09, 2025

Takuya Hasumi, Tatsuya Komatsu, Yusuke Fujita

Figure 1 for Music Tagging with Classifier Group Chains

Figure 2 for Music Tagging with Classifier Group Chains

Figure 3 for Music Tagging with Classifier Group Chains

Figure 4 for Music Tagging with Classifier Group Chains

Abstract:We propose music tagging with classifier chains that model the interplay of music tags. Most conventional methods estimate multiple tags independently by treating them as multiple independent binary classification problems. This treatment overlooks the conditional dependencies among music tags, leading to suboptimal tagging performance. Unlike most music taggers, the proposed method sequentially estimates each tag based on the idea of the classifier chains. Beyond the naive classifier chains, the proposed method groups the multiple tags by category, such as genre, and performs chains by unit of groups, which we call \textit{classifier group chains}. Our method allows the modeling of the dependence between tag groups. We evaluate the effectiveness of the proposed method for music tagging performance through music tagging experiments using the MTG-Jamendo dataset. Furthermore, we investigate the effective order of chains for music tagging.

* 5 pages, 2 figures

Via

Access Paper or Ask Questions

Pre-training with Synthetic Patterns for Audio

Oct 01, 2024

Yuchi Ishikawa, Tatsuya Komatsu, Yoshimitsu Aoki

Abstract:In this paper, we propose to pre-train audio encoders using synthetic patterns instead of real audio data. Our proposed framework consists of two key elements. The first one is Masked Autoencoder (MAE), a self-supervised learning framework that learns from reconstructing data from randomly masked counterparts. MAEs tend to focus on low-level information such as visual patterns and regularities within data. Therefore, it is unimportant what is portrayed in the input, whether it be images, audio mel-spectrograms, or even synthetic patterns. This leads to the second key element, which is synthetic data. Synthetic data, unlike real audio, is free from privacy and licensing infringement issues. By combining MAEs and synthetic patterns, our framework enables the model to learn generalized feature representations without real data, while addressing the issues related to real audio. To evaluate the efficacy of our framework, we conduct extensive experiments across a total of 13 audio tasks and 17 synthetic datasets. The experiments provide insights into which types of synthetic patterns are effective for audio. Our results demonstrate that our framework achieves performance comparable to models pre-trained on AudioSet-2M and partially outperforms image-based pre-training methods.

* Submitted to ICASSP'25

Via

Access Paper or Ask Questions

DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information

Sep 18, 2024

Shota Nakada, Taichi Nishimura, Hokuto Munakata, Masayoshi Kondo, Tatsuya Komatsu

Figure 1 for DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information

Figure 2 for DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information

Figure 3 for DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information

Figure 4 for DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information

Abstract:Current audio-visual representation learning can capture rough object categories (e.g., ``animals'' and ``instruments''), but it lacks the ability to recognize fine-grained details, such as specific categories like ``dogs'' and ``flutes'' within animals and instruments. To address this issue, we introduce DETECLAP, a method to enhance audio-visual representation learning with object information. Our key idea is to introduce an audio-visual label prediction loss to the existing Contrastive Audio-Visual Masked AutoEncoder to enhance its object awareness. To avoid costly manual annotations, we prepare object labels from both audio and visual inputs using state-of-the-art language-audio models and object detectors. We evaluate the method of audio-visual retrieval and classification using the VGGSound and AudioSet20K datasets. Our method achieves improvements in recall@10 of +1.5% and +1.2% for audio-to-visual and visual-to-audio retrieval, respectively, and an improvement in accuracy of +0.6% for audio-visual classification.

* under review

Via

Access Paper or Ask Questions

Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection

Aug 06, 2024

Taichi Nishimura, Shota Nakada, Hokuto Munakata, Tatsuya Komatsu

Figure 1 for Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection

Figure 2 for Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection

Figure 3 for Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection

Figure 4 for Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection

Abstract:We propose Lighthouse, a user-friendly library for reproducible video moment retrieval and highlight detection (MR-HD). Although researchers proposed various MR-HD approaches, the research community holds two main issues. The first is a lack of comprehensive and reproducible experiments across various methods, datasets, and video-text features. This is because no unified training and evaluation codebase covers multiple settings. The second is user-unfriendly design. Because previous works use different libraries, researchers set up individual environments. In addition, most works release only the training codes, requiring users to implement the whole inference process of MR-HD. Lighthouse addresses these issues by implementing a unified reproducible codebase that includes six models, three features, and five datasets. In addition, it provides an inference API and web demo to make these methods easily accessible for researchers and developers. Our experiments demonstrate that Lighthouse generally reproduces the reported scores in the reference papers. The code is available at https://github.com/line/lighthouse.

* 6 pages; library tech report

Via

Access Paper or Ask Questions

Audio Fingerprinting with Holographic Reduced Representations

Jun 19, 2024

Yusuke Fujita, Tatsuya Komatsu

Abstract:This paper proposes an audio fingerprinting model with holographic reduced representation (HRR). The proposed method reduces the number of stored fingerprints, whereas conventional neural audio fingerprinting requires many fingerprints for each audio track to achieve high accuracy and time resolution. We utilize HRR to aggregate multiple fingerprints into a composite fingerprint via circular convolution and summation, resulting in fewer fingerprints with the same dimensional space as the original. Our search method efficiently finds a combined fingerprint in which a query fingerprint exists. Using HRR's inverse operation, it can recover the relative position within a combined fingerprint, retaining the original time resolution. Experiments show that our method can reduce the number of fingerprints with modest accuracy degradation while maintaining the time resolution, outperforming simple decimation and summation-based aggregation methods.

* accepted at Interspeech 2024

Via

Access Paper or Ask Questions

Universal Score-based Speech Enhancement with High Content Preservation

Jun 18, 2024

Robin Scheibler, Yusuke Fujita, Yuma Shirahata, Tatsuya Komatsu

Abstract:We propose UNIVERSE++, a universal speech enhancement method based on score-based diffusion and adversarial training. Specifically, we improve the existing UNIVERSE model that decouples clean speech feature extraction and diffusion. Our contributions are three-fold. First, we make several modifications to the network architecture, improving training stability and final performance. Second, we introduce an adversarial loss to promote learning high quality speech features. Third, we propose a low-rank adaptation scheme with a phoneme fidelity loss to improve content preservation in the enhanced speech. In the experiments, we train a universal enhancement model on a large scale dataset of speech degraded by noise, reverberation, and various distortions. The results on multiple public benchmark datasets demonstrate that UNIVERSE++ compares favorably to both discriminative and generative baselines for a wide range of qualitative and intelligibility metrics.

* 5 pages, 5 figures, accepted at Interspeech 2024

Via

Access Paper or Ask Questions

Keep Decoding Parallel with Effective Knowledge Distillation from Language Models to End-to-end Speech Recognisers

Jan 22, 2024

Michael Hentschel, Yuta Nishikawa, Tatsuya Komatsu, Yusuke Fujita

Figure 1 for Keep Decoding Parallel with Effective Knowledge Distillation from Language Models to End-to-end Speech Recognisers

Figure 2 for Keep Decoding Parallel with Effective Knowledge Distillation from Language Models to End-to-end Speech Recognisers

Figure 3 for Keep Decoding Parallel with Effective Knowledge Distillation from Language Models to End-to-end Speech Recognisers

Figure 4 for Keep Decoding Parallel with Effective Knowledge Distillation from Language Models to End-to-end Speech Recognisers

Abstract:This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers. To distil the teacher's knowledge, we use an attention decoder that learns from BERT's token probabilities. Our method shows that language model (LM) information can be more effectively distilled into an ASR model using both the intermediate layers and the final layer. By using the intermediate layers as distillation target, we can more effectively distil LM knowledge into the lower network layers. Using our method, we achieve better recognition accuracy than with shallow fusion of an external LM, allowing us to maintain fast parallel decoding. Experiments on the LibriSpeech dataset demonstrate the effectiveness of our approach in enhancing greedy decoding with connectionist temporal classification (CTC).

* Accepted at ICASSP 2024

Via

Access Paper or Ask Questions

PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions

Sep 15, 2023

Reo Shimizu, Ryuichi Yamamoto, Masaya Kawamura, Yuma Shirahata, Hironori Doi, Tatsuya Komatsu, Kentaro Tachibana

Abstract:We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions. To control speaker identity within the prompt-based TTS framework, we introduce the concept of speaker prompt, which describes voice characteristics (e.g., gender-neutral, young, old, and muffled) designed to be approximately independent of speaking style. Since there is no large-scale dataset containing speaker prompts, we first construct a dataset based on the LibriTTS-R corpus with manually annotated speaker prompts. We then employ a diffusion-based acoustic model with mixture density networks to model diverse speaker factors in the training data. Unlike previous studies that rely on style prompts describing only a limited aspect of speaker individuality, such as pitch, speaking speed, and energy, our method utilizes an additional speaker prompt to effectively learn the mapping from natural language descriptions to the acoustic features of diverse speakers. Our subjective evaluation results show that the proposed method can better control speaker characteristics than the methods without the speaker prompt. Audio samples are available at https://reppy4620.github.io/demo.promptttspp/.

* Submitted to ICASSP 2024

Via

Access Paper or Ask Questions

Audio Difference Learning for Audio Captioning

Sep 15, 2023

Tatsuya Komatsu, Yusuke Fujita, Kazuya Takeda, Tomoki Toda

Figure 1 for Audio Difference Learning for Audio Captioning

Figure 2 for Audio Difference Learning for Audio Captioning

Figure 3 for Audio Difference Learning for Audio Captioning

Abstract:This study introduces a novel training paradigm, audio difference learning, for improving audio captioning. The fundamental concept of the proposed learning method is to create a feature representation space that preserves the relationship between audio, enabling the generation of captions that detail intricate audio information. This method employs a reference audio along with the input audio, both of which are transformed into feature representations via a shared encoder. Captions are then generated from these differential features to describe their differences. Furthermore, a unique technique is proposed that involves mixing the input audio with additional audio, and using the additional audio as a reference. This results in the difference between the mixed audio and the reference audio reverting back to the original input audio. This allows the original input's caption to be used as the caption for their difference, eliminating the need for additional annotations for the differences. In the experiments using the Clotho and ESC50 datasets, the proposed method demonstrated an improvement in the SPIDEr score by 7% compared to conventional methods.

* submitted to ICASSP2024

Via

Access Paper or Ask Questions