Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kentaro Seki

SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis

Aug 13, 2024

Osamu Take, Shinnosuke Takamichi, Kentaro Seki, Yoshiaki Bando, Hiroshi Saruwatari

Abstract:This paper presents SaSLaW, a spontaneous dialogue speech corpus containing synchronous recordings of what speakers speak, listen to, and watch. Humans consider the diverse environmental factors and then control the features of their utterances in face-to-face voice communications. Spoken dialogue systems capable of this adaptation to these audio environments enable natural and seamless communications. SaSLaW was developed to model human-speech adjustment for audio environments via first-person audio-visual perceptions in spontaneous dialogues. We propose the construction methodology of SaSLaW and display the analysis result of the corpus. We additionally conducted an experiment to develop text-to-speech models using SaSLaW and evaluate their performance of adaptations to audio environments. The results indicate that models incorporating hearing-audio data output more plausible speech tailored to diverse audio environments than the vanilla text-to-speech model.

* 5 pages, accepted for INTERSPEECH 2024

Via

Access Paper or Ask Questions

J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling

Jul 22, 2024

Wataru Nakata, Kentaro Seki, Hitomi Yanaka, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari

Figure 1 for J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling

Figure 2 for J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling

Figure 3 for J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling

Figure 4 for J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling

Abstract:Spoken dialogue plays a crucial role in human-AI interactions, necessitating dialogue-oriented spoken language models (SLMs). To develop versatile SLMs, large-scale and diverse speech datasets are essential. Additionally, to ensure hiqh-quality speech generation, the data must be spontaneous like in-wild data and must be acoustically clean with noise removed. Despite the critical need, no open-source corpus meeting all these criteria has been available. This study addresses this gap by constructing and releasing a large-scale spoken dialogue corpus, named Japanese Corpus for Human-AI Talks (J-CHAT), which is publicly accessible. Furthermore, this paper presents a language-independent method for corpus construction and describes experiments on dialogue generation using SLMs trained on J-CHAT. Experimental results indicate that the collected data from multiple domains by our method improve the naturalness and meaningfulness of dialogue generation.

* 8 pages, 6 figures

Via

Access Paper or Ask Questions

Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals

Jun 25, 2024

Kentaro Seki, Shinnosuke Takamichi, Norihiro Takamune, Yuki Saito, Kanami Imamura, Hiroshi Saruwatari

Figure 1 for Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals

Figure 2 for Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals

Figure 3 for Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals

Figure 4 for Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals

Abstract:This paper proposes a new task called spatial voice conversion, which aims to convert a target voice while preserving spatial information and non-target signals. Traditional voice conversion methods focus on single-channel waveforms, ignoring the stereo listening experience inherent in human hearing. Our baseline approach addresses this gap by integrating blind source separation (BSS), voice conversion (VC), and spatial mixing to handle multi-channel waveforms. Through experimental evaluations, we organize and identify the key challenges inherent in this task, such as maintaining audio quality and accurately preserving spatial information. Our results highlight the fundamental difficulties in balancing these aspects, providing a benchmark for future research in spatial voice conversion. The proposed method's code is publicly available to encourage further exploration in this domain.

* Accepted to Interspeech 2024

Via

Access Paper or Ask Questions

Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment

Jun 11, 2024

Takuto Igarashi, Yuki Saito, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari

Figure 1 for Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment

Figure 2 for Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment

Figure 3 for Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment

Figure 4 for Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment

Abstract:We propose noise-robust voice conversion (VC) which takes into account the recording quality and environment of noisy source speech. Conventional denoising training improves the noise robustness of a VC model by learning noisy-to-clean VC process. However, the naturalness of the converted speech is limited when the noise of the source speech is unseen during the training. To this end, our proposed training conditions a VC model on two latent variables representing the recording quality and environment of the source speech. These latent variables are derived from deep neural networks pre-trained on recording quality assessment and acoustic scene classification and calculated in an utterance-wise or frame-wise manner. As a result, the trained VC model can explicitly learn information about speech degradation during the training. Objective and subjective evaluations show that our training improves the quality of the converted speech compared to the conventional training.

* 5 pages, accepted for INTERSPEECH 2024, audio samples: http://y-saito.sakura.ne.jp/sython/Corpus/SRC4VC/IS2024_CDT_supplementary/demo_cdt.html

Via

Access Paper or Ask Questions

SRC4VC: Smartphone-Recorded Corpus for Voice Conversion Benchmark

Jun 11, 2024

Yuki Saito, Takuto Igarashi, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari

Figure 1 for SRC4VC: Smartphone-Recorded Corpus for Voice Conversion Benchmark

Figure 2 for SRC4VC: Smartphone-Recorded Corpus for Voice Conversion Benchmark

Figure 3 for SRC4VC: Smartphone-Recorded Corpus for Voice Conversion Benchmark

Figure 4 for SRC4VC: Smartphone-Recorded Corpus for Voice Conversion Benchmark

Abstract:We present SRC4VC, a new corpus containing 11 hours of speech recorded on smartphones by 100 Japanese speakers. Although high-quality multi-speaker corpora can advance voice conversion (VC) technologies, they are not always suitable for testing VC when low-quality speech recording is given as the input. To this end, we first asked 100 crowdworkers to record their voice samples using smartphones. Then, we annotated the recorded samples with speaker-wise recording-quality scores and utterance-wise perceived emotion labels. We also benchmark SRC4VC on any-to-any VC, in which we trained a multi-speaker VC model on high-quality speech and used the SRC4VC speakers' voice samples as the source in VC. The results show that the recording quality mismatch between the training and evaluation data significantly degrades the VC performance, which can be improved by applying speech enhancement to the low-quality source speech samples.

* Accepted for INTERSPEECH 2024, corpus project page: https://y-saito.sakura.ne.jp/sython/Corpus/SRC4VC/index.html

Via

Access Paper or Ask Questions

Diversity-based core-set selection for text-to-speech with linguistic and acoustic features

Sep 15, 2023

Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari

Figure 1 for Diversity-based core-set selection for text-to-speech with linguistic and acoustic features

Figure 2 for Diversity-based core-set selection for text-to-speech with linguistic and acoustic features

Figure 3 for Diversity-based core-set selection for text-to-speech with linguistic and acoustic features

Figure 4 for Diversity-based core-set selection for text-to-speech with linguistic and acoustic features

Abstract:This paper proposes a method for extracting a lightweight subset from a text-to-speech (TTS) corpus ensuring synthetic speech quality. In recent years, methods have been proposed for constructing large-scale TTS corpora by collecting diverse data from massive sources such as audiobooks and YouTube. Although these methods have gained significant attention for enhancing the expressive capabilities of TTS systems, they often prioritize collecting vast amounts of data without considering practical constraints like storage capacity and computation time in training, which limits the available data quantity. Consequently, the need arises to efficiently collect data within these volume constraints. To address this, we propose a method for selecting the core subset~(known as \textit{core-set}) from a TTS corpus on the basis of a \textit{diversity metric}, which measures the degree to which a subset encompasses a wide range. Experimental results demonstrate that our proposed method performs significantly better than the baseline phoneme-balanced data selection across language and corpus size.

Via

Access Paper or Ask Questions

How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics

Jun 01, 2023

Joonyong Park, Shinnosuke Takamichi, Tomohiko Nakamura, Kentaro Seki, Detai Xin, Hiroshi Saruwatari

Figure 1 for How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics

Figure 2 for How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics

Figure 3 for How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics

Figure 4 for How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics

Abstract:We examine the speech modeling potential of generative spoken language modeling (GSLM), which involves using learned symbols derived from data rather than phonemes for speech analysis and synthesis. Since GSLM facilitates textless spoken language processing, exploring its effectiveness is critical for paving the way for novel paradigms in spoken-language processing. This paper presents the findings of GSLM's encoding and decoding effectiveness at the spoken-language and speech levels. Through speech resynthesis experiments, we revealed that resynthesis errors occur at the levels ranging from phonology to syntactics and GSLM frequently resynthesizes natural but content-altered speech.

* Accepted to INTERSPEECH 2023

Via

Access Paper or Ask Questions

Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection

Oct 26, 2022

Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari

Abstract:This paper proposes a method for selecting training data for text-to-speech (TTS) synthesis from dark data. TTS models are typically trained on high-quality speech corpora that cost much time and money for data collection, which makes it very challenging to increase speaker variation. In contrast, there is a large amount of data whose availability is unknown (a.k.a, "dark data"), such as YouTube videos. To utilize data other than TTS corpora, previous studies have selected speech data from the corpora on the basis of acoustic quality. However, considering that TTS models robust to data noise have been proposed, we should select data on the basis of its importance as training data to the given TTS model, not the quality of speech itself. Our method with a loop of training and evaluation selects training data on the basis of the automatically predicted quality of synthetic speech of a given TTS model. Results of evaluations using YouTube data reveal that our method outperforms the conventional acoustic-quality-based method.

* Submitted to ICASSP 2023

Via

Access Paper or Ask Questions