Soyeon Choe

An empirical study on speech restoration guided by self supervised speech representation

May 30, 2023

Jaeuk Byun, Youna Ji, Soo Whan Chung, Soyeon Choe, Min Seok Choi

Abstract: Enhancing speech quality is an indispensable yet difficult task as it is often complicated by a range of degradation factors. In addition to additive noise, reverberation, clipping, and speech attenuation can all adversely affect speech quality. Speech restoration aims to recover speech components from these distortions. This paper focuses on exploring the impact of self-supervised speech representation learning on the speech restoration task. Specifically, we employ speech representation in various speech restoration networks and evaluate their performance under complicated distortion scenarios. Our experiments demonstrate that the contextual information provided by the self-supervised speech representation can enhance speech restoration performance in various distortion scenarios, while also increasing robustness against the duration of speech attenuation and mismatched test conditions.

* To be presented at ICASSP 2023

Diffusion-based Generative Speech Source Separation

Nov 02, 2022

Robin Scheibler, Youna Ji, Soo-Whan Chung, Jaeuk Byun, Soyeon Choe, Min-Seok Choi

Abstract: We propose DiffSep, a new single channel source separation method based on score-matching of a stochastic differential equation (SDE). We craft a tailored continuous time diffusion-mixing process starting from the separated sources and converging to a Gaussian distribution centered on their mixture. This formulation lets us apply the machinery of score-based generative modelling. First, we train a neural network to approximate the score function of the marginal probabilities of the diffusion-mixing process. Then, we use it to solve the reverse time SDE that progressively separates the sources starting from their mixture. We propose a modified training strategy to handle model mismatch and source permutation ambiguity. Experiments on the WSJ0 2mix dataset demonstrate the potential of the method. Furthermore, the method is also suitable for speech enhancement and shows performance competitive with prior work on the VoiceBank-DEMAND dataset.

* 5 pages, 3 figures, 2 tables. Submitted to ICASSP 2023

Look Who's Talking: Active Speaker Detection in the Wild

Aug 17, 2021

You Jin Kim, Hee-Soo Heo, Soyeon Choe, Soo-Whan Chung, Yoohwan Kwon, Bong-Jin Lee, Youngki Kwon, Joon Son Chung

Abstract: In this work, we present a novel audio-visual dataset for active speaker detection in the wild. A speaker is considered active when his or her face is visible and the voice is audible simultaneously. Although active speaker detection is a crucial pre-processing step for many audio-visual tasks, there is no existing dataset of natural human speech to evaluate the performance of active speaker detection. We therefore curate the Active Speakers in the Wild (ASW) dataset which contains videos and co-occurring speech segments with dense speech activity labels. Videos and timestamps of audible segments are parsed and adopted from VoxConverse, an existing speaker diarisation dataset that consists of videos in the wild. Face tracks are extracted from the videos and active segments are annotated based on the timestamps of VoxConverse in a semi-automatic way. Two reference systems, a self-supervised system and a fully supervised one, are evaluated on the dataset to provide the baseline performances of ASW. Cross-domain evaluation is conducted in order to show the negative effect of dubbed videos in the training data.

* To appear in Interspeech 2021. Data will be available from https://github.com/clovaai/lookwhostalking

FaceFilter: Audio-visual speech separation using still images

May 14, 2020

Soo-Whan Chung, Soyeon Choe, Joon Son Chung, Hong-Goo Kang

Abstract: The objective of this paper is to separate a target speaker's speech from a mixture of two speakers using a deep audio-visual speech separation network. Unlike previous works that used lip movement in video clips or pre-enrolled speaker information as an auxiliary conditional feature, we use a single face image of the target speaker. In this task, the conditional feature is obtained from facial appearance in a cross-modal biometric task, where audio and visual identity representations are shared in a latent space. Identities learnt from facial images force the network to isolate the matched speakers and extract their voices from mixed speech. This solves the permutation problem caused by swapped channel outputs, which frequently occurs in speech separation tasks. The proposed method is far more practical than video-based speech separation since user profile images are readily available on many platforms. Also, unlike speaker-aware separation methods, it is applicable to separation with unseen speakers who have never been enrolled before. We show strong qualitative and quantitative results on challenging real-world examples.

* Under submission as a conference paper. Video examples: https://youtu.be/ku9xoLh62E
