Janghoon Cho

Domain Agnostic Few-shot Learning for Speaker Verification

Jun 28, 2022

Seunghan Yang, Debasmit Das, Janghoon Cho, Hyoungwoo Park, Sungrack Yun

Abstract: Deep learning models for verification systems often fail to generalize to new users and new environments, even though they learn highly discriminative features. To address this problem, we propose a few-shot domain generalization framework that learns to tackle distribution shift for new users and new domains. Our framework consists of domain-specific and domain-aggregation networks, which are the experts on specific and combined domains, respectively. By using these networks, we generate episodes that mimic the presence of both novel users and novel domains in the training phase, which eventually produces better generalization. To save memory, we reduce the number of domain-specific networks by clustering similar domains together. Through extensive evaluation on artificially generated noise domains, we explicitly show the generalization ability of our framework. In addition, we apply our proposed methods to an existing competitive architecture on the standard benchmark, which yields further performance improvements.

* Proceedings of INTERSPEECH 2022

SubSpectral Normalization for Neural Audio Data Processing

Mar 25, 2021

Simyung Chang, Hyoungwoo Park, Janghoon Cho, Hyunsin Park, Sungrack Yun, Kyuwoong Hwang

Abstract: Convolutional neural networks are widely used in various machine learning domains. In image processing, features can be obtained by applying 2D convolution over all spatial dimensions of the input. In the audio case, however, frequency-domain input such as a Mel-spectrogram has unique characteristics in the frequency dimension. Thus, there is a need for a method that allows the 2D convolution layer to handle the frequency dimension differently. In this work, we introduce SubSpectral Normalization (SSN), which splits the input frequency dimension into several groups (sub-bands) and performs a different normalization for each group. SSN also includes an affine transformation that can be applied to each group. Our method removes the inter-frequency deflection while the network learns a frequency-aware characteristic. In experiments with audio data, we observed that SSN can efficiently improve the network's performance.

* 4 pages, ICASSP '21 accepted
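
The sub-band normalization described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it works on a single 2-D spectrogram stored as nested lists (rows are frequency bins, columns are time frames), computes statistics per sub-band over that one example rather than over a batch, and the function name `subspectral_norm` and the parameters `gamma`, `beta`, and `eps` are hypothetical.

```python
import math

def subspectral_norm(spec, num_groups, gamma=None, beta=None, eps=1e-5):
    """Normalize each frequency sub-band of a [freq][time] spectrogram separately.

    gamma and beta are optional per-group affine parameters (assumed names).
    """
    num_bins = len(spec)
    if num_bins % num_groups != 0:
        raise ValueError("frequency bins must divide evenly into groups")
    band = num_bins // num_groups
    gamma = gamma or [1.0] * num_groups
    beta = beta or [0.0] * num_groups
    out = []
    for g in range(num_groups):
        rows = spec[g * band:(g + 1) * band]
        values = [v for row in rows for v in row]
        # Mean and variance are computed only over this sub-band.
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = gamma[g] / math.sqrt(var + eps)
        for row in rows:
            out.append([(v - mean) * scale + beta[g] for v in row])
    return out

# Two sub-bands with very different magnitudes end up on a comparable scale.
spec = [[1.0, 2.0], [3.0, 4.0], [100.0, 200.0], [300.0, 400.0]]
normalized = subspectral_norm(spec, num_groups=2)
```

Normalizing the whole spectrogram with one mean and variance would let the high-magnitude band dominate the statistics; splitting by sub-band avoids that in this toy input.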

End-to-End Lane Marker Detection via Row-wise Classification

May 06, 2020

Seungwoo Yoo, Heeseok Lee, Heesoo Myeong, Sungrack Yun, Hyoungwoo Park, Janghoon Cho, Duck Hoon Kim

Abstract: In autonomous driving, detecting reliable and accurate lane marker positions is a crucial yet challenging task. The conventional approaches to the lane marker detection problem perform a pixel-level dense prediction task followed by sophisticated post-processing, which is inevitable since lane markers are typically represented by a collection of line segments without thickness. In this paper, we propose a method that performs direct lane marker vertex prediction in an end-to-end manner, i.e., without any of the post-processing steps required in the pixel-level dense prediction task. Specifically, we translate the lane marker detection problem into a row-wise classification task, which takes advantage of the innate shape of lane markers but, surprisingly, has not been explored well. To compactly extract sufficient information about lane markers that spread from left to right in an image, we devise a novel layer that successively compresses horizontal components and thus enables an end-to-end lane marker detection system in which the final lane marker positions are obtained simply via argmax operations at test time. Experimental results demonstrate the effectiveness of the proposed method, which is on par with or outperforms the state-of-the-art methods on two popular lane marker detection benchmarks, i.e., TuSimple and CULane.
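
The argmax decoding step mentioned in the abstract above might look like the following sketch. It assumes the network emits, for one lane, a score for every column in every image row; the function name `decode_lane` and the toy scores are hypothetical, and the sketch ignores details such as deciding whether the lane is visible in a given row.

```python
def decode_lane(row_logits):
    """Return one (row, column) vertex per image row via argmax over columns."""
    vertices = []
    for row_idx, logits in enumerate(row_logits):
        # Index of the highest-scoring column in this row.
        col_idx = max(range(len(logits)), key=logits.__getitem__)
        vertices.append((row_idx, col_idx))
    return vertices

# Toy scores for a 3-row, 4-column grid.
row_logits = [
    [0.1, 2.0, 0.3, 0.0],
    [0.0, 0.5, 1.7, 0.2],
    [0.0, 0.1, 0.4, 3.1],
]
print(decode_lane(row_logits))  # [(0, 1), (1, 2), (2, 3)]
```

Because each row yields exactly one column index, the lane comes out directly as an ordered list of vertices, with no clustering or curve fitting over a dense pixel mask.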

Weakly Labeled Sound Event Detection Using Tri-training and Adversarial Learning

Oct 14, 2019

Hyoungwoo Park, Sungrack Yun, Jungyun Eum, Janghoon Cho, Kyuwoong Hwang

Abstract: This paper considers a semi-supervised learning framework for weakly labeled polyphonic sound event detection in task 4 of the DCASE 2019 challenge, combining tri-training and adversarial learning. The goal of task 4 is to detect the onsets and offsets of multiple sound events in a single audio clip. The entire dataset consists of synthetic data with strong labels (sound event labels with boundaries) and real data that is either weakly labeled (sound event labels only) or unlabeled. Given this dataset, we apply tri-training, in which two different classifiers are used to obtain pseudo labels on the weakly labeled and unlabeled data, and the final classifier is trained using the strongly labeled data together with the weakly labeled and unlabeled data with pseudo labels. We also apply adversarial learning to reduce the domain gap between the real and synthetic data. We evaluated our learning framework on the validation set of the task 4 dataset, and in the experiments it shows a considerable performance improvement over the baseline model.

* 5 pages, DCASE 2019 Workshop
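
The pseudo-labeling step in the abstract above can be sketched as follows. The abstract does not say how the outputs of the two classifiers are combined, so the agreement rule used here is an assumption borrowed from classic tri-training. The function name `pseudo_label` and the example tags are hypothetical, and each clip carries a single tag for brevity, whereas the actual task is polyphonic.

```python
def pseudo_label(preds_a, preds_b):
    """Keep (clip index, label) only where two classifiers agree (assumed rule)."""
    return [(i, a) for i, (a, b) in enumerate(zip(preds_a, preds_b)) if a == b]

# Predictions of two hypothetical classifiers on four unlabeled clips.
preds_a = ["speech", "dog", "alarm", "dog"]
preds_b = ["speech", "cat", "alarm", "dog"]
print(pseudo_label(preds_a, preds_b))  # [(0, 'speech'), (2, 'alarm'), (3, 'dog')]
```

The clips that survive this filter would then be added, with their pseudo labels, to the training set of the final classifier.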

Acoustic Scene Classification Based on a Large-margin Factorized CNN

Oct 14, 2019

Janghoon Cho, Sungrack Yun, Hyoungwoo Park, Jungyun Eum, Kyuwoong Hwang

Abstract: In this paper, we present an acoustic scene classification framework based on a large-margin factorized convolutional neural network (CNN). We adopt the factorized CNN to learn patterns in the time-frequency domain by factorizing the 2D kernel into two separate 1D kernels. The factorized kernel learns the main components of two patterns, the long-term ambient sounds and the short-term event sounds, which are the key patterns in audio scene classification. In training our model, we use a loss function based on triplet sampling such that the distances between samples of the same audio scene from different environments are minimized while the distances between samples of different audio scenes are maximized. With this loss function, samples from the same audio scene are clustered independently of the environment, and thus we obtain a classifier with better generalization ability in unseen environments. We evaluated our audio scene classification framework on the dataset of the DCASE 2019 challenge task 1A. Experimental results show that the proposed algorithm improves on the performance of the baseline network and reduces the number of parameters to one third. Furthermore, the performance gain is higher on unseen data, which shows that the proposed algorithm has better generalization ability.

* 5 pages, DCASE 2019 Workshop
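
The idea of factorizing a 2D kernel into two 1D kernels, mentioned in the abstract above, rests on a simple identity that the sketch below checks numerically: filtering with a rank-1 2D kernel gives the same result as filtering along one axis and then along the other. This is an illustration of the underlying arithmetic only, not the paper's architecture, which may differ (for example in channels or in what sits between the two 1D layers). The helper names and the choice of rows as the frequency axis are assumptions.

```python
def conv2d_valid(x, k):
    """Plain 2D cross-correlation with 'valid' padding on nested lists."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def factorized_conv(x, k_freq, k_time):
    """Apply a (kh x 1) kernel along frequency, then a (1 x kw) kernel along time."""
    col = [[w] for w in k_freq]   # kh x 1 kernel
    row = [list(k_time)]          # 1 x kw kernel
    return conv2d_valid(conv2d_valid(x, col), row)

x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
k_freq, k_time = [1, 0, -1], [1, 2, 1]

# Rank-1 3x3 kernel built as the outer product of the two 1D kernels: 9 weights.
full_kernel = [[f * t for t in k_time] for f in k_freq]
full_out = conv2d_valid(x, full_kernel)
# The factorized version stores only 3 + 3 = 6 weights.
fact_out = factorized_conv(x, k_freq, k_time)
print(full_out == fact_out)  # True
```

For a single k x k kernel the factorized form stores 2k weights instead of k squared, which is one source of the parameter savings such designs aim for.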

An End-to-End Text-independent Speaker Verification Framework with a Keyword Adversarial Network

Aug 06, 2019

Sungrack Yun, Janghoon Cho, Jungyun Eum, Wonil Chang, Kyuwoong Hwang

Abstract: This paper presents an end-to-end text-independent speaker verification framework that jointly considers a speaker embedding (SE) network and an automatic speech recognition (ASR) network. The SE network learns to output an embedding vector that distinguishes the speaker characteristics of the input utterance, while the ASR network learns to recognize the phonetic context of the input. In training our speaker verification framework, we consider both triplet loss minimization and the adversarial gradient of the ASR network to obtain more discriminative and text-independent speaker embedding vectors. With the triplet loss, the distances between the embedding vectors of the same speaker are minimized while those of different speakers are maximized. With the adversarial gradient of the ASR network, the text dependency of the speaker embedding vector can be reduced. In the experiments, we evaluated our speaker verification framework on the LibriSpeech and CHiME 2013 datasets, and the results show that it achieves a lower equal error rate and better text independence than the other approaches.

* To appear in INTERSPEECH 2019
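
The triplet objective described in the abstract above can be written down in a few lines. This is a minimal sketch under stated assumptions: Euclidean distance, a margin of 0.5, and the function names are all hypothetical choices, and the adversarial ASR branch is not shown.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Same-speaker pair is close and the other speaker is far: no loss.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))  # 0.0
# Same-speaker pair is far and the other speaker is close: positive loss.
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))  # 4.5
```

Minimizing this quantity pulls embeddings of the same speaker together and pushes embeddings of different speakers at least `margin` further away.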
