Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hoirin Kim

Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis

Jan 12, 2025

Minu Kim, Kangwook Jang, Hoirin Kim

Figure 1 for Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis

Figure 2 for Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis

Figure 3 for Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis

Figure 4 for Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis

Abstract:This paper examines how linguistic similarity affects cross-lingual phonetic representation in speech processing for low-resource languages, emphasizing effective source language selection. Previous cross-lingual research has used various source languages to enhance performance for the target low-resource language without thorough consideration of selection. Our study stands out by providing an in-depth analysis of language selection, supported by a practical approach to assess phonetic proximity among multiple language families. We investigate how within-family similarity impacts performance in multilingual training, which aids in understanding language dynamics. We also evaluate the effect of using phonologically similar languages, regardless of family. For the phoneme recognition task, utilizing phonologically similar languages consistently achieves a relative improvement of 55.6% over monolingual training, even surpassing the performance of a large-scale self-supervised learning model. Multilingual training within the same language family demonstrates that higher phonological similarity enhances performance, while lower similarity results in degraded performance compared to monolingual training.

* 10 pages, 5 figures, accepted to ICASSP 2025

Via

Access Paper or Ask Questions

Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition

Jul 04, 2024

Sungnyun Kim, Kangwook Jang, Sangmin Bae, Hoirin Kim, Se-Young Yun

Abstract:Audio-visual speech recognition (AVSR) aims to transcribe human speech using both audio and video modalities. In practical environments with noise-corrupted audio, the role of video information becomes crucial. However, prior works have primarily focused on enhancing audio features in AVSR, overlooking the importance of video features. In this study, we strengthen the video features by learning three temporal dynamics in video data: context order, playback direction, and the speed of video frames. Cross-modal attention modules are introduced to enrich video features with audio information so that speech variability can be taken into account when training on the video temporal dynamics. Based on our approach, we achieve the state-of-the-art performance on the LRS2 and LRS3 AVSR benchmarks for the noise-dominant settings. Our approach excels in scenarios especially for babble and speech noise, indicating the ability to distinguish the speech signal that should be recognized from lip movements in the video modality. We support the validity of our methodology by offering the ablation experiments for the temporal dynamics losses and the cross-modal attention architecture design.

Via

Access Paper or Ask Questions

One-Class Learning with Adaptive Centroid Shift for Audio Deepfake Detection

Jun 24, 2024

Hyun Myung Kim, Kangwook Jang, Hoirin Kim

Abstract:As speech synthesis systems continue to make remarkable advances in recent years, the importance of robust deepfake detection systems that perform well in unseen systems has grown. In this paper, we propose a novel adaptive centroid shift (ACS) method that updates the centroid representation by continually shifting as the weighted average of bonafide representations. Our approach uses only bonafide samples to define their centroid, which can yield a specialized centroid for one-class learning. Integrating our ACS with one-class learning gathers bonafide representations into a single cluster, forming well-separated embeddings robust to unseen spoofing attacks. Our proposed method achieves an equal error rate (EER) of 2.19% on the ASVspoof 2021 deepfake dataset, outperforming all existing systems. Furthermore, the t-SNE visualization illustrates that our method effectively maps the bonafide embeddings into a single cluster and successfully disentangles the bonafide and spoof classes.

* Accepted by Interspeech 2024

Via

Access Paper or Ask Questions

STaR: Distilling Speech Temporal Relation for Lightweight Speech Self-Supervised Learning Models

Dec 14, 2023

Kangwook Jang, Sungnyun Kim, Hoirin Kim

Abstract:Albeit great performance of Transformer-based speech selfsupervised learning (SSL) models, their large parameter size and computational cost make them unfavorable to utilize. In this study, we propose to compress the speech SSL models by distilling speech temporal relation (STaR). Unlike previous works that directly match the representation for each speech frame, STaR distillation transfers temporal relation between speech frames, which is more suitable for lightweight student with limited capacity. We explore three STaR distillation objectives and select the best combination as the final STaR loss. Our model distilled from HuBERT BASE achieves an overall score of 79.8 on SUPERB benchmark, the best performance among models with up to 27 million parameters. We show that our method is applicable across different speech SSL models and maintains robust performance with further reduced parameters.

* Accepted at ICASSP 2024

Via

Access Paper or Ask Questions

Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation

May 19, 2023

Kangwook Jang, Sungnyun Kim, Se-Young Yun, Hoirin Kim

Abstract:Transformer-based speech self-supervised learning (SSL) models, such as HuBERT, show surprising performance in various speech processing tasks. However, huge number of parameters in speech SSL models necessitate the compression to a more compact model for wider usage in academia or small companies. In this study, we suggest to reuse attention maps across the Transformer layers, so as to remove key and query parameters while retaining the number of layers. Furthermore, we propose a novel masking distillation strategy to improve the student model's speech representation quality. We extend the distillation loss to utilize both masked and unmasked speech frames to fully leverage the teacher model's high-quality representation. Our universal compression strategy yields the student model that achieves phoneme error rate (PER) of 7.72% and word error rate (WER) of 9.96% on the SUPERB benchmark.

* Interspeech 2023. Code URL: https://github.com/sungnyun/ARMHuBERT

Via

Access Paper or Ask Questions

Deep Metric Learning with Adaptive Margin and Adaptive Scale for Acoustic Word Discrimination

Oct 26, 2022

Myunghun Jung, Hoirin Kim

Abstract:Many recent loss functions in deep metric learning are expressed with logarithmic and exponential forms, and they involve margin and scale as essential hyper-parameters. Since each data class has an intrinsic characteristic, several previous works have tried to learn embedding space close to the real distribution by introducing adaptive margins. However, there was no work on adaptive scales at all. We argue that both margin and scale should be adaptively adjustable during the training. In this paper, we propose a method called Adaptive Margin and Scale (AdaMS), where hyper-parameters of margin and scale are replaced with learnable parameters of adaptive margins and adaptive scales for each class. Our method is evaluated on Wall Street Journal dataset, and we achieve outperforming results for word discrimination tasks.

* Submitted to ICASSP2023

Via

Access Paper or Ask Questions

FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning

Jul 01, 2022

Yeonghyeon Lee, Kangwook Jang, Jahyun Goo, Youngmoon Jung, Hoirin Kim

Figure 1 for FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning

Figure 2 for FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning

Figure 3 for FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning

Figure 4 for FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning

Abstract:Large-scale speech self-supervised learning (SSL) has emerged to the main field of speech processing, however, the problem of computational cost arising from its vast size makes a high entry barrier to academia. In addition, existing distillation techniques of speech SSL models compress the model by reducing layers, which induces performance degradation in linguistic pattern recognition tasks such as phoneme recognition (PR). In this paper, we propose FitHuBERT, which makes thinner in dimension throughout almost all model components and deeper in layer compared to prior speech SSL distillation works. Moreover, we employ a time-reduction layer to speed up inference time and propose a method of hint-based distillation for less performance degradation. Our method reduces the model to 23.8% in size and 35.9% in inference time compared to HuBERT. Also, we achieve 12.1% word error rate and 13.3% phoneme error rate on the SUPERB benchmark which is superior than prior work.

* Accepted to Interspeech 2022

Via

Access Paper or Ask Questions

Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck

Apr 04, 2022

Youngsik Eom, Yeonghyeon Lee, Ji Sub Um, Hoirin Kim

Figure 1 for Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck

Figure 2 for Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck

Figure 3 for Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck

Figure 4 for Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck

Abstract:Recent advances in sophisticated synthetic speech generated from text-to-speech (TTS) or voice conversion (VC) systems cause threats to the existing automatic speaker verification (ASV) systems. Since such synthetic speech is generated from diverse algorithms, generalization ability with using limited training data is indispensable for a robust anti-spoofing system. In this work, we propose a transfer learning scheme based on the wav2vec 2.0 pretrained model with variational information bottleneck (VIB) for speech anti-spoofing task. Evaluation on the ASVspoof 2019 logical access (LA) database shows that our method improves the performance of distinguishing unseen spoofed and genuine speech, outperforming current state-of-the-art anti-spoofing systems. Furthermore, we show that the proposed system improves performance in low-resource and cross-dataset settings of anti-spoofing task significantly, demonstrating that our system is also robust in terms of data size and data distribution.

* Submitted to Interspeech 2022

Via

Access Paper or Ask Questions

Asymmetric Proxy Loss for Multi-View Acoustic Word Embeddings

Mar 30, 2022

Myunghun Jung, Hoirin Kim

Figure 1 for Asymmetric Proxy Loss for Multi-View Acoustic Word Embeddings

Figure 2 for Asymmetric Proxy Loss for Multi-View Acoustic Word Embeddings

Figure 3 for Asymmetric Proxy Loss for Multi-View Acoustic Word Embeddings

Figure 4 for Asymmetric Proxy Loss for Multi-View Acoustic Word Embeddings

Abstract:Acoustic word embeddings (AWEs) are discriminative representations of speech segments, and learned embedding space reflects the phonetic similarity between words. With multi-view learning, where text labels are considered as supplementary input, AWEs are jointly trained with acoustically grounded word embeddings (AGWEs). In this paper, we expand the multi-view approach into a proxy-based framework for deep metric learning by equating AGWEs with proxies. A simple modification in computing the similarity matrix allows the general pair weighting to formulate the data-to-proxy relationship. Under the systematized framework, we propose an asymmetric-proxy loss that combines different parts of loss functions asymmetrically while keeping their merits. It follows the assumptions that the optimal function for anchor-positive pairs may differ from one for anchor-negative pairs, and a proxy may have a different impact when it substitutes for different positions in the triplet. We present comparative experiments with various proxy-based losses including our asymmetric-proxy loss, and evaluate AWEs and AGWEs for word discrimination tasks on WSJ corpus. The results demonstrate the effectiveness of the proposed method.

* Submitted to Interspeech 2022

Via

Access Paper or Ask Questions

Perceptually Guided End-to-End Text-to-Speech

Nov 02, 2020

Yeunju Choi, Youngmoon Jung, Youngjoo Suh, Hoirin Kim

Figure 1 for Perceptually Guided End-to-End Text-to-Speech

Figure 2 for Perceptually Guided End-to-End Text-to-Speech

Figure 3 for Perceptually Guided End-to-End Text-to-Speech

Abstract:Several fast text-to-speech (TTS) models have been proposed for real-time processing, but there is room for improvement in speech quality. Meanwhile, there is a mismatch between the loss function for training and the mean opinion score (MOS) for evaluation, which may limit the speech quality of TTS models. In this work, we propose a method that can improve the speech quality of a fast TTS model while maintaining the inference speed. To do so, we train a TTS model using a perceptual loss based on the predicted MOS. Under the supervision of a MOS prediction model, a TTS model can learn to increase the perceptual quality of speech directly. In experiments, we train FastSpeech on our internal Korean dataset using the MOS prediction model pre-trained on the Voice Conversion Challenge 2018 evaluation results. The MOS test results show that our proposed approach outperforms FastSpeech in speech quality.

* 5 pages, 1 figure, submitted to ICASSP 2021

Via

Access Paper or Ask Questions