Abstract: Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interaction by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks and making it the largest benchmark for speech and audio evaluation. While the first generation of Dynamic-SUPERB was limited to classification tasks, Phase-2 broadens its evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results indicate that no model performs well universally: SALMONN-13B excelled in English ASR and WavLLM demonstrated high accuracy in emotion recognition, but current models still require further innovation to handle a broader range of tasks. We will soon open-source all task data and the evaluation pipeline.
Abstract: Instruction-based speech processing is becoming popular. Studies show that training with multiple tasks boosts performance, but collecting diverse, large-scale tasks and datasets is expensive. It is thus highly desirable to design a fundamental task that benefits other downstream tasks. This paper introduces a multi-talker speaking-style captioning task to enhance the understanding of speaker and prosodic information. We used large language models to generate descriptions for multi-talker speech, and we trained our model by first pre-training on this captioning task and then applying instruction tuning. Evaluation on Dynamic-SUPERB shows our model outperforming a baseline pre-trained only on single-talker tasks, particularly in speaker and emotion recognition. Additionally, tests on a multi-talker QA task reveal that current models struggle with attributes such as gender, pitch, and speaking rate. The code and dataset are available at https://github.com/cyhuang-tw/speechcaps.
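To make the data-construction recipe concrete, below is a minimal sketch of how two single-talker utterances might be combined into a multi-talker clip and turned into an LLM captioning prompt. The file interface, attribute fields, and prompt template are illustrative assumptions, not the released pipeline:

```python
# Hypothetical sketch of multi-talker caption data construction; the metadata
# fields and prompt wording are assumptions, not the authors' actual recipe.
import numpy as np
import soundfile as sf

def mix_two_speakers(path_a: str, path_b: str, out_path: str) -> None:
    """Concatenate two single-talker utterances into one multi-talker clip."""
    wav_a, sr_a = sf.read(path_a)
    wav_b, sr_b = sf.read(path_b)
    assert sr_a == sr_b, "resample first if the sampling rates differ"
    sf.write(out_path, np.concatenate([wav_a, wav_b]), sr_a)

def build_caption_prompt(meta_a: dict, meta_b: dict) -> str:
    """Turn per-speaker attributes into an LLM prompt for style captioning."""
    return (
        "Describe the speaking styles in this two-speaker clip. "
        f"Speaker 1: gender={meta_a['gender']}, pitch={meta_a['pitch']}, "
        f"rate={meta_a['rate']}. "
        f"Speaker 2: gender={meta_b['gender']}, pitch={meta_b['pitch']}, "
        f"rate={meta_b['rate']}."
    )
```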
Abstract: Prompting and adapter tuning have emerged as efficient alternatives to fine-tuning (FT). However, existing studies on speech prompting have focused on classification tasks and failed on more complex sequence generation tasks. In addition, adapter tuning has primarily been applied to encoder-only self-supervised models. Our experiments show that prompting on Wav2Seq, a self-supervised encoder-decoder model, surpasses previous work on sequence generation tasks, achieving a remarkable 53% relative improvement in word error rate for ASR and a 27% relative improvement in F1 score for slot filling. Additionally, prompting is competitive with FT in the low-resource scenario. Moreover, we show the transferability of prompting and adapter tuning on Wav2Seq in cross-lingual ASR: when limited trainable parameters are involved, prompting and adapter tuning consistently outperform conventional FT across 7 languages. Notably, in the low-resource scenario, prompting consistently outperforms adapter tuning.
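As background, the following PyTorch sketch shows input-level prompt tuning on a frozen encoder-decoder, in the spirit of the Wav2Seq experiments; the wrapper, feature shapes, and initialization scale are illustrative assumptions rather than the paper's exact implementation:

```python
# A minimal sketch of prompt tuning: only the prepended prompt vectors train,
# while the pretrained encoder-decoder backbone stays frozen.
import torch
import torch.nn as nn

class PromptedSeq2Seq(nn.Module):
    def __init__(self, backbone: nn.Module, prompt_len: int, dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # freeze all pretrained weights
            p.requires_grad = False
        # the only trainable parameters: a short sequence of prompt vectors
        self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)

    def forward(self, feats: torch.Tensor, **kwargs):
        # feats: (batch, time, dim) input features for the encoder
        prompt = self.prompt.unsqueeze(0).expand(feats.size(0), -1, -1)
        return self.backbone(torch.cat([prompt, feats], dim=1), **kwargs)
```

Because only `prompt_len * dim` parameters are updated, the trainable footprint is a tiny fraction of full fine-tuning.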
Abstract: Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. However, existing studies in speech processing primarily focus on limited or specific tasks, and the lack of standardized benchmarks hinders fair comparison across different approaches. We therefore present Dynamic-SUPERB, a benchmark designed for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion. To achieve comprehensive coverage of diverse speech tasks and harness instruction tuning, we invite the community to collaborate and contribute, facilitating the dynamic growth of the benchmark. Initially, Dynamic-SUPERB features 55 evaluation instances formed by combining 33 tasks and 22 datasets, spanning a broad spectrum of dimensions and providing a comprehensive evaluation platform. Additionally, we propose several approaches to establish benchmark baselines, including speech models, text language models, and multimodal encoders. Evaluation results indicate that while these baselines perform reasonably on seen tasks, they struggle with unseen ones. We also conducted an ablation study to assess robustness and identify potential performance improvements. We release all materials to the public and welcome researchers to collaborate on the project, advancing technologies in the field together.
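As a rough illustration of how an instruction-based evaluation instance might be scored in such a benchmark, consider the sketch below; the instance schema and the `model.generate` interface are assumptions for exposition, not the benchmark's actual API:

```python
# Hypothetical zero-shot evaluation loop over instruction-based instances.
def evaluate(model, instances):
    """Each instance: {'audio': path, 'instruction': str, 'label': str}."""
    correct = 0
    for ex in instances:
        answer = model.generate(audio=ex["audio"], instruction=ex["instruction"])
        correct += int(answer.strip().lower() == ex["label"].strip().lower())
    return correct / len(instances)
```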
Abstract: Any-to-any voice conversion technologies convert the vocal timbre of an utterance to that of any target speaker, even one unseen during training. Although several state-of-the-art any-to-any voice conversion models exist, they all rely on clean utterances to convert successfully. In real-world scenarios, however, it is difficult to collect clean utterances of a speaker, and recordings are usually degraded by noise or reverberation. It is thus highly desirable to understand how these degradations affect voice conversion and to build a degradation-robust model. In this paper, we report the first comprehensive study on the degradation robustness of any-to-any voice conversion. We show that the performance of current state-of-the-art models is severely hampered by degraded utterances. To address this, we propose speech enhancement concatenation and denoising training to improve robustness. In addition to common degradations, we also consider adversarial noise, which alters the model output significantly yet is imperceptible to humans. Both concatenating off-the-shelf speech enhancement models and applying denoising training to voice conversion models improved robustness, though each approach has its own pros and cons.
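The two remedies can be sketched as follows; `enhancer`, `vc_model`, and `degrade` are hypothetical interfaces, and the teacher-style denoising objective shown is one plausible reading of denoising training, not the paper's exact loss:

```python
# (a) Enhancement concatenation: clean the source before converting.
# (b) Denoising training: feed a degraded source, supervise with the
#     conversion produced from the clean source.
import torch
import torch.nn.functional as F

def robust_convert(enhancer, vc_model, src_wav, tgt_wav):
    """Inference-time pipeline: enhance, then convert."""
    return vc_model(enhancer(src_wav), tgt_wav)

def denoising_step(vc_model, optimizer, src_wav, tgt_wav, degrade):
    """One denoising-training step on a clean/degraded pair."""
    with torch.no_grad():
        clean_out = vc_model(src_wav, tgt_wav)        # target from clean input
    noisy_out = vc_model(degrade(src_wav), tgt_wav)   # prediction from degraded
    loss = F.l1_loss(noisy_out, clean_out)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```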
Abstract: Substantial improvements have been made in machine reading comprehension, where a machine answers questions based on a given context. Current state-of-the-art models even surpass human performance on several benchmarks. However, their abilities in the cross-lingual scenario remain underexplored. Previous works have revealed the abilities of pre-trained multilingual models for zero-shot cross-lingual reading comprehension. In this paper, we further utilize unlabeled data to improve performance. The model is first trained with supervision on a source-language corpus and then self-trained on unlabeled target-language data. Experimental results show improvements for all languages, and we also qualitatively analyze how self-training benefits cross-lingual reading comprehension.
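The self-training recipe can be summarized in a short schematic; `train`, `predict`, and the confidence threshold are hypothetical placeholders standing in for the actual QA model and decoding:

```python
# Schematic self-training: supervised stage on the source language, then
# rounds of pseudo-labeling on unlabeled target-language examples.
def self_train(model, src_labeled, tgt_unlabeled, threshold=0.9, rounds=3):
    model = train(model, src_labeled)                 # supervised stage
    for _ in range(rounds):
        pseudo = []
        for ex in tgt_unlabeled:
            answer, confidence = predict(model, ex)   # model's own guess
            if confidence >= threshold:               # keep confident answers
                pseudo.append({**ex, "answer": answer})
        model = train(model, src_labeled + pseudo)    # retrain with pseudo data
    return model
```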
Abstract: Speech quality assessment has been a critical issue in speech processing for decades. Existing automatic evaluations usually require clean references or parallel ground-truth data, which becomes infeasible as the amount of data grows. Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception. However, such tests are expensive and time-consuming because crowdsourced work is necessary. It is thus highly desirable to develop an automatic evaluation approach that correlates well with human perception while not requiring ground-truth data. In this paper, we use self-supervised pre-trained models for MOS prediction. We show that their representations can distinguish between clean and noisy audio. We then fine-tune these pre-trained models, topped with simple linear layers, in an end-to-end manner. Experimental results show that our framework significantly outperforms the two previous state-of-the-art models on Voice Conversion Challenge 2018 and achieves comparable or superior performance on Voice Conversion Challenge 2016. We also conducted an ablation study to further investigate how each module benefits the task. Our experiments are implemented with publicly available toolkits and are reproducible.
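A minimal sketch of such a framework is shown below, assuming a torchaudio wav2vec 2.0 encoder as the self-supervised model and mean pooling before the linear head; both choices are illustrative, not necessarily the paper's exact setup:

```python
# SSL encoder + linear head, fine-tuned end-to-end to regress human MOS.
import torch
import torch.nn as nn
import torchaudio

class MOSPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        bundle = torchaudio.pipelines.WAV2VEC2_BASE
        self.encoder = bundle.get_model()      # pretrained SSL encoder
        self.head = nn.Linear(768, 1)          # simple linear output layer

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples); use the last layer's frame features
        feats, _ = self.encoder.extract_features(wav)
        pooled = feats[-1].mean(dim=1)         # average over time frames
        return self.head(pooled).squeeze(-1)   # one scalar MOS per utterance

model = MOSPredictor()
loss_fn = nn.L1Loss()  # e.g., L1 regression against human MOS labels
```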
Abstract: The few-shot multi-speaker multi-style voice cloning task is to synthesize utterances whose voice and speaking style are similar to those of a reference speaker, given only a few reference samples. In this work, we investigate different speaker representations and propose integrating pretrained and learnable speaker representations. Among the different types of embeddings, the embedding pretrained by voice conversion achieves the best performance. The FastSpeech 2 model combined with both pretrained and learnable speaker representations shows strong generalization to few-shot speakers and achieved 2nd place in the one-shot track of the ICASSP 2021 M2VoC challenge.
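One plausible way to combine a frozen pretrained speaker embedding with a learnable per-speaker table is sketched below; the concatenate-then-project fusion is an illustrative assumption, not necessarily the authors' architecture:

```python
# Fuse a pretrained (frozen) speaker embedding with a learnable lookup table
# to produce the conditioning vector for an acoustic model such as FastSpeech 2.
import torch
import torch.nn as nn

class SpeakerFusion(nn.Module):
    def __init__(self, n_speakers: int, pre_dim: int, learn_dim: int, out_dim: int):
        super().__init__()
        self.table = nn.Embedding(n_speakers, learn_dim)  # learnable per speaker
        self.proj = nn.Linear(pre_dim + learn_dim, out_dim)

    def forward(self, pretrained_emb: torch.Tensor, speaker_id: torch.Tensor):
        # pretrained_emb: (batch, pre_dim), e.g. from a voice-conversion model
        fused = torch.cat([pretrained_emb, self.table(speaker_id)], dim=-1)
        return self.proj(fused)  # speaker conditioning vector
```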
Abstract: Substantial improvements have been achieved in recent years in voice conversion, which converts the speaker characteristics of an utterance into those of another speaker without changing the linguistic content. Nonetheless, the improved conversion technologies have also led to concerns about privacy and authentication. It is thus highly desirable to prevent one's voice from being improperly utilized by such voice conversion technologies. In this paper, we report the first known attempt to perform adversarial attacks on voice conversion. We introduce human-imperceptible noise into the utterances of a speaker whose voice is to be defended. Given these adversarial examples, voice conversion models cannot convert other utterances to sound as if they were produced by the defended speaker. Preliminary experiments were conducted on two state-of-the-art zero-shot voice conversion models, and objective and subjective evaluation results in both white-box and black-box scenarios are reported. The results show that the speaker characteristics of the converted utterances become obviously different from those of the defended speaker, while the adversarial examples of the defended speaker remain indistinguishable from authentic utterances.
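To illustrate the idea, here is a schematic FGSM-style perturbation that pushes a speaker encoder's embedding of a defended utterance away from its original value while keeping the change small; the `speaker_encoder` interface, the cosine objective, and the epsilon budget are assumptions, and the paper's actual attack may differ:

```python
# One-step adversarial perturbation of a defended utterance.
import torch
import torch.nn.functional as F

def defend_utterance(speaker_encoder, wav: torch.Tensor, epsilon: float = 0.002):
    wav_adv = wav.clone().requires_grad_(True)
    orig_emb = speaker_encoder(wav).detach()  # embedding to move away from
    # loss increases as the perturbed embedding drifts from the original
    loss = -F.cosine_similarity(speaker_encoder(wav_adv), orig_emb, dim=-1).mean()
    loss.backward()
    with torch.no_grad():
        # small signed step, clipped back to a valid waveform range
        return (wav + epsilon * wav_adv.grad.sign()).clamp(-1.0, 1.0)
```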