Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhiyuan Tang

Chain of Correction for Full-text Speech Recognition with Large Language Models

Apr 02, 2025

Zhiyuan Tang, Dong Wang, Zhikai Zhou, Yong Liu, Shen Huang, Shidong Shang

Abstract:Full-text error correction with Large Language Models (LLMs) for Automatic Speech Recognition (ASR) has gained increased attention due to its potential to correct errors across long contexts and address a broader spectrum of error types, including punctuation restoration and inverse text normalization. Nevertheless, many challenges persist, including issues related to stability, controllability, completeness, and fluency. To mitigate these challenges, this paper proposes the Chain of Correction (CoC) for full-text error correction with LLMs, which corrects errors segment by segment using pre-recognized text as guidance within a regular multi-turn chat format. The CoC also uses pre-recognized full text for context, allowing the model to better grasp global semantics and maintain a comprehensive overview of the entire content. Utilizing the open-sourced full-text error correction dataset ChFT, we fine-tune a pre-trained LLM to evaluate the performance of the CoC framework. Experimental results demonstrate that the CoC effectively corrects errors in full-text ASR outputs, significantly outperforming baseline and benchmark systems. We further analyze how to set the correction threshold to balance under-correction and over-rephrasing, extrapolate the CoC model on extremely long ASR outputs, and investigate whether other types of information can be employed to guide the error correction process.

Via

Access Paper or Ask Questions

Full-text Error Correction for Chinese Speech Recognition with Large Language Model

Sep 12, 2024

Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang

Figure 1 for Full-text Error Correction for Chinese Speech Recognition with Large Language Model

Figure 2 for Full-text Error Correction for Chinese Speech Recognition with Large Language Model

Figure 3 for Full-text Error Correction for Chinese Speech Recognition with Large Language Model

Figure 4 for Full-text Error Correction for Chinese Speech Recognition with Large Language Model

Abstract:Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). However, most research focuses on utterances from short-duration speech recordings, which are the predominant form of speech data for supervised ASR training. This paper investigates the effectiveness of LLMs for error correction in full-text generated by ASR systems from longer speech recordings, such as transcripts from podcasts, news broadcasts, and meetings. First, we develop a Chinese dataset for full-text error correction, named ChFT, utilizing a pipeline that involves text-to-speech synthesis, ASR, and error-correction pair extractor. This dataset enables us to correct errors across contexts, including both full-text and segment, and to address a broader range of error types, such as punctuation restoration and inverse text normalization, thus making the correction process comprehensive. Second, we fine-tune a pre-trained LLM on the constructed dataset using a diverse set of prompts and target formats, and evaluate its performance on full-text error correction. Specifically, we design prompts based on full-text and segment, considering various output formats, such as directly corrected text and JSON-based error-correction pairs. Through various test settings, including homogeneous, up-to-date, and hard test sets, we find that the fine-tuned LLMs perform well in the full-text setting with different prompts, each presenting its own strengths and weaknesses. This establishes a promising baseline for further research. The dataset is available on the website.

Via

Access Paper or Ask Questions

Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models

Jul 02, 2024

Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang

Abstract:Recent studies have demonstrated the efficacy of large language models (LLMs) in error correction for automatic speech recognition (ASR). However, much of the research focuses on the English language. This paper redirects the attention to Chinese. Firstly, we construct a specialized benchmark dataset aimed at error correction for Chinese ASR with 724K hypotheses-transcription pairs, named the Chinese Hypotheses Paradise dataset (ChineseHP), which contains a wide range of scenarios and presents significant challenges. Subsequently, we conduct a preliminary evaluation using the dataset for both direct-prompting and fine-tuning pre-trained LLMs. Furthermore, we propose a straightforward method of Pinyin regularization for prompts, which involves the transcription of Pinyin directly from text hypotheses. The experimental results reveal that Pinyin regularization consistently enhances the error-correcting ability of LLMs when compared with those without regularization. The dataset is available on the website.

* Interspeech 2024

Via

Access Paper or Ask Questions

Semantic Data Augmentation for End-to-End Mandarin Speech Recognition

Apr 26, 2021

Jianwei Sun, Zhiyuan Tang, Hengxin Yin, Wei Wang, Xi Zhao, Shuaijiang Zhao, Xiaoning Lei, Wei Zou, Xiangang Li

Figure 1 for Semantic Data Augmentation for End-to-End Mandarin Speech Recognition

Figure 2 for Semantic Data Augmentation for End-to-End Mandarin Speech Recognition

Figure 3 for Semantic Data Augmentation for End-to-End Mandarin Speech Recognition

Figure 4 for Semantic Data Augmentation for End-to-End Mandarin Speech Recognition

Abstract:End-to-end models have gradually become the preferred option for automatic speech recognition (ASR) applications. During the training of end-to-end ASR, data augmentation is a quite effective technique for regularizing the neural networks. This paper proposes a novel data augmentation technique based on semantic transposition of the transcriptions via syntax rules for end-to-end Mandarin ASR. Specifically, we first segment the transcriptions based on part-of-speech tags. Then transposition strategies, such as placing the object in front of the subject or swapping the subject and the object, are applied on the segmented sentences. Finally, the acoustic features corresponding to the transposed transcription are reassembled based on the audio-to-text forced-alignment produced by a pre-trained ASR system. The combination of original data and augmented one is used for training a new ASR system. The experiments are conducted on the Transformer[2] and Conformer[3] based ASR. The results show that the proposed method can give consistent performance gain to the system. Augmentation related issues, such as comparison of different strategies and ratios for data combination are also investigated.

Via

Access Paper or Ask Questions

Can We Trust Deep Speech Prior?

Nov 04, 2020

Ying Shi, Haolin Chen, Zhiyuan Tang, Lantian Li, Dong Wang, Jiqing Han

Figure 1 for Can We Trust Deep Speech Prior?

Figure 2 for Can We Trust Deep Speech Prior?

Figure 3 for Can We Trust Deep Speech Prior?

Figure 4 for Can We Trust Deep Speech Prior?

Abstract:Recently, speech enhancement (SE) based on deep speech prior has attracted much attention, such as the variational auto-encoder with non-negative matrix factorization (VAE-NMF) architecture. Compared to conventional approaches that represent clean speech by shallow models such as Gaussians with a low-rank covariance, the new approach employs deep generative models to represent the clean speech, which often provides a better prior. Despite the clear advantage in theory, we argue that deep priors must be used with much caution, since the likelihood produced by a deep generative model does not always coincide with the speech quality. We designed a comprehensive study on this issue and demonstrated that based on deep speech priors, a reasonable SE performance can be achieved, but the results might be suboptimal. A careful analysis showed that this problem is deeply rooted in the disharmony between the flexibility of deep generative models and the nature of the maximum-likelihood (ML) training.

* To be published in IEEE SLT 2021

Via

Access Paper or Ask Questions

AP20-OLR Challenge: Three Tasks and Their Baselines

Jun 04, 2020

Zheng Li, Miao Zhao, Qingyang Hong, Lin Li, Zhiyuan Tang, Dong Wang, Liming Song, Cheng Yang

Figure 1 for AP20-OLR Challenge: Three Tasks and Their Baselines

Figure 2 for AP20-OLR Challenge: Three Tasks and Their Baselines

Figure 3 for AP20-OLR Challenge: Three Tasks and Their Baselines

Abstract:This paper introduces the fifth oriental language recognition (OLR) challenge AP20-OLR, which intends to improve the performance of language recognition systems, along with APSIPA Annual Summit and Conference (APSIPA ASC). The data profile, three tasks, the corresponding baselines, and the evaluation principles are introduced in this paper. The AP20-OLR challenge includes more languages, dialects and real-life data provided by Speechocean and the NSFC M2ASR project, and all the data is free for participants. The challenge this year still focuses on practical and challenging problems, with three tasks: (1) cross-channel LID, (2) dialect identification and (3) noisy LID. Based on Kaldi and Pytorch, recipes for i-vector and x-vector systems are also conducted as baselines for the three tasks. These recipes will be online-published, and available for participants to configure LID systems. The baseline results on the three tasks demonstrate that those tasks in this challenge are worth paying more efforts to achieve better performance.

* arXiv admin note: substantial text overlap with arXiv:1907.07626, arXiv:1806.00616, arXiv:1706.09742

Via

Access Paper or Ask Questions

AP19-OLR Challenge: Three Tasks and Their Baselines

Sep 01, 2019

Zhiyuan Tang, Dong Wang, Liming Song

Figure 1 for AP19-OLR Challenge: Three Tasks and Their Baselines

Figure 2 for AP19-OLR Challenge: Three Tasks and Their Baselines

Figure 3 for AP19-OLR Challenge: Three Tasks and Their Baselines

Abstract:This paper introduces the fourth oriental language recognition (OLR) challenge AP19-OLR, including the data profile, the tasks and the evaluation principles. The OLR challenge has been held successfully for three consecutive years, along with APSIPA Annual Summit and Conference (APSIPA ASC). The challenge this year still focuses on practical and challenging tasks, precisely (1) short-utterance LID, (2) cross-channel LID and (3) zero-resource LID. The event this year includes more languages and more real-life data provided by SpeechOcean and the NSFC M2ASR project. All the data is free for participants. Recipes for x-vector system and back-end evaluation are also conducted as baselines for the three tasks. The participants can refer to these online-published recipes to deploy LID systems for convenience. We report the baseline results on the three tasks and demonstrate that the three tasks are worth some efforts to achieve better performance.

* arXiv admin note: substantial text overlap with arXiv:1806.00616, arXiv:1706.09742, arXiv:1609.08445

Via

Access Paper or Ask Questions

Gaussian-Constrained training for speaker verification

Nov 08, 2018

Lantian Li, Zhiyuan Tang, Ying Shi, Dong Wang

Figure 1 for Gaussian-Constrained training for speaker verification

Figure 2 for Gaussian-Constrained training for speaker verification

Figure 3 for Gaussian-Constrained training for speaker verification

Abstract:Neural models, in particular the d-vector and x-vector architectures, have produced state-of-the-art performance on many speaker verification tasks. However, two potential problems of these neural models deserve more investigation. Firstly, both models suffer from `information leak', which means that some parameters participating in model training will be discarded during inference, i.e, the layers that are used as the classifier. Secondly, both models do not regulate the distribution of the derived speaker vectors. This `unconstrained distribution' may degrade the performance of the subsequent scoring component, e.g., PLDA. This paper proposes a Gaussian-constrained training approach that (1) discards the parametric classifier, and (2) enforces the distribution of the derived speaker vectors to be Gaussian. Our experiments on the VoxCeleb and SITW databases demonstrated that this new training approach produced more representative and regular speaker embeddings, leading to consistent performance improvement.

Via

Access Paper or Ask Questions

Phonetic-attention scoring for deep speaker features in speaker verification

Nov 08, 2018

Lantian Li, Zhiyuan Tang, Ying Shi, Dong Wang

Figure 1 for Phonetic-attention scoring for deep speaker features in speaker verification

Figure 2 for Phonetic-attention scoring for deep speaker features in speaker verification

Figure 3 for Phonetic-attention scoring for deep speaker features in speaker verification

Figure 4 for Phonetic-attention scoring for deep speaker features in speaker verification

Abstract:Recent studies have shown that frame-level deep speaker features can be derived from a deep neural network with the training target set to discriminate speakers by a short speech segment. By pooling the frame-level features, utterance-level representations, called d-vectors, can be derived and used in the automatic speaker verification (ASV) task. This simple average pooling, however, is inherently sensitive to the phonetic content of the utterance. An interesting idea borrowed from machine translation is the attention-based mechanism, where the contribution of an input word to the translation at a particular time is weighted by an attention score. This score reflects the relevance of the input word and the present translation. We can use the same idea to align utterances with different phonetic contents. This paper proposes a phonetic-attention scoring approach for d-vector systems. By this approach, an attention score is computed for each frame pair. This score reflects the similarity of the two frames in phonetic content, and is used to weigh the contribution of this frame pair in the utterance-based scoring. This new scoring approach emphasizes the frame pairs with similar phonetic contents, which essentially provides a soft alignment for utterances with any phonetic contents. Experimental results show that compared with the naive average pooling, this phonetic-attention scoring approach can deliver consistent performance improvement in ASV tasks of both text-dependent and text-independent.

* Submitted to ICASSP 2019

Via

Access Paper or Ask Questions

AP18-OLR Challenge: Three Tasks and Their Baselines

Jun 02, 2018

Zhiyuan Tang, Dong Wang, Qing Chen

Figure 1 for AP18-OLR Challenge: Three Tasks and Their Baselines

Figure 2 for AP18-OLR Challenge: Three Tasks and Their Baselines

Figure 3 for AP18-OLR Challenge: Three Tasks and Their Baselines

Abstract:The third oriental language recognition (OLR) challenge AP18-OLR is introduced in this paper, including the data profile, the tasks and the evaluation principles. Following the events in the last two years, namely AP16-OLR and AP17-OLR, the challenge this year focuses on more challenging tasks, including (1) short-duration utterances, (2) confusing languages, and (3) open-set recognition. The same as the previous events, the data of AP18-OLR is also provided by SpeechOcean and the NSFC M2ASR project. Baselines based on both the i-vector model and neural networks are constructed for the participants' reference. We report the baseline results on the three tasks and demonstrate that the three tasks are truly challenging. All the data is free for participants, and the Kaldi recipes for the baselines have been published online.

* arXiv admin note: substantial text overlap with arXiv:1706.09742

Via

Access Paper or Ask Questions