Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yizhou Peng

ED-sKWS: Early-Decision Spiking Neural Networks for Rapid,and Energy-Efficient Keyword Spotting

Jun 14, 2024

Zeyang Song, Qianhui Liu, Qu Yang, Yizhou Peng, Haizhou Li

Abstract:Keyword Spotting (KWS) is essential in edge computing requiring rapid and energy-efficient responses. Spiking Neural Networks (SNNs) are well-suited for KWS for their efficiency and temporal capacity for speech. To further reduce the latency and energy consumption, this study introduces ED-sKWS, an SNN-based KWS model with an early-decision mechanism that can stop speech processing and output the result before the end of speech utterance. Furthermore, we introduce a Cumulative Temporal (CT) loss that can enhance prediction accuracy at both the intermediate and final timesteps. To evaluate early-decision performance, we present the SC-100 dataset including 100 speech commands with beginning and end timestamp annotation. Experiments on the Google Speech Commands v2 and our SC-100 datasets show that ED-sKWS maintains competitive accuracy with 61% timesteps and 52% energy consumption compared to SNN models without early-decision mechanism, ensuring rapid response and energy efficiency.

* Accepted by INTERSPEECH2024

Via

Access Paper or Ask Questions

The NUS-HLT System for ICASSP2024 ICMC-ASR Grand Challenge

Dec 26, 2023

Meng Ge, Yizhou Peng, Yidi Jiang, Jingru Lin, Junyi Ao, Mehmet Sinan Yildirim, Shuai Wang, Haizhou Li, Mengling Feng

Abstract:This paper summarizes our team's efforts in both tracks of the ICMC-ASR Challenge for in-car multi-channel automatic speech recognition. Our submitted systems for ICMC-ASR Challenge include the multi-channel front-end enhancement and diarization, training data augmentation, speech recognition modeling with multi-channel branches. Tested on the offical Eval1 and Eval2 set, our best system achieves a relative 34.3% improvement in CER and 56.5% improvement in cpCER, compared to the offical baseline system.

* Technical Report. 2 pages. For ICMC-ASR-2023 Challenge

Via

Access Paper or Ask Questions

Adapting OpenAI's Whisper for Speech Recognition on Code-Switch Mandarin-English SEAME and ASRU2019 Datasets

Nov 29, 2023

Yuhang Yang, Yizhou Peng, Xionghu Zhong, Hao Huang, Eng Siong Chng

Abstract:This paper details the experimental results of adapting the OpenAI's Whisper model for Code-Switch Mandarin-English Speech Recognition (ASR) on the SEAME and ASRU2019 corpora. We conducted 2 experiments: a) using adaptation data from 1 to 100/200 hours to demonstrate effectiveness of adaptation, b) examining different language ID setup on Whisper prompt. The Mixed Error Rate results show that the amount of adaptation data may be as low as $1\sim10$ hours to achieve saturation in performance gain (SEAME) while the ASRU task continued to show performance with more adaptation data ($>$100 hours). For the language prompt, the results show that although various prompting strategies initially produce different outcomes, adapting the Whisper model with code-switch data uniformly improves its performance. These results may be relevant also to the community when applying Whisper for related tasks of adapting to new target domains.

* 6 pages, 3 figures, 4 tables

Via

Access Paper or Ask Questions

Mutual Information-Based Integrated Sensing and Communications: A WMMSE Framework

Oct 19, 2023

Yizhou Peng, Songjie Yang, Wanting Lyu, Ya Li, Hongjun He, Zhongpei Zhang, Chadi Assi

Abstract:In this letter, a weighted minimum mean square error (WMMSE) empowered integrated sensing and communication (ISAC) system is investigated. One transmitting base station and one receiving wireless access point are considered to serve multiple users a sensing target. Based on the theory of mutual-information (MI), communication MI and sensing MI rate are utilized as the performance metrics under the presence of clutters. In particular, we propose an novel MI-based WMMSE-ISAC method by developing a unique transceiver design mechanism to maximize the weighted sensing and communication sum-rate of this system. Such a maximization process is achieved by utilizing the classical method -- WMMSE, aiming to better manage the effect of sensing clutters and the interference among users. Numerical results show the effectiveness of our proposed method, and the performance trade-off between sensing and communication is also validated.

Via

Access Paper or Ask Questions

Intermediate-layer output Regularization for Attention-based Speech Recognition with Shared Decoder

Jul 09, 2022

Jicheng Zhang, Yizhou Peng, Haihua Xu, Yi He, Eng Siong Chng, Hao Huang

Figure 1 for Intermediate-layer output Regularization for Attention-based Speech Recognition with Shared Decoder

Figure 2 for Intermediate-layer output Regularization for Attention-based Speech Recognition with Shared Decoder

Figure 3 for Intermediate-layer output Regularization for Attention-based Speech Recognition with Shared Decoder

Figure 4 for Intermediate-layer output Regularization for Attention-based Speech Recognition with Shared Decoder

Abstract:Intermediate layer output (ILO) regularization by means of multitask training on encoder side has been shown to be an effective approach to yielding improved results on a wide range of end-to-end ASR frameworks. In this paper, we propose a novel method to do ILO regularized training differently. Instead of using conventional multitask methods that entail more training overhead, we directly make the intermediate layer output as input to the decoder, that is, our decoder not only accepts the output of the final encoder layer as input, it also takes the output of the encoder ILO as input during training. With the proposed method, as both encoder and decoder are simultaneously "regularized", the network is more sufficiently trained, consistently leading to improved results, over the ILO-based CTC method, as well as over the original attention-based modeling method without the proposed method employed.

* 5 pages. Submitted to INTERSPEECH 2022

Via

Access Paper or Ask Questions

Internal Language Model Estimation based Language Model Fusion for Cross-Domain Code-Switching Speech Recognition

Jul 09, 2022

Yizhou Peng, Yufei Liu, Jicheng Zhang, Haihua Xu, Yi He, Hao Huang, Eng Siong Chng

Figure 1 for Internal Language Model Estimation based Language Model Fusion for Cross-Domain Code-Switching Speech Recognition

Figure 2 for Internal Language Model Estimation based Language Model Fusion for Cross-Domain Code-Switching Speech Recognition

Figure 3 for Internal Language Model Estimation based Language Model Fusion for Cross-Domain Code-Switching Speech Recognition

Figure 4 for Internal Language Model Estimation based Language Model Fusion for Cross-Domain Code-Switching Speech Recognition

Abstract:Internal Language Model Estimation (ILME) based language model (LM) fusion has been shown significantly improved recognition results over conventional shallow fusion in both intra-domain and cross-domain speech recognition tasks. In this paper, we attempt to apply our ILME method to cross-domain code-switching speech recognition (CSSR) work. Specifically, our curiosity comes from several aspects. First, we are curious about how effective the ILME-based LM fusion is for both intra-domain and cross-domain CSSR tasks. We verify this with or without merging two code-switching domains. More importantly, we train an end-to-end (E2E) speech recognition model by means of merging two monolingual data sets and observe the efficacy of the proposed ILME-based LM fusion for CSSR. Experimental results on SEAME that is from Southeast Asian and another Chinese Mainland CS data set demonstrate the effectiveness of the proposed ILME-based LM fusion method.

* 5 pages. Submitted to INTERSPEECH 2022

Via

Access Paper or Ask Questions

Minimum word error training for non-autoregressive Transformer-based code-switching ASR

Oct 07, 2021

Yizhou Peng, Jicheng Zhang, Haihua Xu, Hao Huang, Eng Siong Chng

Figure 1 for Minimum word error training for non-autoregressive Transformer-based code-switching ASR

Figure 2 for Minimum word error training for non-autoregressive Transformer-based code-switching ASR

Figure 3 for Minimum word error training for non-autoregressive Transformer-based code-switching ASR

Figure 4 for Minimum word error training for non-autoregressive Transformer-based code-switching ASR

Abstract:Non-autoregressive end-to-end ASR framework might be potentially appropriate for code-switching recognition task thanks to its inherent property that present output token being independent of historical ones. However, it still under-performs the state-of-the-art autoregressive ASR frameworks. In this paper, we propose various approaches to boosting the performance of a CTC-mask-based nonautoregressive Transformer under code-switching ASR scenario. To begin with, we attempt diversified masking method that are closely related with code-switching point, yielding an improved baseline model. More importantly, we employ MinimumWord Error (MWE) criterion to train the model. One of the challenges is how to generate a diversified hypothetical space, so as to obtain the average loss for a given ground truth. To address such a challenge, we explore different approaches to yielding desired N-best-based hypothetical space. We demonstrate the efficacy of the proposed methods on SEAME corpus, a challenging English-Mandarin code-switching corpus for Southeast Asia community. Compared with the crossentropy-trained strong baseline, the proposed MWE training method achieves consistent performance improvement on the test sets.

* Submit to ICASSP 2021

Via

Access Paper or Ask Questions

E2E-based Multi-task Learning Approach to Joint Speech and Accent Recognition

Jun 15, 2021

Jicheng Zhang, Yizhou Peng, Pham Van Tung, Haihua Xu, Hao Huang, Eng Siong Chng

Figure 1 for E2E-based Multi-task Learning Approach to Joint Speech and Accent Recognition

Figure 2 for E2E-based Multi-task Learning Approach to Joint Speech and Accent Recognition

Figure 3 for E2E-based Multi-task Learning Approach to Joint Speech and Accent Recognition

Figure 4 for E2E-based Multi-task Learning Approach to Joint Speech and Accent Recognition

Abstract:In this paper, we propose a single multi-task learning framework to perform End-to-End (E2E) speech recognition (ASR) and accent recognition (AR) simultaneously. The proposed framework is not only more compact but can also yield comparable or even better results than standalone systems. Specifically, we found that the overall performance is predominantly determined by the ASR task, and the E2E-based ASR pretraining is essential to achieve improved performance, particularly for the AR task. Additionally, we conduct several analyses of the proposed method. First, though the objective loss for the AR task is much smaller compared with its counterpart of ASR task, a smaller weighting factor with the AR task in the joint objective function is necessary to yield better results for each task. Second, we found that sharing only a few layers of the encoder yields better AR results than sharing the overall encoder. Experimentally, the proposed method produces WER results close to the best standalone E2E ASR ones, while it achieves 7.7% and 4.2% relative improvement over standalone and single-task-based joint recognition methods on test set for accent recognition respectively.

Via

Access Paper or Ask Questions