Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Haoxu Wang

ZipEnhancer: Dual-Path Down-Up Sampling-based Zipformer for Monaural Speech Enhancement

Jan 09, 2025

Haoxu Wang, Biao Tian

Abstract:In contrast to other sequence tasks modeling hidden layer features with three axes, Dual-Path time and time-frequency domain speech enhancement models are effective and have low parameters but are computationally demanding due to their hidden layer features with four axes. We propose ZipEnhancer, which is Dual-Path Down-Up Sampling-based Zipformer for Monaural Speech Enhancement, incorporating time and frequency domain Down-Up sampling to reduce computational costs. We introduce the ZipformerBlock as the core block and propose the design of the Dual-Path DownSampleStacks that symmetrically scale down and scale up. Also, we introduce the ScaleAdam optimizer and Eden learning rate scheduler to improve the performance further. Our model achieves new state-of-the-art results on the DNS 2020 Challenge and Voicebank+DEMAND datasets, with a perceptual evaluation of speech quality (PESQ) of 3.69 and 3.63, using 2.04M parameters and 62.41G FLOPS, outperforming other methods with similar complexity levels.

* Accepted by ICASSP 2025

Via

Access Paper or Ask Questions

Memory-Efficient and Secure DNN Inference on TrustZone-enabled Consumer IoT Devices

Mar 19, 2024

Xueshuo Xie, Haoxu Wang, Zhaolong Jian, Tao Li, Wei Wang, Zhiwei Xu, Guiling Wang

Figure 1 for Memory-Efficient and Secure DNN Inference on TrustZone-enabled Consumer IoT Devices

Figure 2 for Memory-Efficient and Secure DNN Inference on TrustZone-enabled Consumer IoT Devices

Figure 3 for Memory-Efficient and Secure DNN Inference on TrustZone-enabled Consumer IoT Devices

Figure 4 for Memory-Efficient and Secure DNN Inference on TrustZone-enabled Consumer IoT Devices

Abstract:Edge intelligence enables resource-demanding Deep Neural Network (DNN) inference without transferring original data, addressing concerns about data privacy in consumer Internet of Things (IoT) devices. For privacy-sensitive applications, deploying models in hardware-isolated trusted execution environments (TEEs) becomes essential. However, the limited secure memory in TEEs poses challenges for deploying DNN inference, and alternative techniques like model partitioning and offloading introduce performance degradation and security issues. In this paper, we present a novel approach for advanced model deployment in TrustZone that ensures comprehensive privacy preservation during model inference. We design a memory-efficient management method to support memory-demanding inference in TEEs. By adjusting the memory priority, we effectively mitigate memory leakage risks and memory overlap conflicts, resulting in 32 lines of code alterations in the trusted operating system. Additionally, we leverage two tiny libraries: S-Tinylib (2,538 LoCs), a tiny deep learning library, and Tinylibm (827 LoCs), a tiny math library, to support efficient inference in TEEs. We implemented a prototype on Raspberry Pi 3B+ and evaluated it using three well-known lightweight DNN models. The experimental results demonstrate that our design significantly improves inference speed by 3.13 times and reduces power consumption by over 66.5% compared to non-memory optimization method in TEEs.

Via

Access Paper or Ask Questions

Robust Wake Word Spotting With Frame-Level Cross-Modal Attention Based Audio-Visual Conformer

Mar 04, 2024

Haoxu Wang, Ming Cheng, Qiang Fu, Ming Li

Abstract:In recent years, neural network-based Wake Word Spotting achieves good performance on clean audio samples but struggles in noisy environments. Audio-Visual Wake Word Spotting (AVWWS) receives lots of attention because visual lip movement information is not affected by complex acoustic scenes. Previous works usually use simple addition or concatenation for multi-modal fusion. The inter-modal correlation remains relatively under-explored. In this paper, we propose a novel module called Frame-Level Cross-Modal Attention (FLCMA) to improve the performance of AVWWS systems. This module can help model multi-modal information at the frame-level through synchronous lip movements and speech signals. We train the end-to-end FLCMA based Audio-Visual Conformer and further improve the performance by fine-tuning pre-trained uni-modal models for the AVWWS task. The proposed system achieves a new state-of-the-art result (4.57% WWS score) on the far-field MISP dataset.

* Accepted by ICASSP 2024

Via

Access Paper or Ask Questions

LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition

Jan 12, 2024

Fan Yu, Haoxu Wang, Xian Shi, Shiliang Zhang

Abstract:The growing prevalence of online conferences and courses presents a new challenge in improving automatic speech recognition (ASR) with enriched textual information from video slides. In contrast to rare phrase lists, the slides within videos are synchronized in real-time with the speech, enabling the extraction of long contextual bias. Therefore, we propose a novel long-context biasing network (LCB-net) for audio-visual speech recognition (AVSR) to leverage the long-context information available in videos effectively. Specifically, we adopt a bi-encoder architecture to simultaneously model audio and long-context biasing. Besides, we also propose a biasing prediction module that utilizes binary cross entropy (BCE) loss to explicitly determine biased phrases in the long-context biasing. Furthermore, we introduce a dynamic contextual phrases simulation to enhance the generalization and robustness of our LCB-net. Experiments on the SlideSpeech, a large-scale audio-visual corpus enriched with slides, reveal that our proposed LCB-net outperforms general ASR model by 9.4%/9.1%/10.9% relative WER/U-WER/B-WER reduction on test set, which enjoys high unbiased and biased performance. Moreover, we also evaluate our model on LibriSpeech corpus, leading to 23.8%/19.2%/35.4% relative WER/U-WER/B-WER reduction over the ASR model.

* Accepted by ICASPP 2024

Via

Access Paper or Ask Questions

Hourglass-AVSR: Down-Up Sampling-based Computational Efficiency Model for Audio-Visual Speech Recognition

Dec 14, 2023

Fan Yu, Haoxu Wang, Ziyang Ma, Shiliang Zhang

Abstract:Recently audio-visual speech recognition (AVSR), which better leverages video modality as additional information to extend automatic speech recognition (ASR), has shown promising results in complex acoustic environments. However, there is still substantial space to improve as complex computation of visual modules and ineffective fusion of audio-visual modalities. To eliminate these drawbacks, we propose a down-up sampling-based AVSR model (Hourglass-AVSR) to enjoy high efficiency and performance, whose time length is scaled during the intermediate processing, resembling an hourglass. Firstly, we propose a context and residual aware video upsampling approach to improve the recognition performance, which utilizes contextual information from visual representations and captures residual information between adjacent video frames. Secondly, we introduce a visual-audio alignment approach during the upsampling by explicitly incorporating boundary constraint loss. Besides, we propose a cross-layer attention fusion to capture the modality dependencies within each visual encoder layer. Experiments conducted on the MISP-AVSR dataset reveal that our proposed Hourglass-AVSR model outperforms ASR model by 12.9% and 20.8% relative concatenated minimum permutation character error rate (cpCER) reduction on far-field and middle-field test sets, respectively. Moreover, compared to other state-of-the-art AVSR models, our model exhibits the highest improvement in cpCER for the visual module. Furthermore, on the benefit of our down-up sampling approach, Hourglass-AVSR model reduces 54.2% overall computation costs with minor performance degradation.

* Accepted by ICASSP 2024

Via

Access Paper or Ask Questions

SlideSpeech: A Large-Scale Slide-Enriched Audio-Visual Corpus

Sep 12, 2023

Haoxu Wang, Fan Yu, Xian Shi, Yuezhang Wang, Shiliang Zhang, Ming Li

Abstract:Multi-Modal automatic speech recognition (ASR) techniques aim to leverage additional modalities to improve the performance of speech recognition systems. While existing approaches primarily focus on video or contextual information, the utilization of extra supplementary textual information has been overlooked. Recognizing the abundance of online conference videos with slides, which provide rich domain-specific information in the form of text and images, we release SlideSpeech, a large-scale audio-visual corpus enriched with slides. The corpus contains 1,705 videos, 1,000+ hours, with 473 hours of high-quality transcribed speech. Moreover, the corpus contains a significant amount of real-time synchronized slides. In this work, we present the pipeline for constructing the corpus and propose baseline methods for utilizing text information in the visual slide context. Through the application of keyword extraction and contextual ASR methods in the benchmark system, we demonstrate the potential of improving speech recognition performance by incorporating textual information from supplementary video slides.

Via

Access Paper or Ask Questions

The DKU Post-Challenge Audio-Visual Wake Word Spotting System for the 2021 MISP Challenge: Deep Analysis

Mar 04, 2023

Haoxu Wang, Ming Cheng, Qiang Fu, Ming Li

Figure 1 for The DKU Post-Challenge Audio-Visual Wake Word Spotting System for the 2021 MISP Challenge: Deep Analysis

Figure 2 for The DKU Post-Challenge Audio-Visual Wake Word Spotting System for the 2021 MISP Challenge: Deep Analysis

Figure 3 for The DKU Post-Challenge Audio-Visual Wake Word Spotting System for the 2021 MISP Challenge: Deep Analysis

Figure 4 for The DKU Post-Challenge Audio-Visual Wake Word Spotting System for the 2021 MISP Challenge: Deep Analysis

Abstract:This paper further explores our previous wake word spotting system ranked 2-nd in Track 1 of the MISP Challenge 2021. First, we investigate a robust unimodal approach based on 3D and 2D convolution and adopt the simple attention module (SimAM) for our system to improve performance. Second, we explore different combinations of data augmentation methods for better performance. Finally, we study the fusion strategies, including score-level, cascaded and neural fusion. Our proposed multimodal system leverages multimodal features and uses the complementary visual information to mitigate the performance degradation of audio-only systems in complex acoustic scenarios. Our system obtains a false reject rate of 2.15% and a false alarm rate of 3.44% in the evaluation set of the competition database, which achieves the new state-of-the-art performance by 21% relative improvement compared to previous systems.

* Accepted by ICASSP 2023

Via

Access Paper or Ask Questions

Generating Adversarial Samples For Training Wake-up Word Detection Systems Against Confusing Words

Jan 01, 2022

Haoxu Wang, Yan Jia, Zeqing Zhao, Xuyang Wang, Junjie Wang, Ming Li

Figure 1 for Generating Adversarial Samples For Training Wake-up Word Detection Systems Against Confusing Words

Figure 2 for Generating Adversarial Samples For Training Wake-up Word Detection Systems Against Confusing Words

Figure 3 for Generating Adversarial Samples For Training Wake-up Word Detection Systems Against Confusing Words

Figure 4 for Generating Adversarial Samples For Training Wake-up Word Detection Systems Against Confusing Words

Abstract:Wake-up word detection models are widely used in real life, but suffer from severe performance degradation when encountering adversarial samples. In this paper we discuss the concept of confusing words in adversarial samples. Confusing words are commonly encountered, which are various kinds of words that sound similar to the predefined keywords. To enhance the wake word detection system's robustness against confusing words, we propose several methods to generate the adversarial confusing samples for simulating real confusing words scenarios in which we usually do not have any real confusing samples in the training set. The generated samples include concatenated audio, synthesized data, and partially masked keywords. Moreover, we use a domain embedding concatenated system to improve the performance. Experimental results show that the adversarial samples generated in our approach help improve the system's robustness in both the common scenario and the confusing words scenario. In addition, we release the confusing words testing database called HI-MIA-CW for future research.

* arXiv admin note: substantial text overlap with arXiv:2011.01460

Via

Access Paper or Ask Questions