Abstract: In this paper, we introduce a novel general speech restoration model: the Dual-path Magnitude (DM) network, designed to effectively address multiple distortions, including noise, reverberation, and bandwidth degradation. The DM network employs two parallel magnitude decoders that share parameters: one uses a masking-based algorithm for distortion removal, and the other employs a mapping-based approach for speech restoration. A novel aspect of the DM network is the integration of the magnitude spectrogram output from the masking decoder into the mapping decoder through a skip connection, enhancing the overall restoration capability. This integrated approach overcomes the inherent limitations observed in previous models, as detailed in a step-by-step analysis. Experimental results demonstrate that the DM network outperforms other baseline models in overall general speech restoration performance, achieving substantial restoration with fewer parameters.
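For illustration, a minimal PyTorch sketch of the dual-path idea is given below. The layer types and sizes, and the shared-decoder-body-with-two-heads reading of the parameter sharing, are assumptions for exposition rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DualPathMagnitudeSketch(nn.Module):
    """Illustrative sketch: a shared decoder body feeds a masking head and a
    mapping head, with the masked magnitude skip-connected into the mapping path."""
    def __init__(self, freq_bins=257, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(freq_bins, hidden, batch_first=True)
        self.shared_decoder = nn.GRU(hidden, hidden, batch_first=True)  # shared parameters
        self.mask_head = nn.Sequential(nn.Linear(hidden, freq_bins), nn.Sigmoid())
        self.map_head = nn.Sequential(nn.Linear(hidden + freq_bins, freq_bins), nn.ReLU())

    def forward(self, noisy_mag):                      # (batch, time, freq)
        h, _ = self.encoder(noisy_mag)
        d, _ = self.shared_decoder(h)
        masked_mag = self.mask_head(d) * noisy_mag     # masking path: distortion removal
        # Skip connection: the masked magnitude is fed into the mapping path.
        mapped_mag = self.map_head(torch.cat([d, masked_mag], dim=-1))
        return masked_mag, mapped_mag
```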
Abstract: This report describes the submission of the HYU ASML team to the IEEE Signal Processing Cup 2024 (SP Cup 2024). This challenge, titled "ROBOVOX: Far-Field Speaker Recognition by a Mobile Robot," focuses on speaker recognition using a mobile robot in noisy and reverberant conditions. Our solution combines the results of deep residual neural network-based and time-delay neural network-based speaker embedding models. These models were trained on a diverse dataset that includes French speech. To account for the challenging evaluation environment, characterized by high noise, reverberation, and short-duration speech, we focused on data augmentation and the training speech duration for the speaker embedding model. Our submission achieved second place on the SP Cup 2024 public leaderboard, with a detection cost function of 0.5245 and an equal error rate of 6.46%.
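A minimal sketch of score-level fusion of two speaker-embedding systems is shown below, assuming cosine scoring and a weighted average of per-system scores; the scoring back-end and fusion weights used in the actual submission may differ.

```python
import numpy as np

def cosine_score(emb_enroll, emb_test):
    """Cosine similarity between an enrollment and a test speaker embedding."""
    a = emb_enroll / np.linalg.norm(emb_enroll)
    b = emb_test / np.linalg.norm(emb_test)
    return float(np.dot(a, b))

def fused_score(scores, weights=None):
    """Weighted score-level fusion of several systems (equal weights by default)."""
    weights = np.ones(len(scores)) / len(scores) if weights is None else np.asarray(weights)
    return float(np.dot(weights, scores))

# e.g., fuse a ResNet-based and a TDNN-based system for one trial:
# trial_score = fused_score([cosine_score(resnet_enroll, resnet_test),
#                            cosine_score(tdnn_enroll, tdnn_test)])
```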
Abstract: Transformer-based end-to-end neural speaker diarization (EEND) models utilize the multi-head self-attention (SA) mechanism to enable accurate speaker label prediction in overlapped speech regions. In this study, to enhance the training effectiveness of SA-EEND models, we propose the use of auxiliary losses for the SA heads of the transformer layers. Specifically, we assume that the attention weight matrices of an SA layer are redundant if their patterns are similar to those of the identity matrix. We then explicitly constrain such matrices to exhibit specific speaker activity patterns relevant to voice activity detection or overlapped speech detection tasks. Consequently, we expect the proposed auxiliary losses to guide the transformer layers to exhibit more diverse patterns in the attention weights, thereby reducing the assumed redundancies in the SA heads. The effectiveness of the proposed method is demonstrated using the simulated and CALLHOME datasets for two-speaker diarization tasks, reducing the diarization error rate of the conventional SA-EEND model by 32.58% and 17.11%, respectively.
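One plausible way to realize such an auxiliary loss is sketched below; constructing the target attention pattern from frame-level activity labels in this way is an illustrative assumption, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def attention_auxiliary_loss(attn_weights, speaker_activity):
    """
    attn_weights:     (batch, time, time) attention matrix of one SA head.
    speaker_activity: (batch, time) binary frame labels (e.g., VAD or overlap).
    Illustrative target: each frame attends to frames sharing its activity state,
    normalized so each row is a valid attention distribution.
    """
    same_state = (speaker_activity.unsqueeze(2) == speaker_activity.unsqueeze(1)).float()
    target = same_state / same_state.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return F.mse_loss(attn_weights, target)
```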
Abstract: In the acoustic scene classification (ASC) task, an acoustic scene consists of diverse attributes and is inferred by identifying combinations of distinct attributes among them. This study aims to extract and cluster these attributes effectively using a multiple-instance learning (MIL) framework for ASC. MIL, a weakly supervised learning method, extracts instances from input data and infers the scene corresponding to that data from the unlabeled instances. We develop a MIL framework better suited to ASC systems, adopting instance-level labels and an instance-level loss, which are effective for extracting and clustering instances. As a result, the witness rate increases significantly compared to the framework without instance-level loss and labels. Moreover, in several MIL-based ASC systems, the classification accuracy improves by about 5 to 11\% over systems without the instance-level loss. In addition, we design a fully separated convolutional module, a low-complexity neural network consisting of pointwise, frequency-sided depthwise, and temporal-sided depthwise convolutional filters. Considering both complexity and performance, our proposed system is more practical than previous systems on the DCASE 2019 challenge task 1-A leaderboard. We surpass the third-place model, achieving an accuracy of 82.3\% with a model complexity of only 417K parameters, at least 40 times fewer than other systems.
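The fully separated convolutional module can be sketched in PyTorch as follows; kernel sizes and channel counts here are illustrative choices, not the paper's exact configuration.

```python
import torch.nn as nn

class FullySeparatedConv(nn.Module):
    """Sketch of a fully separated conv block: pointwise convolution, then a
    depthwise convolution along frequency only, then one along time only."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.freq_depthwise = nn.Conv2d(out_ch, out_ch, kernel_size=(k, 1),
                                        padding=(k // 2, 0), groups=out_ch)
        self.time_depthwise = nn.Conv2d(out_ch, out_ch, kernel_size=(1, k),
                                        padding=(0, k // 2), groups=out_ch)

    def forward(self, x):          # x: (batch, channels, freq, time)
        return self.time_depthwise(self.freq_depthwise(self.pointwise(x)))
```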
Abstract: Developing a single-microphone speech denoising or dereverberation front-end for robust automatic speaker verification (ASV) in noisy far-field speaking scenarios is challenging. To address this problem, we present a novel front-end design that involves a recently proposed extension of the weighted prediction error (WPE) speech dereverberation algorithm, the virtual acoustic channel expansion (VACE)-WPE. It is demonstrated experimentally in this study that, unlike the conventional WPE algorithm, the VACE-WPE can be explicitly trained to cancel out both late reverberation and background noise. To build the front-end, the VACE-WPE is first independently (pre)trained to produce "noisy" dereverberated signals. Subsequently, given a pretrained speaker embedding model, the VACE-WPE is additionally fine-tuned within a task-specific optimization (TSO) framework, causing the speaker embedding extracted from the processed signal to be similar to that extracted from the "noise-free" target signal. Moreover, to extend the application of the proposed front-end to more general, unconstrained "in-the-wild" ASV scenarios beyond controlled far-field conditions, we propose a distortion regularization method for the VACE-WPE within the TSO framework. The effectiveness of the proposed approach is verified on both far-field and in-the-wild ASV benchmarks, demonstrating its superiority over fully neural front-ends and other TSO methods in various cases.
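A hedged sketch of a TSO-style objective with a distortion regularizer is given below; the cosine embedding loss and the regularization toward the pretrained front-end's output are assumed forms for exposition, not the paper's exact definitions.

```python
import torch
import torch.nn.functional as F

def tso_loss(emb_model, processed, target_clean, pretrained_output, alpha=1.0):
    """
    Illustrative task-specific optimization (TSO) objective:
      - pull the speaker embedding of the processed signal toward that of the
        noise-free target signal;
      - add a distortion regularizer keeping the fine-tuned front-end's output
        close to the pretrained front-end's output (assumed form).
    """
    e_proc = emb_model(processed)
    e_tgt = emb_model(target_clean).detach()
    emb_loss = 1.0 - F.cosine_similarity(e_proc, e_tgt, dim=-1).mean()
    distortion_reg = F.l1_loss(processed, pretrained_output.detach())
    return emb_loss + alpha * distortion_reg
```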
Abstract: The remarkable performance of pre-trained language models (LMs) using self-supervised learning has led to a major paradigm shift in natural language processing research. In line with these changes, improving speech recognition systems by leveraging massive deep learning-based LMs is a major topic of speech recognition research. Among the various methods of applying LMs to speech recognition systems, in this paper we focus on a cross-modal knowledge distillation method that transfers knowledge between two types of deep neural networks with different modalities. We propose an acoustic model structure with multiple auxiliary output layers for cross-modal distillation and demonstrate that the proposed method effectively compensates for the shortcomings of the existing label-interpolation-based distillation method. In addition, we extend the proposed method to a hierarchical distillation method using LMs trained on different units (senones, monophones, and subwords) and reveal the effectiveness of the hierarchical distillation method through an ablation study.
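A minimal sketch of a hierarchical, multi-unit distillation loss is shown below, assuming LM-derived soft targets per unit and KL-divergence terms; the exact loss and interpolation scheme in the paper may differ.

```python
import torch.nn.functional as F

def hierarchical_distillation_loss(am_outputs, lm_soft_targets, weights):
    """
    am_outputs:      dict of acoustic-model logits per unit, e.g.
                     {"senone": ..., "monophone": ..., "subword": ...}
    lm_soft_targets: dict of LM-derived soft label distributions per unit
    weights:         dict of per-unit loss weights (illustrative)
    """
    loss = 0.0
    for unit, logits in am_outputs.items():
        log_probs = F.log_softmax(logits, dim=-1)
        # KL divergence between LM soft targets and AM predictions for this unit
        loss = loss + weights[unit] * F.kl_div(log_probs, lm_soft_targets[unit],
                                               reduction="batchmean")
    return loss
```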
Abstract: Attribution methods calculate attributions that visually explain the predictions of deep neural networks (DNNs) by highlighting important parts of the input features. In particular, gradient-based attribution (GBA) methods are widely used because they can be easily implemented through automatic differentiation. In this study, we use attributions to filter out irrelevant parts of the input features and then verify the effectiveness of this approach by measuring the classification accuracy of a pre-trained DNN. This is achieved by calculating and applying an \textit{attribution mask} to the input features and subsequently introducing the masked features to the DNN, where the mask is designed to recursively focus attention on the parts of the input related to the target label. The accuracy is enhanced under a certain condition, i.e., \textit{no implicit bias}, which can be derived based on our theoretical insight into compressing the DNN into a single-layer neural network. We also provide Gradient\,*\,Sign-of-Input (GxSI) to obtain an attribution mask that further improves the accuracy. As an example, on CIFAR-10 modified using the attribution mask obtained from GxSI, we achieve accuracy ranging from 99.8\% to 99.9\% without additional training.
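A single masking pass with GxSI might look like the following sketch; the paper applies the mask recursively, and the top-k thresholding rule here is an illustrative assumption.

```python
import torch

def gxsi_attribution_mask(model, x, target, keep_ratio=0.5):
    """Gradient * Sign-of-Input (GxSI) attribution, followed by a binary mask
    that keeps the top `keep_ratio` fraction of features (one pass only)."""
    x = x.clone().requires_grad_(True)
    score = model(x)[..., target].sum()          # target-class logit
    grad, = torch.autograd.grad(score, x)
    attribution = grad * torch.sign(x)           # GxSI
    k = max(1, int(keep_ratio * attribution.numel()))
    thresh = attribution.flatten().topk(k).values.min()
    mask = (attribution >= thresh).float()
    return mask * x.detach(), attribution
```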
Abstract: Speech dereverberation is an important issue for many real-world speech processing applications. Among the techniques developed, the weighted prediction error (WPE) algorithm, which blindly cancels out the late reverberation component from the reverberant mixture of microphone signals, has been widely adopted and advanced over the last decade. In this study, we extend the neural-network-based virtual acoustic channel expansion (VACE) framework for WPE-based speech dereverberation, a variant of the WPE that we recently proposed to enable the use of the dual-channel WPE algorithm in a single-microphone speech dereverberation scenario. Building on the previous study, we conduct ablation studies regarding the constituents of the VACE-WPE in an offline processing scenario. These studies help us understand the dynamics of the system, thereby simplifying the architecture and leading to new strategies for training the neural network for the VACE. Experimental results in noisy reverberant environments reveal that the VACE-WPE considerably outperforms its single-channel counterpart in terms of objective speech quality and is complementary to the single-channel WPE when employed as the front-end for a far-field automatic speech recognizer.
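The overall VACE-WPE wiring can be sketched conceptually as below, where `vace_net` and `dual_channel_wpe` are hypothetical callables standing in for the virtual-channel generator and the dual-channel WPE routine.

```python
import torch

def vace_wpe_step(vace_net, dual_channel_wpe, x_stft):
    """
    Conceptual VACE-WPE pipeline (illustrative wiring only):
      1. The VACE network generates a *virtual* second channel from the
         single observed STFT.
      2. Observed and virtual channels are stacked and passed to a
         dual-channel WPE dereverberation routine.
      3. Only the output corresponding to the observed channel is kept.
    """
    x_virtual = vace_net(x_stft)                        # (freq, time) virtual channel
    stacked = torch.stack([x_stft, x_virtual], dim=0)   # (2, freq, time)
    dereverbed = dual_channel_wpe(stacked)              # (2, freq, time)
    return dereverbed[0]
```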
Abstract: Distant speech recognition is challenging, particularly owing to the corruption of speech signals by reverberation caused by large distances between the speaker and microphone. To cope with a wide range of reverberation in real-world situations, we present novel approaches for acoustic modeling, including an ensemble of deep neural networks (DNNs) and an ensemble of jointly trained DNNs. First, multiple DNNs are established in a setup step, each corresponding to a different reverberation time (RT60). Each model in the ensemble of DNN acoustic models is then further jointly trained, covering both feature mapping and acoustic modeling, where the feature mapping is designed as a dereverberation front-end. In the testing phase, the two most likely DNNs are chosen from the ensemble using maximum a posteriori (MAP) probabilities, computed online via maximum likelihood (ML)-based blind RT60 estimation, and the posterior probability outputs of the two DNNs are then combined as a weighted average using the ML-based weights. Extensive experiments demonstrate that the proposed approach leads to substantial improvements in speech recognition accuracy over conventional DNN baseline systems under diverse reverberant conditions.
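The model selection and posterior combination step can be sketched as follows; the likelihood-proportional weighting is an illustrative reading of the ML-based weights described above.

```python
import numpy as np

def combine_ensemble_posteriors(dnn_posteriors, model_likelihoods):
    """
    dnn_posteriors:    list of (frames, senones) posterior matrices, one per
                       RT60-specific DNN in the ensemble.
    model_likelihoods: per-model likelihoods from blind RT60 estimation.
    Pick the two most likely models and average their posteriors with
    likelihood-derived weights (illustrative weighting).
    """
    lik = np.asarray(model_likelihoods, dtype=float)
    top2 = np.argsort(lik)[-2:]                 # indices of the two most likely DNNs
    w = lik[top2] / lik[top2].sum()             # normalized ML-based weights
    return w[0] * dnn_posteriors[top2[0]] + w[1] * dnn_posteriors[top2[1]]
```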