Abstract: We propose a categorical approach for unsupervised variational acoustic clustering of audio data in the time-frequency domain. Using a categorical distribution over the latent clusters enforces sharper clustering even when data points strongly overlap in time and frequency, as is the case for most urban acoustic scene datasets. To this end, we use a Gumbel-Softmax distribution as a soft approximation to the categorical distribution, allowing training via backpropagation. In this setting, the softmax temperature serves as the main mechanism to tune clustering performance. The results show that the proposed model obtains strong clustering performance on all considered datasets, even when data points strongly overlap in time and frequency.
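As an illustrative sketch (not the authors' implementation), the Gumbel-Softmax relaxation referred to above can be sampled as follows; `logits` stands for the unnormalized cluster log-probabilities produced by an encoder and `tau` is the softmax temperature, both names being assumptions for this example.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Draw a differentiable, approximately one-hot sample from a categorical
    distribution parameterized by `logits` (Gumbel-Softmax / Concrete relaxation)."""
    # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1).
    u = torch.rand_like(logits)
    g = -torch.log(-torch.log(u + 1e-20) + 1e-20)
    # Low tau -> samples close to one-hot (sharper cluster assignments);
    # high tau -> smoother, more uniform assignments.
    return F.softmax((logits + g) / tau, dim=-1)

# Example: soft assignment of one latent vector to 10 clusters.
cluster_logits = torch.randn(1, 10)
soft_assignment = gumbel_softmax_sample(cluster_logits, tau=0.5)
```

PyTorch also ships a built-in `F.gumbel_softmax` with equivalent behavior; the explicit version above makes the role of the temperature visible.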
Abstract: We propose a \emph{hybrid} real- and complex-valued \emph{neural network} (HNN) architecture, designed to combine the computational efficiency of real-valued processing with the ability to effectively handle complex-valued data. We illustrate the limitations of using real-valued neural networks (RVNNs) for inherently complex-valued problems by showing how such a network learns to perform complex-valued convolution, but with notable inefficiencies stemming from its real-valued constraints. To create the HNN, we propose building blocks containing both real- and complex-valued paths, where information is exchanged between domains through domain conversion functions. We also introduce novel complex-valued activation functions with higher generalisation and parameterisation efficiency. HNN-specific architecture search techniques are described to navigate the larger solution space. Experiments with the AudioMNIST dataset demonstrate that the HNN reduces cross-entropy loss and uses fewer parameters than an RVNN in all considered cases. These results highlight the potential of partially complex-valued processing in neural networks and the applicability of HNNs across many signal processing domains.
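A minimal sketch of one possible hybrid building block is given below, assuming PyTorch; the specific domain-conversion choice (feeding complex magnitudes into the real path) and the layer sizes are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ComplexLinear(nn.Module):
    """Complex-valued linear layer built from two real-valued weight matrices:
    (W_r + i W_i)(x_r + i x_i) = (W_r x_r - W_i x_i) + i (W_r x_i + W_i x_r)."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.re = nn.Linear(in_features, out_features)
        self.im = nn.Linear(in_features, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x is complex-valued
        xr, xi = x.real, x.imag
        return torch.complex(self.re(xr) - self.im(xi), self.re(xi) + self.im(xr))

class HybridBlock(nn.Module):
    """One hybrid block: a real path and a complex path run in parallel and
    exchange information through a simple domain-conversion function."""
    def __init__(self, real_dim: int, complex_dim: int):
        super().__init__()
        self.real_path = nn.Linear(real_dim + complex_dim, real_dim)
        self.cplx_path = ComplexLinear(complex_dim, complex_dim)

    def forward(self, x_real, x_cplx):
        # Domain conversion (complex -> real): pass the complex magnitudes
        # alongside the real features into the real-valued path.
        real_in = torch.cat([x_real, x_cplx.abs()], dim=-1)
        y_real = torch.relu(self.real_path(real_in))
        y_cplx = self.cplx_path(x_cplx)
        return y_real, y_cplx
```

A conversion in the opposite direction (e.g. predicting an imaginary component from real features) could be added symmetrically; the point of the sketch is only that both domains coexist within one block and communicate through explicit conversion functions.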
Abstract: We propose an unsupervised variational acoustic clustering model for clustering audio data in the time-frequency domain. The model leverages variational inference, extended to an autoencoder framework, with a Gaussian mixture model as a prior for the latent space. Specifically designed for audio applications, we introduce a convolutional-recurrent variational autoencoder optimized for efficient time-frequency processing. Our experimental results on a spoken-digits dataset demonstrate a significant improvement in accuracy and clustering performance compared to traditional methods, showcasing the model's enhanced ability to capture complex audio patterns.
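The exact objective used in the paper may differ; as an illustrative sketch, a Gaussian-mixture prior over the latent variable $z$ with cluster assignment $c$ typically yields an evidence lower bound of the form
\[
p(z) = \sum_{k=1}^{K} \pi_k \,\mathcal{N}\!\left(z \mid \mu_k, \sigma_k^{2} I\right),
\qquad
\mathcal{L}(x) = \mathbb{E}_{q_\phi(z,c \mid x)}\!\left[\log p_\theta(x \mid z)\right]
- D_{\mathrm{KL}}\!\left(q_\phi(z,c \mid x)\,\big\|\,p(z,c)\right),
\]
where $\pi_k$, $\mu_k$, and $\sigma_k^{2}$ are the mixture weights, means, and variances, $q_\phi$ is the convolutional-recurrent encoder, and $p_\theta$ is the decoder; cluster membership follows from the posterior over $c$.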
Abstract: We propose a speaker selection mechanism (SSM) for training an end-to-end beamforming neural network, based on recent findings that a listener usually looks toward the target speaker with a certain undershot angle. During training, the mechanism allows the neural network model to learn which speaker to focus on in a multi-speaker scenario, based on the positions of the listener and the speakers; only audio information is required during inference. We perform acoustic simulations demonstrating the feasibility and performance of training with the SSM. The results show a significant improvement in speech intelligibility, quality, and distortion metrics compared to the minimum variance distortionless response filter and to the same neural network model trained without the SSM. The success of the proposed method is a significant step toward solving the cocktail party problem.
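One way such a selection rule could look is sketched below; the geometry, the fixed undershot value, and all names are assumptions made for illustration, not the mechanism described in the paper.

```python
import numpy as np

def select_target_speaker(listener_pos, listener_look_dir, speaker_positions,
                          undershot_deg=10.0):
    """Illustrative speaker selection: pick the speaker whose direction, as seen
    from the listener, best matches the listener's gaze corrected by a fixed
    undershot angle (listeners tend to slightly under-rotate toward the target)."""
    look = np.asarray(listener_look_dir, dtype=float)
    look /= np.linalg.norm(look)
    best_idx, best_err = None, np.inf
    for i, pos in enumerate(speaker_positions):
        v = np.asarray(pos, dtype=float) - np.asarray(listener_pos, dtype=float)
        v /= np.linalg.norm(v)
        angle = np.degrees(np.arccos(np.clip(np.dot(look, v), -1.0, 1.0)))
        # Prefer the speaker whose angular offset from the gaze is closest
        # to the expected undershot angle.
        err = abs(angle - undershot_deg)
        if err < best_err:
            best_idx, best_err = i, err
    return best_idx  # index of the speaker used as the training target
```

The selected index would only label the target during training; at inference time the trained beamformer operates on audio alone, as stated in the abstract.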
Abstract: We propose and analyze the use of an explicit time-context window for neural network-based spectral masking speech enhancement, leveraging signal context dependencies between neighboring frames. In particular, we concentrate on soft masking with the loss computed on the time-frequency representation of the reconstructed speech. We show that applying a time-context windowing function at both the input and output of the neural network model improves the soft mask estimation process by combining multiple estimates taken from different contexts. The proposed approach is applied only as a post-optimization step at inference time, requiring no additional layers or special training of the neural network model. Our results show that the method consistently increases both the intelligibility and signal quality of the denoised speech, as demonstrated for two classes of convolutional speech enhancement models. Importantly, the proposed method requires only a negligible ($\leq1\%$) increase in the number of model parameters, making it suitable for hardware-constrained applications.
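A minimal sketch of this kind of inference-time combination is shown below; `model` is assumed to be a callable mapping a (frequency x context) segment of the noisy spectrogram to a same-shaped soft mask, and the averaging rule is an assumption, not necessarily the combination used in the paper.

```python
import numpy as np

def windowed_mask_estimate(noisy_spec, model, context=11, hop=1):
    """Illustrative post-processing: slide a time-context window over the noisy
    spectrogram, estimate a soft mask for each windowed segment, and average the
    overlapping estimates per frame."""
    freq_bins, num_frames = noisy_spec.shape
    mask_sum = np.zeros((freq_bins, num_frames), dtype=float)
    counts = np.zeros(num_frames)
    for start in range(0, num_frames - context + 1, hop):
        segment = noisy_spec[:, start:start + context]
        est = model(segment)                    # soft mask for this context window
        mask_sum[:, start:start + context] += est
        counts[start:start + context] += 1
    counts = np.maximum(counts, 1)              # avoid division by zero at the edges
    return mask_sum / counts[None, :]           # averaged mask, same shape as input
```

Because only windowing and averaging are added around an already trained model, the approach adds no trainable layers and leaves the parameter count essentially unchanged, consistent with the $\leq1\%$ overhead stated above.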