Abstract: The estimation of room impulse responses (RIRs) between static loudspeaker and microphone locations can be performed using a number of well-established measurement and inference procedures. While these procedures assume a time-invariant acoustic system, time variations need to be considered in spatially dynamic scenarios where loudspeakers and microphones are subject to movement. If the RIR is modeled using image sources, then movement implies that the distance to each image source varies over time, making the estimation of the spatially dynamic RIR particularly challenging. In this paper, we propose a procedure to estimate the early part of the spatially dynamic RIR between a stationary source and a microphone moving on a linear trajectory at constant velocity. The procedure is built upon a state-space model, where the state to be estimated represents the early RIR, the observation corresponds to a microphone recording in a spatially dynamic scenario, and the time-varying distances to the image sources are incorporated into a state transition matrix obtained from static RIRs at the start and end points of the trajectory. The performance of the proposed approach is evaluated against state-of-the-art RIR interpolation and state-space estimation methods using simulations, demonstrating the potential of the proposed state-space model.
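As a rough illustration of the state-space machinery involved, the sketch below runs a standard Kalman recursion in which the state is the early RIR and each microphone sample is a scalar observation formed by convolving the loudspeaker signal with the state. The construction of the transition matrix `A` from the static endpoint RIRs is the paper's actual contribution and is taken as a given input here; all names are illustrative.

```python
import numpy as np

def kalman_rir_tracker(y, x, A, Q, r, h0, P0):
    """Track a time-varying early RIR h_t with a linear state-space model.

    y      : (T,) microphone samples (observations)
    x      : (T,) loudspeaker (source) samples
    A      : (L, L) state transition matrix; in the paper this encodes the
             time-varying image-source distances (here it is just an input)
    Q      : (L, L) process noise covariance
    r      : scalar observation noise variance
    h0, P0 : initial state mean and covariance
    """
    L = h0.size
    h, P = h0.copy(), P0.copy()
    est = np.zeros((len(y), L))
    xbuf = np.zeros(L)                      # most recent L source samples
    for t in range(len(y)):
        xbuf = np.roll(xbuf, 1)
        xbuf[0] = x[t]
        # predict
        h = A @ h
        P = A @ P @ A.T + Q
        # update: observation y[t] = xbuf @ h + noise (convolution row)
        s = xbuf @ P @ xbuf + r             # innovation variance
        k = (P @ xbuf) / s                  # Kalman gain
        h = h + k * (y[t] - xbuf @ h)
        P = P - np.outer(k, xbuf @ P)
        est[t] = h
    return est
```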
Abstract: Reverberation may severely degrade the quality of speech signals recorded using microphones in a room. For compact microphone arrays, the choice of the reference microphone for multi-microphone dereverberation typically does not have a large influence on the dereverberation performance. In contrast, when the microphones are spatially distributed, the choice of the reference microphone may significantly affect the dereverberation performance. In this paper, we propose to perform reference microphone selection for the weighted prediction error (WPE) dereverberation algorithm based on the normalized $\ell_p$-norm of the dereverberated output signal. Experimental results for different source positions in a reverberant laboratory show that the proposed method yields a better dereverberation performance than reference microphone selection based on the early-to-late reverberation ratio or signal power.
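As an illustration of the selection criterion, the sketch below picks the reference microphone whose WPE output has the smallest normalized $\ell_p$-norm, using normalization by the $\ell_2$-norm as a plausible assumption; the paper's exact normalization and the WPE processing itself (e.g., one run per candidate reference, as with the `nara_wpe` package) are not reproduced here.

```python
import numpy as np

def select_reference(dereverb_outputs, p=0.5):
    """Pick the reference microphone whose WPE output has the smallest
    normalized l_p-norm (a sparsity proxy: dereverberated speech tends to
    be sparser). dereverb_outputs[m] is the STFT-domain WPE output obtained
    with microphone m as reference, shape (frames, bins).

    The normalization used here (by the l_2-norm) is an assumption; the
    paper's exact definition may differ.
    """
    costs = []
    for d in dereverb_outputs:
        mag = np.abs(d).ravel()
        lp = np.sum(mag ** p) ** (1.0 / p)
        costs.append(lp / (np.linalg.norm(mag) + 1e-12))
    return int(np.argmin(costs))
```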
Abstract: The identification of siren sounds in urban soundscapes is a crucial safety aspect for smart vehicles and has been widely addressed by means of neural networks that ensure robustness to both the diversity of siren signals and the strong and unstructured background noise characterizing traffic. Convolutional neural networks analyzing spectrogram features of incoming signals achieve state-of-the-art performance when enough training data capturing the diversity of the target acoustic scenes is available. In practice, however, data is usually limited, and algorithms should adapt robustly to unseen acoustic conditions without requiring extensive datasets for re-training. In this work, given the harmonic nature of siren signals, characterized by a periodically evolving fundamental frequency, we propose a low-complexity feature extraction method based on frequency tracking using a single-parameter adaptive notch filter. The features are then used to design a small-scale convolutional network suitable for training with limited data. The evaluation results indicate that the proposed model consistently outperforms the traditional spectrogram-based model when limited training data is available, achieves better cross-domain generalization, and has a smaller size.
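A minimal sketch of a single-parameter adaptive notch filter of the kind described is given below, assuming the standard constrained second-order IIR structure with coefficient $a = -2\cos(\omega_0)$ and an LMS-style gradient update; the paper's exact filter structure and update rule may differ.

```python
import numpy as np

def track_fundamental(x, fs, rho=0.95, mu=1e-3, f0_init=500.0):
    """Single-parameter adaptive notch filter (ANF) frequency tracker.

    A constrained second-order IIR notch H(z) = (1 + a z^-1 + z^-2) /
    (1 + rho*a z^-1 + rho^2 z^-2) has a single coefficient a = -2*cos(w0).
    Adapting `a` by an LMS-style gradient step on the notch output drives
    the notch onto the dominant sinusoid, yielding a per-sample frequency
    estimate. This is a sketch of a textbook ANF, not the paper's exact
    update rule.
    """
    a = -2.0 * np.cos(2 * np.pi * f0_init / fs)
    s1 = s2 = 0.0                               # state of the all-pole section
    f0 = np.zeros(len(x))
    for n, xn in enumerate(x):
        s0 = xn - rho * a * s1 - rho**2 * s2    # all-pole part
        e = s0 + a * s1 + s2                    # notch output (error)
        a -= mu * e * s1                        # simplified gradient step
        a = np.clip(a, -2.0, 2.0)               # keep cos(w0) in [-1, 1]
        s2, s1 = s1, s0
        f0[n] = fs / (2 * np.pi) * np.arccos(np.clip(-a / 2.0, -1.0, 1.0))
    return f0
```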
Abstract: A one-shot algorithm called iterationless DANSE (iDANSE) is introduced to perform distributed adaptive node-specific signal estimation (DANSE) in a fully connected wireless acoustic sensor network (WASN) deployed in an environment with non-overlapping latent signal subspaces. The iDANSE algorithm matches the performance of a centralized algorithm in a single processing cycle while devices exchange fused versions of their multichannel local microphone signals. Key advantages of iDANSE over currently available solutions are its iterationless nature, which favors deployment in real-time applications, and the fact that devices can exchange fewer fused signals than the number of latent sources in the environment. The proposed method is validated in numerical simulations including a speech enhancement scenario.
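A conceptual sketch of the fuse-and-share signal flow common to DANSE-type algorithms is given below; the specific iDANSE fusion rule that achieves centralized performance in a single cycle is the paper's contribution and is not reproduced here, so the fusion matrices appear as given inputs.

```python
import numpy as np

def fuse_and_estimate(local_sigs, fusion_mats, node, w):
    """Conceptual fuse-and-share signal flow of DANSE-type algorithms.

    local_sigs[k]  : (M_k, T) microphone signals of node k
    fusion_mats[k] : (M_k, Q_k) fusion matrix of node k (Q_k < M_k); how
                     iDANSE chooses these to match the centralized solution
                     in one shot is the paper's contribution and is NOT
                     reproduced here
    node           : index of the node forming its estimate
    w              : filter applied to the stacked observation
    """
    # every node broadcasts only its fused (compressed) channels
    fused = [F.conj().T @ s for s, F in zip(local_sigs, fusion_mats)]
    # node `node` keeps its raw mics and stacks the others' fused signals
    stacked = np.vstack([local_sigs[node]] +
                        [z for k, z in enumerate(fused) if k != node])
    return w.conj().T @ stacked            # node-specific signal estimate
```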
Abstract: A low-rank approximation-based version of the topology-independent distributed adaptive node-specific signal estimation (TI-DANSE) algorithm is introduced, using a generalized eigenvalue decomposition (GEVD) for application in ad-hoc wireless acoustic sensor networks. Both this TI-GEVD-DANSE algorithm and the original TI-DANSE algorithm exhibit non-strict convergence, which can lead to numerical instability over time, particularly in scenarios where the estimation of accurate spatial covariance matrices is challenging. An adaptive filter-coefficient normalization strategy is proposed to mitigate this issue and enable stable performance of TI-(GEVD-)DANSE. The method is validated in numerical simulations including dynamic acoustic scenarios, demonstrating the importance of the additional normalization.
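The GEVD-based low-rank approximation at the core of such algorithms can be sketched as follows, using SciPy's generalized eigensolver; this is the standard building block only, not the full TI-GEVD-DANSE update nor the proposed normalization strategy.

```python
import numpy as np
from scipy.linalg import eigh

def gevd_low_rank(Ryy, Rnn, rank):
    """GEVD-based rank-`rank` approximation of the desired-signal
    covariance, as used in GEVD-based DANSE variants. Solves the
    generalized problem Ryy q = lam Rnn q and keeps the `rank` largest
    eigenvalues; a standard building block, not the full algorithm.
    """
    lam, Q = eigh(Ryy, Rnn)            # generalized EVD, ascending order
    lam, Q = lam[::-1], Q[:, ::-1]     # sort descending
    Qh_inv = np.linalg.inv(Q.conj().T)
    # with Ryy = Rss + Rnn, the generalized eigenvalues of the noise-only
    # subspace are ~1, so subtracting 1 removes the noise contribution
    D = np.diag(np.maximum(lam[:rank] - 1.0, 0.0))
    return Qh_inv[:, :rank] @ D @ Qh_inv[:, :rank].conj().T
```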
Abstract: In many speech recording applications, the recorded desired speech is corrupted by both noise and acoustic echo, such that combined noise reduction (NR) and acoustic echo cancellation (AEC) is called for. A common cascaded design corresponds to NR filters preceding AEC filters. These NR filters aim at reducing the near-end room noise (and possibly partially the echo) and operate on the microphones only, consequently requiring the AEC filters to model both the echo paths and the NR filters. In this paper, however, we propose a design with extended NR (NRext) filters preceding AEC filters, under the assumption that the echo paths are additive maps, i.e., that they preserve the addition operation. Here, the NRext filters aim at reducing both the near-end room noise and the far-end room noise component in the echo, and operate on both the microphones and the loudspeakers. We show that, remarkably, the succeeding AEC filters become independent of the NRext filters, such that the AEC filters are only required to model the echo paths, improving the AEC performance. Furthermore, the degrees of freedom in the NRext filters scale with the number of loudspeakers, which is not the case for the NR filters, resulting in an improved NR performance.
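A condensed, single-microphone and single-loudspeaker rendition of the additivity argument might read as follows, with assumed symbols ($m$: microphone signal, $d$: near-end speech plus noise, $x = s_\mathrm{f} + n_\mathrm{f}$: loudspeaker signal consisting of far-end speech and far-end noise, $F$: echo path, an additive map):

```latex
\begin{align*}
  m &= d + F(x), \qquad x = s_\mathrm{f} + n_\mathrm{f} \\
  % NRext cleans the loudspeaker reference and removes the corresponding
  % far-end noise contribution from the microphone signal:
  \tilde{x} &= x - \hat{n}_\mathrm{f} \\
  \tilde{m} &= m - F(\hat{n}_\mathrm{f})
             = d + F(x) - F(\hat{n}_\mathrm{f})
             = d + F\!\left(x - \hat{n}_\mathrm{f}\right)
             = d + F(\tilde{x})
\end{align*}
% Since F is additive, the echo in the NRext microphone output is exactly
% F applied to the NRext loudspeaker output, so the succeeding AEC only
% has to model F, independently of the NRext filters.
```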
Abstract: In the last three decades, the Steered Response Power (SRP) method has been widely used for the task of Sound Source Localization (SSL), due to its satisfactory localization performance in moderately reverberant and noisy scenarios. Many works have analyzed and extended the original SRP method to reduce its computational cost, to allow it to locate multiple sources, or to improve its performance in adverse environments. In this work, we review over 200 papers on the SRP method and its variants, with emphasis on the SRP-PHAT method. We also present eXtensible-SRP, or X-SRP, a generalized and modularized version of the SRP algorithm that allows the reviewed extensions to be implemented. We provide a Python implementation of the algorithm which includes selected extensions from the literature.
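For reference, a minimal conventional SRP-PHAT localizer (not the X-SRP implementation itself) can be sketched as follows, summing GCC-PHAT values at the TDOAs implied by each candidate location:

```python
import numpy as np

def srp_phat(frames, mic_pos, cand_pts, fs, c=343.0):
    """Minimal conventional SRP-PHAT: sum GCC-PHAT values at the TDOAs
    implied by each candidate source location; return the argmax.

    frames   : (M, N) one signal frame per microphone
    mic_pos  : (M, 3) microphone coordinates
    cand_pts : (P, 3) candidate source locations (search grid)
    """
    M, N = frames.shape
    X = np.fft.rfft(frames, axis=1)
    power = np.zeros(len(cand_pts))
    # distances from every candidate point to every microphone: (P, M)
    dist = np.linalg.norm(cand_pts[:, None, :] - mic_pos[None, :, :], axis=2)
    for i in range(M):
        for j in range(i + 1, M):
            cross = X[i] * np.conj(X[j])
            cross /= np.abs(cross) + 1e-12          # PHAT weighting
            gcc = np.fft.irfft(cross, n=N)          # GCC-PHAT, lag domain
            tdoa = (dist[:, i] - dist[:, j]) / c    # expected TDOAs
            lags = np.round(tdoa * fs).astype(int) % N  # wrap negative lags
            power += gcc[lags]
    return cand_pts[np.argmax(power)]
```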
Abstract: Steered Response Power (SRP) is a widely used method for the task of sound source localization using microphone arrays, showing satisfactory localization performance in many practical scenarios. However, its performance is diminished in highly reverberant environments. Although Deep Neural Networks (DNNs) have been previously proposed to overcome this limitation, most are trained for a specific number of microphones with fixed spatial coordinates. This restricts their practical application in scenarios frequently encountered in wireless acoustic sensor networks, where each deployment has an ad-hoc microphone topology. We propose Neural-SRP, a DNN which combines the flexibility of SRP with the performance gains of DNNs. We train our network using simulated data and transfer learning, and evaluate our approach on both recorded and simulated data. The results verify that Neural-SRP significantly outperforms the baselines in localization performance.
Abstract: In the design of traffic monitoring solutions for optimizing the urban mobility infrastructure, acoustic vehicle counting models have received attention due to their cost-effectiveness and energy efficiency. Although deep learning has proven effective for visual traffic monitoring, its use has not been thoroughly investigated in the audio domain, likely due to the scarcity of real-world data. In this work, we propose a novel approach to acoustic vehicle counting by developing: i) a traffic noise simulation framework to synthesize realistic vehicle pass-by events; ii) a strategy to mix synthetic and real data to train a deep-learning model for traffic counting. The proposed system is capable of simultaneously counting cars and commercial vehicles driving on a two-lane road, and identifying their direction of travel under moderate traffic density conditions. With only 24 hours of labeled real-world traffic noise, we are able to improve counting accuracy on real-world data from $63\%$ to $88\%$ for cars and from $86\%$ to $94\%$ for commercial vehicles.
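One plausible realization of such a mixing strategy, sampling synthetic examples with a tunable per-batch fraction, is sketched below; the paper's exact mixing scheme may differ.

```python
import random

def mixed_batches(real, synthetic, batch_size, synth_frac=0.5):
    """Yield training batches that mix real and synthetic pass-by examples.

    `synth_frac` controls the expected share of synthetic data per batch.
    This is one plausible mixing strategy, not necessarily the paper's.
    """
    while True:
        batch = []
        for _ in range(batch_size):
            pool = synthetic if random.random() < synth_frac else real
            batch.append(random.choice(pool))
        yield batch
```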
Abstract: Reverberation can severely degrade the quality of speech signals recorded using microphones in an enclosure. In acoustic sensor networks with spatially distributed microphones, a similar dereverberation performance may be achieved using only a subset of all available microphones. In this paper, we propose to perform microphone subset selection for the weighted prediction error (WPE) multi-channel dereverberation algorithm using the popular convex relaxation method, introducing a group sparsity penalty on the prediction filter coefficients. We show that the resulting problem can be solved efficiently using the accelerated proximal gradient algorithm. Experimental evaluation using measured impulse responses shows that the performance of the proposed method is close to the optimal performance obtained by exhaustive search, both for frequency-dependent and frequency-independent microphone subset selection. Furthermore, for frequency-independent microphone subset selection, the performance using only a few microphones is only marginally worse than using all available microphones.
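The group-sparsity machinery can be sketched as follows: the proximal operator of the group penalty is block soft-thresholding (shrinking a whole group to zero deselects the corresponding microphone), embedded in a FISTA loop whose smooth gradient (the WPE data-fidelity term) is left generic; shapes and names are assumptions.

```python
import numpy as np

def prox_group_l2(W, tau):
    """Block soft-thresholding: proximal operator of tau * sum_m ||W_m||_2,
    where row m of W collects the prediction-filter coefficients associated
    with microphone m. Zeroing a whole row deselects that microphone.
    """
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / (norms + 1e-12), 0.0)
    return scale * W

def fista(grad_f, lipschitz, W0, tau, n_iter=200):
    """Accelerated proximal gradient (FISTA) for
    min_W f(W) + tau * sum_m ||W_m||_2, where grad_f is the gradient of
    the smooth WPE data-fidelity term (left generic in this sketch) and
    `lipschitz` is a Lipschitz constant of grad_f."""
    W, Z, t = W0.copy(), W0.copy(), 1.0
    for _ in range(n_iter):
        W_new = prox_group_l2(Z - grad_f(Z) / lipschitz, tau / lipschitz)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Z = W_new + ((t - 1.0) / t_new) * (W_new - W)  # momentum step
        W, t = W_new, t_new
    return W
```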