Abstract: Multi-speaker localization and tracking using microphone array recordings is important in a wide range of applications. One of the challenges in multi-speaker tracking is associating direction estimates with the correct speaker. Most existing association approaches rely on spatial or spectral information alone, leading to performance degradation when one of these information channels is only partially known or missing. This paper studies a joint probabilistic data association (JPDA)-based method that facilitates association based on joint spatial-spectral information. This is achieved by integrating speaker time-frequency (TF) masks, estimated from spectral information, into the calculation of the association probabilities. An experimental study that tested the proposed method on recordings from the LOCATA challenge demonstrates the enhanced performance obtained by using joint spatial-spectral information in the association.
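As a rough illustration of this idea (not the paper's full JPDA formulation, which enumerates joint association events), the sketch below combines a spatial likelihood with TF-mask-derived spectral weights into per-estimate association probabilities; the mask weights and the von Mises concentration parameter are assumed inputs.

```python
# Illustrative sketch: joint spatial-spectral association weights.
# Simplified per-estimate normalization, not full JPDA hypothesis enumeration.
import numpy as np

def association_probabilities(doa_estimates, track_predictions, mask_weights, kappa=10.0):
    """doa_estimates: (E,) DOA estimates [rad] in the current frame.
    track_predictions: (T,) predicted DOAs [rad], one per tracked speaker.
    mask_weights: (E, T) spectral evidence, e.g. TF-mask energy linking
                  estimate e to speaker t (assumed given).
    Returns an (E, T) matrix of association probabilities."""
    diff = doa_estimates[:, None] - track_predictions[None, :]
    spatial = np.exp(kappa * np.cos(diff))           # von Mises-like spatial likelihood
    joint = spatial * mask_weights                   # joint spatial-spectral evidence
    return joint / joint.sum(axis=1, keepdims=True)  # normalize per estimate
```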
Abstract: Direction-of-arrival (DOA) estimation of multiple speakers in a room is an important task for a wide range of applications. In particular, challenging environments with moving speakers, reverberation, and noise lead to significant performance degradation for current methods. With the aim of better understanding the factors affecting performance and improving current methods, this paper investigates multi-speaker DOA estimation using a modified version of the local space domain distance (LSDD) algorithm in a noisy, dynamic, and reverberant environment, employing a wearable microphone array. The study uses the recently published EasyCom speech dataset, recorded with a wearable microphone array mounted on eyeglasses. While the original LSDD algorithm demonstrates strong performance in static environments, its efficacy diminishes significantly in the dynamic settings of the EasyCom dataset. Several enhancements to the LSDD algorithm are developed following a comprehensive performance and system analysis, enabling improved DOA estimation under these challenging conditions. The improvements include a weighted reliability approach and a new quality measure that reliably identifies the more accurate DOA estimates, thereby enhancing both the robustness and accuracy of the algorithm in challenging environments.
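The weighted-reliability idea can be sketched generically as follows (the LSDD-specific quality measure from the paper is not reproduced here): per-frame DOA estimates are accumulated into a direction histogram weighted by an assumed quality score, and peaks are picked from the weighted histogram.

```python
# Generic sketch of quality-weighted DOA aggregation (illustrative only).
import numpy as np

def weighted_doa_histogram(frame_doas_deg, quality, grid_deg=np.arange(0, 360, 2)):
    """frame_doas_deg: (N,) per-frame azimuth estimates in degrees.
    quality: (N,) non-negative reliability scores (assumed given).
    Returns the direction grid and the quality-weighted histogram over it."""
    hist = np.zeros_like(grid_deg, dtype=float)
    wrapped = ((frame_doas_deg[:, None] - grid_deg[None, :]) + 180) % 360 - 180
    idx = np.argmin(np.abs(wrapped), axis=1)  # nearest grid cell per frame estimate
    np.add.at(hist, idx, quality)             # accumulate reliability, not raw counts
    return grid_deg, hist

# Example: grid, hist = weighted_doa_histogram(doas, q); est = grid[np.argmax(hist)]
```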
Abstract: The increasing popularity of spatial audio in applications such as teleconferencing, entertainment, and virtual reality has led to recent developments in binaural reproduction methods. However, only a few of these methods are well-suited for wearable and mobile arrays, which typically consist of a small number of microphones. One such method is binaural signal matching (BSM), which has been shown to produce high-quality binaural signals for wearable arrays. However, BSM may be suboptimal in cases of a high direct-to-reverberant ratio (DRR), as it is based on a diffuse sound-field assumption. To overcome this limitation, previous studies incorporated sound-field models other than the diffuse one; however, this approach has not been studied comprehensively. This paper extensively investigates two BSM-based methods designed for high-DRR scenarios. The methods incorporate a sound-field model composed of direct and reverberant components. They are investigated both mathematically and using simulations, and are finally validated by a listening test. The results show that the proposed methods can significantly improve the performance of BSM, in particular in the direction of the source, while presenting only negligible degradation in other directions. Furthermore, when the source-direction estimate is inaccurate, the performance of these methods degrades to that of standard BSM, demonstrating a desirable robustness.
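A minimal sketch of a BSM-style filter design that replaces the purely diffuse assumption with a direct-plus-reverberant source covariance is given below; the direct-direction index, the relative direct power, and the noise level are assumed inputs, and the exact formulation of the paper's two methods is not reproduced.

```python
# Sketch: BSM-like filters under a direct + diffuse sound-field model.
import numpy as np

def bsm_filters(V, h, direct_idx, direct_weight=0.8, noise_var=1e-3):
    """V: (M, Q) microphone steering matrix over Q candidate directions.
    h: (Q,) HRTF values (one ear, one frequency) over the same directions.
    direct_idx: index of the estimated direct-sound direction (assumed known).
    direct_weight: relative power assigned to the direct component (assumed).
    Returns (M,) complex filters c such that c^H x approximates the ear signal."""
    M, Q = V.shape
    # Source covariance: a single direct component plus an isotropic (diffuse) remainder
    R_s = (1.0 - direct_weight) / Q * np.eye(Q, dtype=complex)
    R_s[direct_idx, direct_idx] += direct_weight
    A = V @ R_s @ V.conj().T + noise_var * np.eye(M)   # microphone covariance
    b = V @ R_s @ h                                    # cross-term with the ear signal
    return np.linalg.solve(A, b)
```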
Abstract: This study investigates direction-dependent selection of head-related transfer functions (HRTFs) and its impact on sound localization accuracy. For applications such as virtual reality (VR) and teleconferencing, obtaining individualized HRTFs can be beneficial yet challenging; the objective of this work is therefore to assess whether selecting HRTFs in a direction-dependent manner can improve localization accuracy without the need for individualized HRTFs. A localization experiment conducted with a VR headset assessed localization errors, comparing the overall best HRTF from a set against the HRTF selected per direction based on average performance in that direction. The results demonstrate a substantial improvement in elevation localization error with the direction-dependent selection approach, while revealing insignificant differences in azimuth errors.
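The two selection strategies compared here can be expressed compactly; the sketch below assumes a precomputed matrix of mean localization errors per HRTF and direction, which is not part of the paper's data.

```python
# Sketch: overall-best vs. direction-dependent HRTF selection from an error matrix.
import numpy as np

def select_hrtfs(err):
    """err: (n_hrtfs, n_directions) mean localization error per HRTF and direction."""
    overall_best = int(np.argmin(err.mean(axis=1)))   # one HRTF used for all directions
    per_direction_best = np.argmin(err, axis=0)       # one HRTF chosen per direction
    return overall_best, per_direction_best
```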
Abstract: Binaural reproduction is rapidly becoming a topic of great interest in the research community, especially with the surge of new and popular devices such as virtual reality headsets, smart glasses, and head-tracked headphones. To immerse the listener in a virtual or remote environment with such devices, it is essential to generate realistic and accurate binaural signals. This is challenging, especially since the microphone arrays mounted on these devices are typically composed of a small number of arbitrarily arranged microphones, which impedes the use of standard audio formats such as Ambisonics and provides limited spatial resolution. The binaural signal matching (BSM) method was recently developed to overcome these challenges. While it produced binaural signals with low error using relatively simple arrays, its performance degraded significantly when head rotation was introduced. This paper aims to develop the BSM method further and overcome its limitations. For this purpose, the method is first analyzed in detail, and a design framework that guarantees accurate binaural reproduction for relatively complex acoustic environments is presented. Next, it is shown that BSM accuracy may degrade significantly at high frequencies, and thus a perceptually motivated extension to the method is proposed, based on a magnitude least-squares (MagLS) formulation. These insights and developments are then analyzed with the help of an extensive simulation study of a simple six-microphone semi-circular array. It is further shown that the BSM-MagLS method can be very useful in compensating for head rotations with this array. Finally, a listening experiment is conducted with a four-microphone array mounted on a pair of glasses, in a reverberant speech environment that includes head rotations, where it is shown that BSM-MagLS can indeed produce binaural signals of high perceived quality.
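The magnitude least-squares idea at high frequencies can be sketched as an alternating procedure in which only the magnitude of the reproduced response is matched and the phase is treated as a free variable; the sketch below is an assumed, single-frequency, single-ear illustration rather than the paper's exact BSM-MagLS design.

```python
# Sketch: MagLS-style filter design for a single frequency bin (illustrative).
import numpy as np

def bsm_magls(V, h, n_iter=20, reg=1e-3):
    """V: (M, Q) steering matrix; h: (Q,) HRTF at one frequency (one ear).
    Returns (M,) filters whose response magnitudes |V^H c| approximate |h|."""
    M, Q = V.shape
    A = V.conj().T                               # (Q, M): filters -> per-direction responses
    AtA = A.conj().T @ A + reg * np.eye(M)       # regularized normal equations
    phase = np.angle(h)                          # initialize the free phase from the HRTF
    for _ in range(n_iter):
        target = np.abs(h) * np.exp(1j * phase)  # magnitude target with current phase
        c = np.linalg.solve(AtA, A.conj().T @ target)
        phase = np.angle(A @ c)                  # keep the phase the array can realize
    return c
```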
Abstract: Binaural reproduction for headphone-centric listening has become a focal point of ongoing research, particularly within the realm of advancing technologies such as augmented and virtual reality (AR and VR). High-quality spatial audio is essential in these applications to uphold a seamless sense of immersion. However, challenges arise from wearable recording devices equipped with only a limited number of microphones and with irregular microphone placements due to design constraints. These factors limit reproduction quality compared to reference signals captured by high-order microphone arrays. This paper introduces a novel optimization loss tailored to a beamforming-based, signal-independent binaural reproduction scheme. The method, named iMagLS-BSM, incorporates an interaural level difference (ILD) error term into the previously proposed binaural signal matching (BSM) magnitude least-squares (MagLS) rendering loss for lateral plane angles. The method leverages nonlinear programming to minimize the introduced loss. Preliminary results show a substantial reduction in ILD error while maintaining a binaural magnitude error comparable to that achieved with a MagLS-BSM solution. These findings hold promise for enhancing the overall spatial quality of the resulting binaural signals.
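An illustrative sketch of such a composite loss, combining a magnitude-matching term with an ILD term and minimized numerically, is shown below; the weighting, parameterization, and reference ILD are assumptions, not the paper's exact iMagLS-BSM loss.

```python
# Sketch: magnitude-matching loss with an added ILD term, minimized numerically.
import numpy as np
from scipy.optimize import minimize

def unpack(x, M):
    """Real optimization vector (4*M,) -> complex left/right filters."""
    cl = x[0:M] + 1j * x[M:2 * M]
    cr = x[2 * M:3 * M] + 1j * x[3 * M:4 * M]
    return cl, cr

def loss(x, V, hl, hr, ild_ref_db, lam=1.0, eps=1e-9):
    """V: (M, Q) steering matrix; hl, hr: (Q,) HRTFs; ild_ref_db: (Q,) reference ILD."""
    M = V.shape[0]
    cl, cr = unpack(x, M)
    yl, yr = V.conj().T @ cl, V.conj().T @ cr       # reproduced responses per direction
    mag_err = np.sum((np.abs(yl) - np.abs(hl)) ** 2 + (np.abs(yr) - np.abs(hr)) ** 2)
    ild = 20 * np.log10((np.abs(yl) + eps) / (np.abs(yr) + eps))
    ild_err = np.sum((ild - ild_ref_db) ** 2)
    return mag_err + lam * ild_err

# Usage (shapes assumed): x0 = np.zeros(4 * M)
# res = minimize(loss, x0, args=(V, hl, hr, ild_ref), method="L-BFGS-B")
```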
Abstract: High-fidelity spatial audio often performs better when produced using a personalized head-related transfer function (HRTF). However, direct acquisition of HRTFs is cumbersome and requires specialized equipment. Thus, many personalization methods estimate HRTF features from easily obtained anthropometric features of the pinna, head, and torso. The first HRTF notch frequency (N1) is known to be a dominant cue in elevation localization, and is thus a useful feature for HRTF personalization. This paper describes the prediction of the N1 frequency from pinna anthropometry using a neural model. Prediction is performed separately on three databases, comprising both simulated and measured data, and then with domain mixing between the databases. The model successfully predicts the N1 frequency for individual databases and with domain mixing between some of the databases. Prediction errors are better than or comparable to those previously reported, showing significant improvement when obtained over a large database and with a larger output range.
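A minimal sketch of regressing N1 from pinna measurements is shown below; the architecture, features, and hyperparameters are placeholders and not the model described in the paper.

```python
# Sketch: small neural regressor mapping pinna anthropometry to N1 frequency.
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_n1_model(X, y):
    """X: (n_subjects, n_features) pinna measurements; y: (n_subjects,) N1 frequency.
    Both are assumed to be provided by one of the databases."""
    model = make_pipeline(
        StandardScaler(),                                  # normalize anthropometric features
        MLPRegressor(hidden_layer_sizes=(64, 32),          # placeholder architecture
                     max_iter=2000, random_state=0),
    )
    return model.fit(X, y)

# Usage: n1_hat = fit_n1_model(X_train, y_train).predict(X_test)
```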
Abstract: Ambisonics, a popular spatial audio format, is the spherical harmonic (SH) representation of the plane-wave density function of a sound field. Many algorithms operate in the SH domain and use Ambisonics as their input signal. Encoding Ambisonics from a spherical microphone array involves division by the radial functions, which may amplify noise at low frequencies. This can be overcome by regularization, at the cost of introducing errors into the Ambisonics encoding. This paper investigates the impact of different regularization approaches on deep neural network (DNN) training and performance. Ideally, such networks should be robust to the choice of regularization. Simulated data of a single speaker in a room and experimental data from the LOCATA challenge were used to evaluate this robustness on an example algorithm for speaker localization based on the direct-path dominance (DPD) test. The results show that performance may be sensitive to the choice of regularization, and an informed approach is proposed and investigated, highlighting the importance of regularization information.
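As a worked illustration of the radial-function division and one common regularization, the sketch below uses a Tikhonov-style inverse that remains bounded at low kr, where the radial functions are small; the open-sphere radial function and the regularization constant are illustrative choices, not the specific approaches compared in the paper.

```python
# Sketch: regularized inversion of spherical-array radial functions.
import numpy as np
from scipy.special import spherical_jn

def open_sphere_bn(n, kr):
    """Order-n radial function for an open-sphere array (a rigid sphere adds a
    scattering term involving spherical Hankel functions)."""
    return 4 * np.pi * (1j ** n) * spherical_jn(n, kr)

def regularized_inverse(b, lam=1e-2):
    """Tikhonov-style inverse: approaches 1/b where |b| is large, stays bounded
    where |b| is small (low frequencies / high orders)."""
    return np.conj(b) / (np.abs(b) ** 2 + lam ** 2)

# Usage: a_nm ~ regularized_inverse(open_sphere_bn(n, k * r)) * p_nm
# where p_nm are the SH-domain microphone signals.
```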
Abstract: In the rapidly evolving fields of virtual and augmented reality, accurate spatial audio capture and reproduction are essential. For these applications, Ambisonics has emerged as a standard format. However, existing methods for encoding Ambisonics signals from arbitrary microphone arrays face challenges, such as errors due to irregular array configurations and limited spatial resolution resulting from the typically small number of microphones. To address these limitations, a mathematical framework for studying Ambisonics encoding is presented, highlighting the importance of incorporating the full steering function and providing a novel measure for predicting the accuracy of encoding each Ambisonics channel from the steering functions alone. Furthermore, novel residual channels are formulated to supplement the Ambisonics channels. A simulation study over several array configurations demonstrates a reduction in binaural error for this approach.
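For context, a generic least-squares Ambisonics encoding from an arbitrary array steering matrix can be sketched as below; the paper's per-channel accuracy measure and residual channels are not reproduced here, and the regularization is an assumption.

```python
# Sketch: regularized least-squares Ambisonics encoding matrix for an arbitrary array.
import numpy as np

def ls_encoding_matrix(V, Y, reg=1e-3):
    """V: (M, Q) steering functions sampled on a dense direction grid.
    Y: (Q, L) spherical-harmonic matrix on the same grid, L = (N+1)**2.
    Returns E of shape (L, M) such that E @ v(theta) ~ y(theta) on the grid,
    i.e. the Ambisonics signals are approximated by a_nm ~ E @ p."""
    M = V.shape[0]
    G = V @ V.conj().T + reg * np.eye(M)     # regularized Gram matrix of the steering functions
    return Y.T @ V.conj().T @ np.linalg.inv(G)
```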
Abstract: Spatial analysis of room acoustics is an ongoing research topic. Microphone arrays have been employed for such analyses, with an important objective being the estimation of the direction-of-arrival (DOA) of the direct sound and early room reflections from room impulse responses (RIRs). A prominent method for DOA estimation is the multiple signal classification (MUSIC) algorithm. When RIRs are considered, this method typically fails due to the correlation of room reflections, which leads to rank deficiency of the cross-spectrum matrix. Preprocessing methods for rank restoration, which may involve averaging over frequency, for example, have been proposed exclusively for spherical arrays. However, these methods fail in the case of reflections with equal time delays, which may arise in practice and can be of interest. In this paper, a method is proposed for systems that combine a spherical microphone array and a spherical loudspeaker array, referred to as multiple-input multiple-output (MIMO) systems. The method, referred to as modal smoothing, exploits the additional spatial diversity for rank restoration and succeeds where previous methods fail, as demonstrated in a simulation study. Finally, combining modal smoothing with a preprocessing method is proposed in order to increase the number of DOAs that can be estimated using low-order spherical loudspeaker arrays.
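The rank-restoration idea can be illustrated conceptually as follows: the MUSIC pseudo-spectrum is computed from a cross-spectral matrix averaged over several independent excitation channels, a stand-in here for the spatial diversity offered by a spherical loudspeaker array. This sketch is not the modal-smoothing derivation of the paper.

```python
# Conceptual sketch: MUSIC with a cross-spectral matrix averaged over excitation channels.
import numpy as np

def music_spectrum(X, steering, n_src):
    """X: (L, M, K) snapshots for L excitation channels, M microphones, K bins.
    steering: (M, Q) candidate steering vectors. Returns (Q,) MUSIC pseudo-spectrum."""
    M = X.shape[1]
    R = np.zeros((M, M), dtype=complex)
    for x in X:                              # average cross-spectral matrices over channels
        R += (x @ x.conj().T) / x.shape[1]
    R /= X.shape[0]
    w, U = np.linalg.eigh(R)                 # eigenvalues in ascending order
    Un = U[:, : M - n_src]                   # noise subspace from the smallest eigenvalues
    proj = np.sum(np.abs(Un.conj().T @ steering) ** 2, axis=0)
    return 1.0 / np.maximum(proj, 1e-12)     # peaks indicate candidate DOAs
```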