Abstract: We investigate the benefit of combining blind audio recordings with 3D scene information for novel-view acoustic synthesis. Given audio recordings from 2-4 microphones and the 3D geometry and material of a scene containing multiple unknown sound sources, we estimate the sound anywhere in the scene. We identify the main challenges of novel-view acoustic synthesis as sound source localization, separation, and dereverberation. While naively training an end-to-end network fails to produce high-quality results, we show that incorporating room impulse responses (RIRs) derived from 3D reconstructed rooms enables the same network to jointly tackle these tasks. Our method outperforms existing methods designed for the individual tasks, demonstrating its effectiveness in utilizing 3D visual information. In a simulated study on the Matterport3D-NVAS dataset, our model achieves near-perfect accuracy on source localization, a PSNR of 26.44 dB and an SDR of 14.23 dB for source separation and dereverberation, resulting in a PSNR of 25.55 dB and an SDR of 14.20 dB on novel-view acoustic synthesis. Code, pretrained model, and video results are available on the project webpage (https://github.com/apple/ml-nvas3d).
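Once sources have been localized, separated, and dereverberated, rendering audio at a novel listener position reduces to convolving each dry source signal with the RIR from that source to the listener and summing. The sketch below illustrates this final rendering step only; the function name and the assumption that dry signals and RIRs are given as plain numpy arrays are mine, not part of the paper's published interface.

```python
import numpy as np

def render_at_listener(dry_sources, rirs):
    """Render the sound at a novel listener position.

    dry_sources: list of 1-D arrays, the separated/dereverberated signals.
    rirs: list of 1-D arrays, the RIR from each source to the listener
          (e.g. simulated from the reconstructed 3D room geometry).
    Returns the mixture heard at the listener.
    """
    # Length of the longest source-RIR convolution.
    n = max(len(s) + len(h) - 1 for s, h in zip(dry_sources, rirs))
    out = np.zeros(n)
    for s, h in zip(dry_sources, rirs):
        y = np.convolve(s, h)  # reverberant contribution of this source
        out[:len(y)] += y
    return out
```

For example, an impulse source convolved with a one-tap RIR passes through unchanged, so two such sources simply superpose at the listener.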
Abstract: Air absorption is an important effect to consider when simulating room acoustics, as it leads to significant attenuation at high frequencies. In this study, an offline method for adding air absorption to simulated room impulse responses is devised. The proposed method is based on a modal scheme for a system of one-dimensional dissipative wave equations, which can be used to post-process a room impulse response simulated without air absorption, thereby incorporating the missing frequency-dependent, distance-based air attenuation. Numerical examples are presented to evaluate the proposed method, along with comparisons to existing filter-based methods.
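The filter-based baselines mentioned above work by noting that a sample at time t in an RIR corresponds to a propagation distance d = c·t, so each short-time frame can be attenuated per frequency bin by the air-absorption factor for that distance. The sketch below shows such a filter-based post-processing step (not the paper's modal scheme); the simplified quadratic-in-frequency absorption coefficient is my assumption, whereas practical implementations compute it from temperature and humidity per ISO 9613-1.

```python
import numpy as np

def add_air_absorption(rir, fs, c=343.0, frame=256, hop=128):
    """Post-process an RIR simulated without air absorption by applying
    frequency-dependent, distance-based attenuation frame by frame (STFT
    overlap-add). Illustrative filter-based approach only."""
    window = np.hanning(frame)
    freqs = np.fft.rfftfreq(frame, 1.0 / fs)
    # Simplified absorption coefficient in dB per metre (assumption:
    # roughly quadratic in frequency; see ISO 9613-1 for the real model).
    alpha_db = 1e-9 * freqs ** 2
    out = np.zeros(len(rir))
    norm = np.zeros(len(rir))
    for start in range(0, len(rir) - frame, hop):
        seg = rir[start:start + frame] * window
        dist = c * (start + frame / 2) / fs       # distance travelled by this frame
        gain = 10.0 ** (-alpha_db * dist / 20.0)  # per-bin air attenuation
        spec = np.fft.rfft(seg) * gain
        out[start:start + frame] += np.fft.irfft(spec, frame) * window
        norm[start:start + frame] += window ** 2
    return out / np.maximum(norm, 1e-8)  # window-squared OLA normalization
```

Because the attenuation grows with both frequency and distance, the late, high-frequency tail of the RIR is damped most, which is exactly the physical behaviour the offline method aims to restore.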