Abstract:The "cocktail party" problem of fully separating multiple sources from a single channel audio waveform remains unsolved. Current biological understanding of neural encoding suggests that phase information is preserved and utilized at every stage of the auditory pathway. However, current computational approaches primarily discard phase information in order to mask amplitude spectrograms of sound. In this paper, we seek to address whether preserving phase information in spectral representations of sound provides better results in monaural separation of vocals from a musical track by using a neurally plausible sparse generative model. Our results demonstrate that preserving phase information reduces artifacts in the separated tracks, as quantified by the signal to artifact ratio (GSAR). Furthermore, our proposed method achieves state-of-the-art performance for source separation, as quantified by a mean signal to interference ratio (GSIR) of 19.46.
Abstract:Neural networks are analogous in many ways to spin glasses, systems which are known for their rich set of dynamics and equally complex phase diagrams. We apply well-known techniques in the study of spin glasses to a convolutional sparsely encoding neural network and observe power law finite-size scaling behavior in the sparsity and reconstruction error as the network denoises 32$\times$32 RGB CIFAR-10 images. This finite-size scaling indicates the presence of a continuous phase transition at a critical value of this sparsity. By using the power law scaling relations inherent to finite-size scaling, we can determine the optimal value of sparsity for any network size by tuning the system to the critical point and operate the system at the minimum denoising error.