Abstract: With the goal of more natural and human-like interaction with virtual voice assistants, recent research in the field has focused on a full-duplex interaction mode that does not rely on repeated wake-up words. In scenes with complex sound sources, this requires the voice assistant to classify utterances as device-oriented or non-device-oriented. The dual-encoder structure, which jointly models text and speech, has become the paradigm for device-directed speech detection. However, in practice, these models often produce incorrect predictions for unaligned input pairs due to unavoidable automatic speech recognition (ASR) errors. To address this challenge, we propose M$^{3}$V, a multi-modal multi-view approach for device-directed speech detection, which frames the problem as a multi-view learning task that introduces unimodal views and a text-audio alignment view into the network in addition to the multi-modal view. Experimental results show that M$^{3}$V significantly outperforms models trained with only single or multi-modality and, for the first time, surpasses human judgment performance on ASR-error data.
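A minimal sketch of the multi-view idea described above, assuming a dual-encoder setup with hypothetical module names (this is not the authors' code): alongside the fused multi-modal head, separate unimodal heads and a text-audio alignment head give the model views that remain informative when the ASR transcription and the audio disagree.

```python
# Sketch of a multi-view device-directed speech detector (assumed interfaces).
import torch
import torch.nn as nn

class MultiViewDetector(nn.Module):
    def __init__(self, text_dim=768, audio_dim=512, hidden=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        # One binary head per view: text-only, audio-only, fused, alignment.
        self.text_head = nn.Linear(hidden, 2)
        self.audio_head = nn.Linear(hidden, 2)
        self.fused_head = nn.Linear(2 * hidden, 2)
        self.align_head = nn.Linear(1, 2)  # scores text-audio similarity

    def forward(self, text_emb, audio_emb):
        t = torch.relu(self.text_proj(text_emb))
        a = torch.relu(self.audio_proj(audio_emb))
        sim = torch.cosine_similarity(t, a, dim=-1).unsqueeze(-1)
        return {
            "text": self.text_head(t),
            "audio": self.audio_head(a),
            "fused": self.fused_head(torch.cat([t, a], dim=-1)),
            "align": self.align_head(sim),
        }

# Training would sum a cross-entropy loss over all four views so that the
# unimodal and alignment views keep contributing when ASR output is noisy.
```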
Abstract: Analyzing real-world multimodal signals is an essential and challenging task for intelligent voice assistants (IVAs). Mainstream approaches have achieved remarkable performance on various downstream tasks of IVAs with pre-trained audio and text models. However, these models are pre-trained independently, usually on tasks different from the target domains, resulting in sub-optimal modality representations for downstream tasks. Moreover, in many domains, collecting enough language-audio pairs is extremely hard, and transcribing raw audio requires high professional skill, making joint pre-training difficult or even infeasible. To address these pain points, we propose DSCLAP, a simple and effective framework that enables language-audio pre-training with only raw audio signals as input. Specifically, DSCLAP converts raw audio signals into text via an ASR system and combines a contrastive learning objective and a language-audio matching objective to align the audio with its ASR transcriptions. We pre-train DSCLAP on 12,107 hours of in-vehicle domain audio. Empirical results on two downstream tasks show that, while conceptually simple, DSCLAP significantly outperforms the baseline models on all metrics, showing great promise for domain-specific IVA applications.
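The two pre-training objectives can be sketched as follows, under assumed encoder outputs and without claiming to match the released DSCLAP implementation: a symmetric contrastive loss over audio and ASR-text embeddings, plus a binary language-audio matching loss on fused pairs.

```python
# Illustrative sketch of contrastive + language-audio matching objectives.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # audio_emb, text_emb: (batch, dim) projections from the two encoders.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE: audio-to-text and text-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def matching_loss(match_logits, labels):
    # match_logits: (batch, 2) from a head over fused audio-text features;
    # labels indicate whether the audio and ASR transcription are a true pair.
    return F.cross_entropy(match_logits, labels)
```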
Abstract: Contrastive learning has become one of the most impressive approaches to multi-modal representation learning. However, previous multi-modal works have mainly focused on cross-modal understanding and ignored in-modal contrastive learning, which limits the representation of each modality. In this paper, we propose a novel contrastive learning strategy, called $Turbo$, that promotes multi-modal understanding through joint in-modal and cross-modal contrastive learning. Specifically, each multi-modal data pair is sent through the forward pass twice with different hidden dropout masks to obtain two different representations per modality. With these representations, we construct multiple in-modal and cross-modal contrastive objectives for training. Finally, we combine the self-supervised Turbo with supervised multi-modal classification and demonstrate its effectiveness on two audio-text classification tasks, achieving state-of-the-art performance on a speech emotion recognition benchmark dataset.
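A rough sketch of the double-forward-pass idea under assumed encoder interfaces (not the paper's code): each audio-text pair is encoded twice with dropout active, giving two views per modality that feed both in-modal and cross-modal contrastive objectives.

```python
# Sketch of Turbo-style in-modal + cross-modal contrastive objectives.
import torch
import torch.nn.functional as F

def info_nce(x, y, temperature=0.05):
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature
    targets = torch.arange(x.size(0), device=x.device)
    return F.cross_entropy(logits, targets)

def turbo_loss(audio_encoder, text_encoder, audio, text):
    audio_encoder.train(); text_encoder.train()      # keep dropout active
    a1, a2 = audio_encoder(audio), audio_encoder(audio)  # two dropout masks
    t1, t2 = text_encoder(text), text_encoder(text)
    in_modal = info_nce(a1, a2) + info_nce(t1, t2)
    cross_modal = info_nce(a1, t1) + info_nce(a2, t2)
    return in_modal + cross_modal
```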
Abstract: Imaging through scattering media is a challenging problem owing to speckle decorrelations caused by perturbations in the media itself. For in-line imaging modalities, which are appealing because they are compact, require no moving parts, and are robust, negating the effects of such scattering becomes particularly challenging. Here we explore the effect of stationary scattering media on light scattering in in-line geometries, including digital holographic microscopy. We consider various object-scatterer scenarios in which the object is distorted or obscured by additional stationary scatterers and use an advanced deep learning (DL) generative methodology, generative adversarial networks (GANs), to mitigate the effects of the additional scatterers. Using light-scattering simulations and experiments on objects of interest with and without additional scatterers, we find that conditional GANs can be trained quickly with minuscule datasets and can efficiently learn the one-to-one statistical mapping between cross-domain input-output image pairs. Training such a network yields a standalone model that can later be used to invert or negate the effect of scattering, yielding clear object reconstructions for object retrieval and downstream processing. Moreover, it is well known that the coherent point spread function (c-PSF) of a stationary scattering optical system is a speckle pattern that is spatially shift variant. We show that, with rapid training on only 20 image pairs, it is possible to negate this undesired scattering and accurately localize diffraction-limited impulses with high spatial accuracy, thereby transforming the earlier shift-variant system into a linear shift-invariant (LSI) system.
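A schematic conditional-GAN training step in the pix2pix style, with placeholder generator and discriminator networks that are assumptions rather than the paper's architecture: the generator maps a distorted or obscured image to a clean reconstruction, conditioned on the scattered input, while an L1 term keeps the output close to the ground-truth target.

```python
# Schematic conditional-GAN step for descattering (assumed G, D, optimizers).
import torch
import torch.nn.functional as F

def cgan_step(G, D, opt_G, opt_D, distorted, clean, l1_weight=100.0):
    # --- Discriminator: real (distorted, clean) pairs vs. generated pairs ---
    fake = G(distorted)
    d_real = D(torch.cat([distorted, clean], dim=1))
    d_fake = D(torch.cat([distorted, fake.detach()], dim=1))
    loss_D = 0.5 * (
        F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
        F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # --- Generator: fool D and stay close to the clean target (L1) ---
    d_fake = D(torch.cat([distorted, fake], dim=1))
    loss_G = (F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
              + l1_weight * F.l1_loss(fake, clean))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```

With only 20 image pairs, as the abstract reports, each epoch is a handful of such steps, which is why training can be rapid.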