Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Weipeng He

Idiap Research Institute

Directional Source Separation for Robust Speech Recognition on Smart Glasses

Sep 20, 2023

Tiantian Feng, Ju Lin, Yiteng Huang, Weipeng He, Kaustubh Kalgaonkar, Niko Moritz, Li Wan, Xin Lei, Ming Sun, Frank Seide

Figure 1 for Directional Source Separation for Robust Speech Recognition on Smart Glasses

Figure 2 for Directional Source Separation for Robust Speech Recognition on Smart Glasses

Figure 3 for Directional Source Separation for Robust Speech Recognition on Smart Glasses

Figure 4 for Directional Source Separation for Robust Speech Recognition on Smart Glasses

Abstract:Modern smart glasses leverage advanced audio sensing and machine learning technologies to offer real-time transcribing and captioning services, considerably enriching human experiences in daily communications. However, such systems frequently encounter challenges related to environmental noises, resulting in degradation to speech recognition and speaker change detection. To improve voice quality, this work investigates directional source separation using the multi-microphone array. We first explore multiple beamformers to assist source separation modeling by strengthening the directional properties of speech signals. In addition to relying on predetermined beamformers, we investigate neural beamforming in multi-channel source separation, demonstrating that automatic learning directional characteristics effectively improves separation quality. We further compare the ASR performance leveraging separated outputs to noisy inputs. Our results show that directional source separation benefits ASR for the wearer but not for the conversation partner. Lastly, we perform the joint training of the directional source separation and ASR model, achieving the best overall ASR performance.

* Submitted to ICASSP 2024

Via

Access Paper or Ask Questions

Egocentric Audio-Visual Noise Suppression

Nov 07, 2022

Roshan Sharma, Weipeng He, Ju Lin, Egor Lakomkin, Yang Liu, Kaustubh Kalgaonkar

Figure 1 for Egocentric Audio-Visual Noise Suppression

Figure 2 for Egocentric Audio-Visual Noise Suppression

Figure 3 for Egocentric Audio-Visual Noise Suppression

Figure 4 for Egocentric Audio-Visual Noise Suppression

Abstract:This paper studies audio-visual suppression for egocentric videos -- where the speaker is not captured in the video. Instead, potential noise sources are visible on screen with the camera emulating the off-screen speaker's view of the outside world. This setting is different from prior work in audio-visual speech enhancement that relies on lip and facial visuals. In this paper, we first demonstrate that egocentric visual information is helpful for noise suppression. We compare object recognition and action classification based visual feature extractors, and investigate methods to align audio and visual representations. Then, we examine different fusion strategies for the aligned features, and locations within the noise suppression model to incorporate visual information. Experiments demonstrate that visual features are most helpful when used to generate additive correction masks. Finally, in order to ensure that the visual features are discriminative with respect to different noise types, we introduce a multi-task learning framework that jointly optimizes audio-visual noise suppression and video based acoustic event detection. This proposed multi-task framework outperforms the audio only baseline on all metrics, including a 0.16 PESQ improvement. Extensive ablations reveal the improved performance of the proposed model with multiple active distractors, over all noise types and across different SNRs.

* Under Review at ICASSP 2023

Via

Access Paper or Ask Questions

MuMMER: Socially Intelligent Human-Robot Interaction in Public Spaces

Sep 15, 2019

Mary Ellen Foster, Bart Craenen, Amol Deshmukh, Oliver Lemon, Emanuele Bastianelli, Christian Dondrup, Ioannis Papaioannou, Andrea Vanzo, Jean-Marc Odobez, Olivier Canévet(+20 more)

Figure 1 for MuMMER: Socially Intelligent Human-Robot Interaction in Public Spaces

Figure 2 for MuMMER: Socially Intelligent Human-Robot Interaction in Public Spaces

Figure 3 for MuMMER: Socially Intelligent Human-Robot Interaction in Public Spaces

Figure 4 for MuMMER: Socially Intelligent Human-Robot Interaction in Public Spaces

Abstract:In the EU-funded MuMMER project, we have developed a social robot designed to interact naturally and flexibly with users in public spaces such as a shopping mall. We present the latest version of the robot system developed during the project. This system encompasses audio-visual sensing, social signal processing, conversational interaction, perspective taking, geometric reasoning, and motion planning. It successfully combines all these components in an overarching framework using the Robot Operating System (ROS) and has been deployed to a shopping mall in Finland interacting with customers. In this paper, we describe the system components, their interplay, and the resulting robot behaviours and scenarios provided at the shopping mall.

Via

Access Paper or Ask Questions

Deep Neural Networks for Multiple Speaker Detection and Localization

Feb 26, 2018

Weipeng He, Petr Motlicek, Jean-Marc Odobez

Figure 1 for Deep Neural Networks for Multiple Speaker Detection and Localization

Figure 2 for Deep Neural Networks for Multiple Speaker Detection and Localization

Figure 3 for Deep Neural Networks for Multiple Speaker Detection and Localization

Figure 4 for Deep Neural Networks for Multiple Speaker Detection and Localization

Abstract:We propose to use neural networks for simultaneous detection and localization of multiple sound sources in human-robot interaction. In contrast to conventional signal processing techniques, neural network-based sound source localization methods require fewer strong assumptions about the environment. Previous neural network-based methods have been focusing on localizing a single sound source, which do not extend to multiple sources in terms of detection and localization. In this paper, we thus propose a likelihood-based encoding of the network output, which naturally allows the detection of an arbitrary number of sources. In addition, we investigate the use of sub-band cross-correlation information as features for better localization in sound mixtures, as well as three different network architectures based on different motivations. Experiments on real data recorded from a robot show that our proposed methods significantly outperform the popular spatial spectrum-based approaches.

* 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 2018, pp. 74-79
* Accepted for ICRA 2018

Via

Access Paper or Ask Questions