Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jiadong Wang

$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction

Apr 01, 2025

Wenxuan Wu, Xueyuan Chen, Shuai Wang, Jiadong Wang, Lingwei Meng, Xixin Wu, Helen Meng, Haizhou Li

Abstract:Audio-Visual Target Speaker Extraction (AV-TSE) aims to mimic the human ability to enhance auditory perception using visual cues. Although numerous models have been proposed recently, most of them estimate target signals by primarily relying on local dependencies within acoustic features, underutilizing the human-like capacity to infer unclear parts of speech through contextual information. This limitation results in not only suboptimal performance but also inconsistent extraction quality across the utterance, with some segments exhibiting poor quality or inadequate suppression of interfering speakers. To close this gap, we propose a model-agnostic strategy called the Mask-And-Recover (MAR). It integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. Additionally, to better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model to assess extraction quality and guide extraction modules to emphasize improvement on low-quality segments. To validate the effectiveness of our proposed model-agnostic training paradigm, six popular AV-TSE backbones were adopted for evaluation on the VoxCeleb2 dataset, demonstrating consistent performance improvements across various metrics.

* Accepted by IEEE Journal of Selected Topics in Signal Processing (JSTSP)

Via

Access Paper or Ask Questions

Analytic Class Incremental Learning for Sound Source Localization with Privacy Protection

Sep 11, 2024

Xinyuan Qian, Xianghu Yue, Jiadong Wang, Huiping Zhuang, Haizhou Li

Figure 1 for Analytic Class Incremental Learning for Sound Source Localization with Privacy Protection

Figure 2 for Analytic Class Incremental Learning for Sound Source Localization with Privacy Protection

Figure 3 for Analytic Class Incremental Learning for Sound Source Localization with Privacy Protection

Figure 4 for Analytic Class Incremental Learning for Sound Source Localization with Privacy Protection

Abstract:Sound Source Localization (SSL) enabling technology for applications such as surveillance and robotics. While traditional Signal Processing (SP)-based SSL methods provide analytic solutions under specific signal and noise assumptions, recent Deep Learning (DL)-based methods have significantly outperformed them. However, their success depends on extensive training data and substantial computational resources. Moreover, they often rely on large-scale annotated spatial data and may struggle when adapting to evolving sound classes. To mitigate these challenges, we propose a novel Class Incremental Learning (CIL) approach, termed SSL-CIL, which avoids serious accuracy degradation due to catastrophic forgetting by incrementally updating the DL-based SSL model through a closed-form analytic solution. In particular, data privacy is ensured since the learning process does not revisit any historical data (exemplar-free), which is more suitable for smart home scenarios. Empirical results in the public SSLR dataset demonstrate the superior performance of our proposal, achieving a localization accuracy of 90.9%, surpassing other competitive methods.

Via

Access Paper or Ask Questions

Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing

Aug 29, 2024

Qianhui Liu, Jiadong Wang, Yang Wang, Xin Yang, Gang Pan, Haizhou Li

Figure 1 for Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing

Figure 2 for Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing

Figure 3 for Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing

Figure 4 for Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing

Abstract:Humans naturally perform audiovisual speech recognition (AVSR), enhancing the accuracy and robustness by integrating auditory and visual information. Spiking neural networks (SNNs), which mimic the brain's information-processing mechanisms, are well-suited for emulating the human capability of AVSR. Despite their potential, research on SNNs for AVSR is scarce, with most existing audio-visual multimodal methods focused on object or digit recognition. These models simply integrate features from both modalities, neglecting their unique characteristics and interactions. Additionally, they often rely on future information for current processing, which increases recognition latency and limits real-time applicability. Inspired by human speech perception, this paper proposes a novel human-inspired SNN named HI-AVSNN for AVSR, incorporating three key characteristics: cueing interaction, causal processing and spike activity. For cueing interaction, we propose a visual-cued auditory attention module (VCA2M) that leverages visual cues to guide attention to auditory features. We achieve causal processing by aligning the SNN's temporal dimension with that of visual and auditory features and applying temporal masking to utilize only past and current information. To implement spike activity, in addition to using SNNs, we leverage the event camera to capture lip movement as spikes, mimicking the human retina and providing efficient visual data. We evaluate HI-AVSNN on an audiovisual speech recognition dataset combining the DVS-Lip dataset with its corresponding audio samples. Experimental results demonstrate the superiority of our proposed fusion method, outperforming existing audio-visual SNN fusion methods and achieving a 2.27% improvement in accuracy over the only existing SNN-based AVSR method.

Via

Access Paper or Ask Questions

Raindrop Clarity: A Dual-Focused Dataset for Day and Night Raindrop Removal

Jul 24, 2024

Yeying Jin, Xin Li, Jiadong Wang, Yan Zhang, Malu Zhang

Figure 1 for Raindrop Clarity: A Dual-Focused Dataset for Day and Night Raindrop Removal

Figure 2 for Raindrop Clarity: A Dual-Focused Dataset for Day and Night Raindrop Removal

Figure 3 for Raindrop Clarity: A Dual-Focused Dataset for Day and Night Raindrop Removal

Figure 4 for Raindrop Clarity: A Dual-Focused Dataset for Day and Night Raindrop Removal

Abstract:Existing raindrop removal datasets have two shortcomings. First, they consist of images captured by cameras with a focus on the background, leading to the presence of blurry raindrops. To our knowledge, none of these datasets include images where the focus is specifically on raindrops, which results in a blurry background. Second, these datasets predominantly consist of daytime images, thereby lacking nighttime raindrop scenarios. Consequently, algorithms trained on these datasets may struggle to perform effectively in raindrop-focused or nighttime scenarios. The absence of datasets specifically designed for raindrop-focused and nighttime raindrops constrains research in this area. In this paper, we introduce a large-scale, real-world raindrop removal dataset called Raindrop Clarity. Raindrop Clarity comprises 15,186 high-quality pairs/triplets (raindrops, blur, and background) of images with raindrops and the corresponding clear background images. There are 5,442 daytime raindrop images and 9,744 nighttime raindrop images. Specifically, the 5,442 daytime images include 3,606 raindrop- and 1,836 background-focused images. While the 9,744 nighttime images contain 4,838 raindrop- and 4,906 background-focused images. Our dataset will enable the community to explore background-focused and raindrop-focused images, including challenges unique to daytime and nighttime conditions. Our data and code are available at: \url{https://github.com/jinyeying/RaindropClarity}

* Accepted to ECCV2024, dataset and benchmark at: \url{https://github.com/jinyeying/RaindropClarity}

Via

Access Paper or Ask Questions

Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

Apr 29, 2024

Ruijie Tao, Xinyuan Qian, Yidi Jiang, Junjie Li, Jiadong Wang, Haizhou Li

Figure 1 for Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

Figure 2 for Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

Figure 3 for Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

Figure 4 for Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

Abstract:Audio-visual target speaker extraction (AV-TSE) aims to extract the specific person's speech from the audio mixture given auxiliary visual cues. Previous methods usually search for the target voice through speech-lip synchronization. However, this strategy mainly focuses on the existence of target speech, while ignoring the variations of the noise characteristics. That may result in extracting noisy signals from the incorrect sound source in challenging acoustic situations. To this end, we propose a novel reverse selective auditory attention mechanism, which can suppress interference speakers and non-speech signals to avoid incorrect speaker extraction. By estimating and utilizing the undesired noisy signal through this mechanism, we design an AV-TSE framework named Subtraction-and-ExtrAction network (SEANet) to suppress the noisy signals. We conduct abundant experiments by re-implementing three popular AV-TSE methods as the baselines and involving nine metrics for evaluation. The experimental results show that our proposed SEANet achieves state-of-the-art results and performs well for all five datasets. We will release the codes, the models and data logs.

Via

Access Paper or Ask Questions

Enhancing Real-World Active Speaker Detection with Multi-Modal Extraction Pre-Training

Apr 01, 2024

Ruijie Tao, Xinyuan Qian, Rohan Kumar Das, Xiaoxue Gao, Jiadong Wang, Haizhou Li

Figure 1 for Enhancing Real-World Active Speaker Detection with Multi-Modal Extraction Pre-Training

Figure 2 for Enhancing Real-World Active Speaker Detection with Multi-Modal Extraction Pre-Training

Figure 3 for Enhancing Real-World Active Speaker Detection with Multi-Modal Extraction Pre-Training

Figure 4 for Enhancing Real-World Active Speaker Detection with Multi-Modal Extraction Pre-Training

Abstract:Audio-visual active speaker detection (AV-ASD) aims to identify which visible face is speaking in a scene with one or more persons. Most existing AV-ASD methods prioritize capturing speech-lip correspondence. However, there is a noticeable gap in addressing the challenges from real-world AV-ASD scenarios. Due to the presence of low-quality noisy videos in such cases, AV-ASD systems without a selective listening ability are short of effectively filtering out disruptive voice components from mixed audio inputs. In this paper, we propose a Multi-modal Speaker Extraction-to-Detection framework named `MuSED', which is pre-trained with audio-visual target speaker extraction to learn the denoising ability, then it is fine-tuned with the AV-ASD task. Meanwhile, to better capture the multi-modal information and deal with real-world problems such as missing modality, MuSED is modelled on the time domain directly and integrates the multi-modal plus-and-minus augmentation strategy. Our experiments demonstrate that MuSED substantially outperforms the state-of-the-art AV-ASD methods and achieves 95.6% mAP on the AVA-ActiveSpeaker dataset, 98.3% AP on the ASW dataset, and 97.9% F1 on the Columbia AV-ASD dataset, respectively. We will publicly release the code in due course.

* 10 pages

Via

Access Paper or Ask Questions

Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

Mar 29, 2023

Jiadong Wang, Xinyuan Qian, Malu Zhang, Robby T. Tan, Haizhou Li

Figure 1 for Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

Figure 2 for Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

Figure 3 for Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

Figure 4 for Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

Abstract:Talking face generation, also known as speech-to-lip generation, reconstructs facial motions concerning lips given coherent speech input. The previous studies revealed the importance of lip-speech synchronization and visual quality. Despite much progress, they hardly focus on the content of lip movements i.e., the visual intelligibility of the spoken words, which is an important aspect of generation quality. To address the problem, we propose using a lip-reading expert to improve the intelligibility of the generated lip regions by penalizing the incorrect generation results. Moreover, to compensate for data scarcity, we train the lip-reading expert in an audio-visual self-supervised manner. With a lip-reading expert, we propose a novel contrastive learning to enhance lip-speech synchronization, and a transformer to encode audio synchronically with video, while considering global temporal dependency of audio. For evaluation, we propose a new strategy with two different lip-reading experts to measure intelligibility of the generated videos. Rigorous experiments show that our proposal is superior to other State-of-the-art (SOTA) methods, such as Wav2Lip, in reading intelligibility i.e., over 38% Word Error Rate (WER) on LRS2 dataset and 27.8% accuracy on LRW dataset. We also achieve the SOTA performance in lip-speech synchronization and comparable performances in visual quality.

* accepted by CVPR 2023

Via

Access Paper or Ask Questions

Predict-and-Update Network: Audio-Visual Speech Recognition Inspired by Human Speech Perception

Sep 05, 2022

Jiadong Wang, Xinyuan Qian, Haizhou Li

Figure 1 for Predict-and-Update Network: Audio-Visual Speech Recognition Inspired by Human Speech Perception

Figure 2 for Predict-and-Update Network: Audio-Visual Speech Recognition Inspired by Human Speech Perception

Figure 3 for Predict-and-Update Network: Audio-Visual Speech Recognition Inspired by Human Speech Perception

Figure 4 for Predict-and-Update Network: Audio-Visual Speech Recognition Inspired by Human Speech Perception

Abstract:Audio and visual signals complement each other in human speech perception, so do they in speech recognition. The visual hint is less evident than the acoustic hint, but more robust in a complex acoustic environment, as far as speech perception is concerned. It remains a challenge how we effectively exploit the interaction between audio and visual signals for automatic speech recognition. There have been studies to exploit visual signals as redundant or complementary information to audio input in a synchronous manner. Human studies suggest that visual signal primes the listener in advance as to when and on which frequency to attend to. We propose a Predict-and-Update Network (P&U net), to simulate such a visual cueing mechanism for Audio-Visual Speech Recognition (AVSR). In particular, we first predict the character posteriors of the spoken words, i.e. the visual embedding, based on the visual signals. The audio signal is then conditioned on the visual embedding via a novel cross-modal Conformer, that updates the character posteriors. We validate the effectiveness of the visual cueing mechanism through extensive experiments. The proposed P&U net outperforms the state-of-the-art AVSR methods on both LRS2-BBC and LRS3-BBC datasets, with the relative reduced Word Error Rate (WER)s exceeding 10% and 40% under clean and noisy conditions, respectively.

Via

Access Paper or Ask Questions

Multi-target DoA Estimation with an Audio-visual Fusion Mechanism

May 13, 2021

Xinyuan Qian, Maulik Madhavi, Zexu Pan, Jiadong Wang, Haizhou Li

Figure 1 for Multi-target DoA Estimation with an Audio-visual Fusion Mechanism

Figure 2 for Multi-target DoA Estimation with an Audio-visual Fusion Mechanism

Figure 3 for Multi-target DoA Estimation with an Audio-visual Fusion Mechanism

Figure 4 for Multi-target DoA Estimation with an Audio-visual Fusion Mechanism

Abstract:Most of the prior studies in the spatial \ac{DoA} domain focus on a single modality. However, humans use auditory and visual senses to detect the presence of sound sources. With this motivation, we propose to use neural networks with audio and visual signals for multi-speaker localization. The use of heterogeneous sensors can provide complementary information to overcome uni-modal challenges, such as noise, reverberation, illumination variations, and occlusions. We attempt to address these issues by introducing an adaptive weighting mechanism for audio-visual fusion. We also propose a novel video simulation method that generates visual features from noisy target 3D annotations that are synchronized with acoustic features. Experimental results confirm that audio-visual fusion consistently improves the performance of speaker DoA estimation, while the adaptive weighting mechanism shows clear benefits.

* ICASSP 2021 accepted

Via

Access Paper or Ask Questions

Self-Exploration in Complex Unknown Environments using Hybrid Map Representation

Apr 18, 2020

Wenchao Gao, Matthew Booker, Jiadong Wang

Figure 1 for Self-Exploration in Complex Unknown Environments using Hybrid Map Representation

Figure 2 for Self-Exploration in Complex Unknown Environments using Hybrid Map Representation

Figure 3 for Self-Exploration in Complex Unknown Environments using Hybrid Map Representation

Figure 4 for Self-Exploration in Complex Unknown Environments using Hybrid Map Representation

Abstract:A hybrid map representation, which consists of a modified generalized Voronoi Diagram (GVD)-based topological map and a grid-based metric map, is proposed to facilitate a new frontier-driven exploration strategy. Exploration frontiers are the regions on the boundary between open space and unexplored space. A mobile robot is able to construct its map by adding new space and moving to unvisited frontiers until the entire environment has been explored. The existing exploration methods suffer from low exploration efficiency in complex environments due to the lack of a systematical way to determine and assign optimal exploration command. Leveraging on the abstracted information from the GVD map (global) and the detected frontier in the local sliding window, a global-local exploration strategy is proposed to handle the exploration task in a hierarchical manner. The new exploration algorithm is able to create a modified tree structure to represent the environment while consolidating global frontier information during the self-exploration. The proposed method is verified in simulated environments, and then tested in real-world office environments as well.

Via

Access Paper or Ask Questions