Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wenxuan Wu

Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction

Jun 11, 2025

Wenxuan Wu, Shuai Wang, Xixin Wu, Helen Meng, Haizhou Li

Abstract:Audio-visual target speaker extraction (AV-TSE) models primarily rely on target visual cues to isolate the target speaker's voice from others. We know that humans leverage linguistic knowledge, such as syntax and semantics, to support speech perception. Inspired by this, we explore the potential of pre-trained speech-language models (PSLMs) and pre-trained language models (PLMs) as auxiliary knowledge sources for AV-TSE. In this study, we propose incorporating the linguistic constraints from PSLMs or PLMs for the AV-TSE model as additional supervision signals. Without introducing any extra computational cost during inference, the proposed approach consistently improves speech quality and intelligibility. Furthermore, we evaluate our method in multi-language settings and visual cue-impaired scenarios and show robust performance gains.

* Accepted by Interspeech 2025

Via

Access Paper or Ask Questions

$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction

Apr 01, 2025

Wenxuan Wu, Xueyuan Chen, Shuai Wang, Jiadong Wang, Lingwei Meng, Xixin Wu, Helen Meng, Haizhou Li

Abstract:Audio-Visual Target Speaker Extraction (AV-TSE) aims to mimic the human ability to enhance auditory perception using visual cues. Although numerous models have been proposed recently, most of them estimate target signals by primarily relying on local dependencies within acoustic features, underutilizing the human-like capacity to infer unclear parts of speech through contextual information. This limitation results in not only suboptimal performance but also inconsistent extraction quality across the utterance, with some segments exhibiting poor quality or inadequate suppression of interfering speakers. To close this gap, we propose a model-agnostic strategy called the Mask-And-Recover (MAR). It integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. Additionally, to better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model to assess extraction quality and guide extraction modules to emphasize improvement on low-quality segments. To validate the effectiveness of our proposed model-agnostic training paradigm, six popular AV-TSE backbones were adopted for evaluation on the VoxCeleb2 dataset, demonstrating consistent performance improvements across various metrics.

* Accepted by IEEE Journal of Selected Topics in Signal Processing (JSTSP)

Via

Access Paper or Ask Questions

Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

Sep 13, 2024

Lingwei Meng, Shujie Hu, Jiawen Kang, Zhaoqing Li, Yuejiao Wang, Wenxuan Wu, Xixin Wu, Xunying Liu, Helen Meng

Figure 1 for Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

Figure 2 for Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

Figure 3 for Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

Figure 4 for Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

Abstract:Recent advancements in large language models (LLMs) have revolutionized various domains, bringing significant progress and new opportunities. Despite progress in speech-related tasks, LLMs have not been sufficiently explored in multi-talker scenarios. In this work, we present a pioneering effort to investigate the capability of LLMs in transcribing speech in multi-talker environments, following versatile instructions related to multi-talker automatic speech recognition (ASR), target talker ASR, and ASR based on specific talker attributes such as sex, occurrence order, language, and keyword spoken. Our approach utilizes WavLM and Whisper encoder to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. These representations are then fed into an LLM fine-tuned using LoRA, enabling the capabilities for speech comprehension and transcription. Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios, highlighting the potential of LLM to handle speech-related tasks based on user instructions in such complex settings.

Via

Access Paper or Ask Questions

Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy

Mar 24, 2024

Wenxuan Wu, Xueyuan Chen, Xixin Wu, Haizhou Li, Helen Meng

Abstract:Audio-visual target speech extraction (AV-TSE) is one of the enabling technologies in robotics and many audio-visual applications. One of the challenges of AV-TSE is how to effectively utilize audio-visual synchronization information in the process. AV-HuBERT can be a useful pre-trained model for lip-reading, which has not been adopted by AV-TSE. In this paper, we would like to explore the way to integrate a pre-trained AV-HuBERT into our AV-TSE system. We have good reasons to expect an improved performance. To benefit from the inter and intra-modality correlations, we also propose a novel Mask-And-Recover (MAR) strategy for self-supervised learning. The experimental results on the VoxCeleb2 dataset show that our proposed model outperforms the baselines both in terms of subjective and objective metrics, suggesting that the pre-trained AV-HuBERT model provides more informative visual cues for target speech extraction. Furthermore, through a comparative study, we confirm that the proposed Mask-And-Recover strategy is significantly effective.

* Accepted by IJCNN 2024

Via

Access Paper or Ask Questions

MetaSeg: Content-Aware Meta-Net for Omni-Supervised Semantic Segmentation

Jan 22, 2024

Shenwang Jiang, Jianan Li, Ying Wang, Wenxuan Wu, Jizhou Zhang, Bo Huang, Tingfa Xu

Figure 1 for MetaSeg: Content-Aware Meta-Net for Omni-Supervised Semantic Segmentation

Figure 2 for MetaSeg: Content-Aware Meta-Net for Omni-Supervised Semantic Segmentation

Figure 3 for MetaSeg: Content-Aware Meta-Net for Omni-Supervised Semantic Segmentation

Figure 4 for MetaSeg: Content-Aware Meta-Net for Omni-Supervised Semantic Segmentation

Abstract:Noisy labels, inevitably existing in pseudo segmentation labels generated from weak object-level annotations, severely hampers model optimization for semantic segmentation. Previous works often rely on massive hand-crafted losses and carefully-tuned hyper-parameters to resist noise, suffering poor generalization capability and high model complexity. Inspired by recent advances in meta learning, we argue that rather than struggling to tolerate noise hidden behind clean labels passively, a more feasible solution would be to find out the noisy regions actively, so as to simply ignore them during model optimization. With this in mind, this work presents a novel meta learning based semantic segmentation method, MetaSeg, that comprises a primary content-aware meta-net (CAM-Net) to sever as a noise indicator for an arbitrary segmentation model counterpart. Specifically, CAM-Net learns to generate pixel-wise weights to suppress noisy regions with incorrect pseudo labels while highlighting clean ones by exploiting hybrid strengthened features from image content, providing straightforward and reliable guidance for optimizing the segmentation model. Moreover, to break the barrier of time-consuming training when applying meta learning to common large segmentation models, we further present a new decoupled training strategy that optimizes different model layers in a divide-and-conquer manner. Extensive experiments on object, medical, remote sensing and human segmentation shows that our method achieves superior performance, approaching that of fully supervised settings, which paves a new promising way for omni-supervised semantic segmentation.

Via

Access Paper or Ask Questions

BrainZ-BP: A Non-invasive Cuff-less Blood Pressure Estimation Approach Leveraging Brain Bio-impedance and Electrocardiogram

Nov 23, 2023

Bufang Yang, Le Liu, Wenxuan Wu, Mengliang Zhou, Hongxing Liu, Xinbao Ning

Figure 1 for BrainZ-BP: A Non-invasive Cuff-less Blood Pressure Estimation Approach Leveraging Brain Bio-impedance and Electrocardiogram

Figure 2 for BrainZ-BP: A Non-invasive Cuff-less Blood Pressure Estimation Approach Leveraging Brain Bio-impedance and Electrocardiogram

Figure 3 for BrainZ-BP: A Non-invasive Cuff-less Blood Pressure Estimation Approach Leveraging Brain Bio-impedance and Electrocardiogram

Figure 4 for BrainZ-BP: A Non-invasive Cuff-less Blood Pressure Estimation Approach Leveraging Brain Bio-impedance and Electrocardiogram

Abstract:Accurate and continuous blood pressure (BP) monitoring is essential to the early prevention of cardiovascular diseases. Non-invasive and cuff-less BP estimation algorithm has gained much attention in recent years. Previous studies have demonstrated that brain bio-impedance (BIOZ) is a promising technique for non-invasive intracranial pressure (ICP) monitoring. Clinically, treatment for patients with traumatic brain injuries (TBI) requires monitoring the ICP and BP of patients simultaneously. Estimating BP by brain BIOZ directly can reduce the number of sensors attached to the patients, thus improving their comfort. To address the issues, in this study, we explore the feasibility of leveraging brain BIOZ for BP estimation and propose a novel cuff-less BP estimation approach called BrainZ-BP. Two electrodes are placed on the forehead and occipital bone of the head in the anterior-posterior direction for brain BIOZ measurement. Various features including pulse transit time and morphological features of brain BIOZ are extracted and fed into four regression models for BP estimation. Results show that the mean absolute error, root mean square error, and correlation coefficient of random forest regression model are 2.17 mmHg, 3.91 mmHg, and 0.90 for systolic pressure estimation, and are 1.71 mmHg, 3.02 mmHg, and 0.89 for diastolic pressure estimation. The presented BrainZ-BP can be applied in the brain BIOZ-based ICP monitoring scenario to monitor BP simultaneously.

Via

Access Paper or Ask Questions

Kick Bad Guys Out! Zero-Knowledge-Proof-Based Anomaly Detection in Federated Learning

Oct 06, 2023

Shanshan Han, Wenxuan Wu, Baturalp Buyukates, Weizhao Jin, Yuhang Yao, Qifan Zhang, Salman Avestimehr, Chaoyang He

Figure 1 for Kick Bad Guys Out! Zero-Knowledge-Proof-Based Anomaly Detection in Federated Learning

Figure 2 for Kick Bad Guys Out! Zero-Knowledge-Proof-Based Anomaly Detection in Federated Learning

Figure 3 for Kick Bad Guys Out! Zero-Knowledge-Proof-Based Anomaly Detection in Federated Learning

Figure 4 for Kick Bad Guys Out! Zero-Knowledge-Proof-Based Anomaly Detection in Federated Learning

Abstract:Federated learning (FL) systems are vulnerable to malicious clients that submit poisoned local models to achieve their adversarial goals, such as preventing the convergence of the global model or inducing the global model to misclassify some data. Many existing defense mechanisms are impractical in real-world FL systems, as they require prior knowledge of the number of malicious clients or rely on re-weighting or modifying submissions. This is because adversaries typically do not announce their intentions before attacking, and re-weighting might change aggregation results even in the absence of attacks. To address these challenges in real FL systems, this paper introduces a cutting-edge anomaly detection approach with the following features: i) Detecting the occurrence of attacks and performing defense operations only when attacks happen; ii) Upon the occurrence of an attack, further detecting the malicious client models and eliminating them without harming the benign ones; iii) Ensuring honest execution of defense mechanisms at the server by leveraging a zero-knowledge proof mechanism. We validate the superior performance of the proposed approach with extensive experiments.

Via

Access Paper or Ask Questions

PointConvFormer: Revenge of the Point-based Convolution

Aug 04, 2022

Wenxuan Wu, Qi Shan, Li Fuxin

Figure 1 for PointConvFormer: Revenge of the Point-based Convolution

Figure 2 for PointConvFormer: Revenge of the Point-based Convolution

Figure 3 for PointConvFormer: Revenge of the Point-based Convolution

Figure 4 for PointConvFormer: Revenge of the Point-based Convolution

Abstract:We introduce PointConvFormer, a novel building block for point cloud based deep neural network architectures. Inspired by generalization theory, PointConvFormer combines ideas from point convolution, where filter weights are only based on relative position, and Transformers which utilizes feature-based attention. In PointConvFormer, feature difference between points in the neighborhood serves as an indicator to re-weight the convolutional weights. Hence, we preserved the invariances from the point convolution operation whereas attention is used to select relevant points in the neighborhood for convolution. To validate the effectiveness of PointConvFormer, we experiment on both semantic segmentation and scene flow estimation tasks on point clouds with multiple datasets including ScanNet, SemanticKitti, FlyingThings3D and KITTI. Our results show that PointConvFormer substantially outperforms classic convolutions, regular transformers, and voxelized sparse convolution approaches with smaller, more computationally efficient networks. Visualizations show that PointConvFormer performs similarly to convolution on flat surfaces, whereas the neighborhood selection effect is stronger on object boundaries, showing that it got the best of both worlds.

Via

Access Paper or Ask Questions

The Devils in the Point Clouds: Studying the Robustness of Point Cloud Convolutions

Jan 28, 2021

Xingyi Li, Wenxuan Wu, Xiaoli Z. Fern, Li Fuxin

Figure 1 for The Devils in the Point Clouds: Studying the Robustness of Point Cloud Convolutions

Figure 2 for The Devils in the Point Clouds: Studying the Robustness of Point Cloud Convolutions

Figure 3 for The Devils in the Point Clouds: Studying the Robustness of Point Cloud Convolutions

Figure 4 for The Devils in the Point Clouds: Studying the Robustness of Point Cloud Convolutions

Abstract:Recently, there has been a significant interest in performing convolution over irregularly sampled point clouds. Since point clouds are very different from regular raster images, it is imperative to study the generalization of the convolution networks more closely, especially their robustness under variations in scale and rotations of the input data. This paper investigates different variants of PointConv, a convolution network on point clouds, to examine their robustness to input scale and rotation changes. Of the variants we explored, two are novel and generated significant improvements. The first is replacing the multilayer perceptron based weight function with much simpler third degree polynomials, together with a Sobolev norm regularization. Secondly, for 3D datasets, we derive a novel viewpoint-invariant descriptor by utilizing 3D geometric properties as the input to PointConv, in addition to the regular 3D coordinates. We have also explored choices of activation functions, neighborhood, and subsampling methods. Experiments are conducted on the 2D MNIST & CIFAR-10 datasets as well as the 3D SemanticKITTI & ScanNet datasets. Results reveal that on 2D, using third degree polynomials greatly improves PointConv's robustness to scale changes and rotations, even surpassing traditional 2D CNNs for the MNIST dataset. On 3D datasets, the novel viewpoint-invariant descriptor significantly improves the performance as well as robustness of PointConv. We achieve the state-of-the-art semantic segmentation performance on the SemanticKITTI dataset, as well as comparable performance with the current highest framework on the ScanNet dataset among point-based approaches.

Via

Access Paper or Ask Questions

PointPWC-Net: A Coarse-to-Fine Network for Supervised and Self-Supervised Scene Flow Estimation on 3D Point Clouds

Nov 27, 2019

Wenxuan Wu, Zhiyuan Wang, Zhuwen Li, Wei Liu, Li Fuxin

Figure 1 for PointPWC-Net: A Coarse-to-Fine Network for Supervised and Self-Supervised Scene Flow Estimation on 3D Point Clouds

Figure 2 for PointPWC-Net: A Coarse-to-Fine Network for Supervised and Self-Supervised Scene Flow Estimation on 3D Point Clouds

Figure 3 for PointPWC-Net: A Coarse-to-Fine Network for Supervised and Self-Supervised Scene Flow Estimation on 3D Point Clouds

Figure 4 for PointPWC-Net: A Coarse-to-Fine Network for Supervised and Self-Supervised Scene Flow Estimation on 3D Point Clouds

Abstract:We propose a novel end-to-end deep scene flow model, called PointPWC-Net, on 3D point clouds in a coarse-to-fine fashion. Flow computed at the coarse level is upsampled and warped to a finer level, enabling the algorithm to accommodate for large motion without a prohibitive search space. We introduce novel cost volume, upsampling, and warping layers to efficiently handle 3D point cloud data. Unlike traditional cost volumes that require exhaustively computing all the cost values on a high-dimensional grid, our point-based formulation discretizes the cost volume onto input 3D points, and a PointConv operation efficiently computes convolutions on the cost volume. Experiment results on FlyingThings3D outperform the state-of-the-art by a large margin. We further explore novel self-supervised losses to train our model and achieve comparable results to state-of-the-art trained with supervised loss. Without any fine-tuning, our method also shows great generalization ability on KITTI Scene Flow 2015 dataset, outperforming all previous methods.

Via

Access Paper or Ask Questions