Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xihong Wu

Automotive sound field reproduction using deep optimization with spatial domain constraint

Sep 11, 2025

Yufan Qian, Tianshu Qu, Xihong Wu

Abstract:Sound field reproduction with undistorted sound quality and precise spatial localization is desirable for automotive audio systems. However, the complexity of automotive cabin acoustic environment often necessitates a trade-off between sound quality and spatial accuracy. To overcome this limitation, we propose Spatial Power Map Net (SPMnet), a learning-based sound field reproduction method that improves both sound quality and spatial localization in complex environments. We introduce a spatial power map (SPM) constraint, which characterizes the angular energy distribution of the reproduced field using beamforming. This constraint guides energy toward the intended direction to enhance spatial localization, and is integrated into a multi-channel equalization framework to also improve sound quality under reverberant conditions. To address the resulting non-convexity, deep optimization that use neural networks to solve optimization problems is employed for filter design. Both in situ objective and subjective evaluations confirm that our method enhances sound quality and improves spatial localization within the automotive cabin. Furthermore, we analyze the influence of different audio materials and the arrival angles of the virtual sound source in the reproduced sound field, investigating the potential underlying factors affecting these results.

* 41 pages, 9 figures, Revised and submitted to The Journal of the Acoustical Society of America (JASA)

Via

Access Paper or Ask Questions

DPGP: A Hybrid 2D-3D Dual Path Potential Ghost Probe Zone Prediction Framework for Safe Autonomous Driving

Apr 23, 2025

Weiming Qu, Jiawei Du, Shenghai Yuan, Jia Wang, Yang Sun, Shengyi Liu, Yuanhao Zhu, Jianfeng Yu, Song Cao, Rui Xia(+3 more)

Figure 1 for DPGP: A Hybrid 2D-3D Dual Path Potential Ghost Probe Zone Prediction Framework for Safe Autonomous Driving

Figure 2 for DPGP: A Hybrid 2D-3D Dual Path Potential Ghost Probe Zone Prediction Framework for Safe Autonomous Driving

Figure 3 for DPGP: A Hybrid 2D-3D Dual Path Potential Ghost Probe Zone Prediction Framework for Safe Autonomous Driving

Figure 4 for DPGP: A Hybrid 2D-3D Dual Path Potential Ghost Probe Zone Prediction Framework for Safe Autonomous Driving

Abstract:Modern robots must coexist with humans in dense urban environments. A key challenge is the ghost probe problem, where pedestrians or objects unexpectedly rush into traffic paths. This issue affects both autonomous vehicles and human drivers. Existing works propose vehicle-to-everything (V2X) strategies and non-line-of-sight (NLOS) imaging for ghost probe zone detection. However, most require high computational power or specialized hardware, limiting real-world feasibility. Additionally, many methods do not explicitly address this issue. To tackle this, we propose DPGP, a hybrid 2D-3D fusion framework for ghost probe zone prediction using only a monocular camera during training and inference. With unsupervised depth prediction, we observe ghost probe zones align with depth discontinuities, but different depth representations offer varying robustness. To exploit this, we fuse multiple feature embeddings to improve prediction. To validate our approach, we created a 12K-image dataset annotated with ghost probe zones, carefully sourced and cross-checked for accuracy. Experimental results show our framework outperforms existing methods while remaining cost-effective. To our knowledge, this is the first work extending ghost probe zone prediction beyond vehicles, addressing diverse non-vehicle objects. We will open-source our code and dataset for community benefit.

Via

Access Paper or Ask Questions

Using Ear-EEG to Decode Auditory Attention in Multiple-speaker Environment

Sep 13, 2024

Haolin Zhu, Yujie Yan, Xiran Xu, Zhongshu Ge, Pei Tian, Xihong Wu, Jing Chen

Figure 1 for Using Ear-EEG to Decode Auditory Attention in Multiple-speaker Environment

Figure 2 for Using Ear-EEG to Decode Auditory Attention in Multiple-speaker Environment

Figure 3 for Using Ear-EEG to Decode Auditory Attention in Multiple-speaker Environment

Figure 4 for Using Ear-EEG to Decode Auditory Attention in Multiple-speaker Environment

Abstract:Auditory Attention Decoding (AAD) can help to determine the identity of the attended speaker during an auditory selective attention task, by analyzing and processing measurements of electroencephalography (EEG) data. Most studies on AAD are based on scalp-EEG signals in two-speaker scenarios, which are far from real application. Ear-EEG has recently gained significant attention due to its motion tolerance and invisibility during data acquisition, making it easy to incorporate with other devices for applications. In this work, participants selectively attended to one of the four spatially separated speakers' speech in an anechoic room. The EEG data were concurrently collected from a scalp-EEG system and an ear-EEG system (cEEGrids). Temporal response functions (TRFs) and stimulus reconstruction (SR) were utilized using ear-EEG data. Results showed that the attended speech TRFs were stronger than each unattended speech and decoding accuracy was 41.3\% in the 60s (chance level of 25\%). To further investigate the impact of electrode placement and quantity, SR was utilized in both scalp-EEG and ear-EEG, revealing that while the number of electrodes had a minor effect, their positioning had a significant influence on the decoding accuracy. One kind of auditory spatial attention detection (ASAD) method, STAnet, was testified with this ear-EEG database, resulting in 93.1% in 1-second decoding window. The implementation code and database for our work are available on GitHub: https://github.com/zhl486/Ear_EEG_code.git and Zenodo: https://zenodo.org/records/10803261.

Via

Access Paper or Ask Questions

Cross-attention Inspired Selective State Space Models for Target Sound Extraction

Sep 10, 2024

Donghang Wu, Yiwen Wang, Xihong Wu, Tianshu Qu

Figure 1 for Cross-attention Inspired Selective State Space Models for Target Sound Extraction

Figure 2 for Cross-attention Inspired Selective State Space Models for Target Sound Extraction

Figure 3 for Cross-attention Inspired Selective State Space Models for Target Sound Extraction

Figure 4 for Cross-attention Inspired Selective State Space Models for Target Sound Extraction

Abstract:The Transformer model, particularly its cross-attention module, is widely used for feature fusion in target sound extraction which extracts the signal of interest based on given clues. Despite its effectiveness, this approach suffers from low computational efficiency. Recent advancements in state space models, notably the latest work Mamba, have shown comparable performance to Transformer-based methods while significantly reducing computational complexity in various tasks. However, Mamba's applicability in target sound extraction is limited due to its inability to capture dependencies between different sequences as the cross-attention does. In this paper, we propose CrossMamba for target sound extraction, which leverages the hidden attention mechanism of Mamba to compute dependencies between the given clues and the audio mixture. The calculation of Mamba can be divided to the query, key and value. We utilize the clue to generate the query and the audio mixture to derive the key and value, adhering to the principle of the cross-attention mechanism in Transformers. Experimental results from two representative target sound extraction methods validate the efficacy of the proposed CrossMamba.

* 5 pages, 2 figures, submitted to ICASSP 2025

Via

Access Paper or Ask Questions

DENSE: Dynamic Embedding Causal Target Speech Extraction

Sep 10, 2024

Yiwen Wang, Zeyu Yuan, Xihong Wu

Figure 1 for DENSE: Dynamic Embedding Causal Target Speech Extraction

Figure 2 for DENSE: Dynamic Embedding Causal Target Speech Extraction

Figure 3 for DENSE: Dynamic Embedding Causal Target Speech Extraction

Figure 4 for DENSE: Dynamic Embedding Causal Target Speech Extraction

Abstract:Target speech extraction (TSE) focuses on extracting the speech of a specific target speaker from a mixture of signals. Existing TSE models typically utilize static embeddings as conditions for extracting the target speaker's voice. However, the static embeddings often fail to capture the contextual information of the extracted speech signal, which may limit the model's performance. We propose a novel dynamic embedding causal target speech extraction model to address this limitation. Our approach incorporates an autoregressive mechanism to generate context-dependent embeddings based on the extracted speech, enabling real-time, frame-level extraction. Experimental results demonstrate that the proposed model enhances short-time objective intelligibility (STOI) and signal-to-distortion ratio (SDR), offering a promising solution for target speech extraction in challenging scenarios.

* Submitted to ICASSP 2025

Via

Access Paper or Ask Questions

Leveraging Moving Sound Source Trajectories for Universal Sound Separation

Sep 07, 2024

Donghang Wu, Xihong Wu, Tianshu Qu

Figure 1 for Leveraging Moving Sound Source Trajectories for Universal Sound Separation

Figure 2 for Leveraging Moving Sound Source Trajectories for Universal Sound Separation

Figure 3 for Leveraging Moving Sound Source Trajectories for Universal Sound Separation

Figure 4 for Leveraging Moving Sound Source Trajectories for Universal Sound Separation

Abstract:Existing methods utilizing spatial information for sound source separation require prior knowledge of the direction of arrival (DOA) of the source or utilize estimated but imprecise localization results, which impairs the separation performance, especially when the sound sources are moving. In fact, sound source localization and separation are interconnected problems, that is, sound source localization facilitates sound separation while sound separation contributes to more precise source localization. This paper proposes a method utilizing the mutual facilitation mechanism between sound source localization and separation for moving sources. Initially, sound separation is conducted using rough preliminary sound source tracking results. Sound source tracking is then performed on the separated signals thus the tracking results can become more precise. The precise trajectory can further enhances the separation performance. This mutual facilitation process can be performed over several iterations. Simulation experiments conducted under reverberation conditions and with moving sound sources demonstrate that the proposed method can achieve more accurate separation based on more precise tracking results.

* 9 pages,7 figures,submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing(TASLP)

Via

Access Paper or Ask Questions

TSE-PI: Target Sound Extraction under Reverberant Environments with Pitch Information

Jun 13, 2024

Yiwen Wang, Xihong Wu

Figure 1 for TSE-PI: Target Sound Extraction under Reverberant Environments with Pitch Information

Figure 2 for TSE-PI: Target Sound Extraction under Reverberant Environments with Pitch Information

Figure 3 for TSE-PI: Target Sound Extraction under Reverberant Environments with Pitch Information

Figure 4 for TSE-PI: Target Sound Extraction under Reverberant Environments with Pitch Information

Abstract:Target sound extraction (TSE) separates the target sound from the mixture signals based on provided clues. However, the performance of existing models significantly degrades under reverberant conditions. Inspired by auditory scene analysis (ASA), this work proposes a TSE model provided with pitch information named TSE-PI. Conditional pitch extraction is achieved through the Feature-wise Linearly Modulated layer with the sound-class label. A modified Waveformer model combined with pitch information, employing a learnable Gammatone filterbank in place of the convolutional encoder, is used for target sound extraction. The inclusion of pitch information is aimed at improving the model's performance. The experimental results on the FSD50K dataset illustrate 2.4 dB improvements of target sound extraction under reverberant environments when incorporating pitch information and Gammatone filterbank.

* Accepted by Interspeech2024

Via

Access Paper or Ask Questions

Beware of Overestimated Decoding Performance Arising from Temporal Autocorrelations in Electroencephalogram Signals

May 27, 2024

Xiran Xu, Bo Wang, Boda Xiao, Yadong Niu, Yiwen Wang, Xihong Wu, Jing Chen

Figure 1 for Beware of Overestimated Decoding Performance Arising from Temporal Autocorrelations in Electroencephalogram Signals

Figure 2 for Beware of Overestimated Decoding Performance Arising from Temporal Autocorrelations in Electroencephalogram Signals

Figure 3 for Beware of Overestimated Decoding Performance Arising from Temporal Autocorrelations in Electroencephalogram Signals

Figure 4 for Beware of Overestimated Decoding Performance Arising from Temporal Autocorrelations in Electroencephalogram Signals

Abstract:Researchers have reported high decoding accuracy (>95%) using non-invasive Electroencephalogram (EEG) signals for brain-computer interface (BCI) decoding tasks like image decoding, emotion recognition, auditory spatial attention detection, etc. Since these EEG data were usually collected with well-designed paradigms in labs, the reliability and robustness of the corresponding decoding methods were doubted by some researchers, and they argued that such decoding accuracy was overestimated due to the inherent temporal autocorrelation of EEG signals. However, the coupling between the stimulus-driven neural responses and the EEG temporal autocorrelations makes it difficult to confirm whether this overestimation exists in truth. Furthermore, the underlying pitfalls behind overestimated decoding accuracy have not been fully explained due to a lack of appropriate formulation. In this work, we formulate the pitfall in various EEG decoding tasks in a unified framework. EEG data were recorded from watermelons to remove stimulus-driven neural responses. Labels were assigned to continuous EEG according to the experimental design for EEG recording of several typical datasets, and then the decoding methods were conducted. The results showed the label can be successfully decoded as long as continuous EEG data with the same label were split into training and test sets. Further analysis indicated that high accuracy of various BCI decoding tasks could be achieved by associating labels with EEG intrinsic temporal autocorrelation features. These results underscore the importance of choosing the right experimental designs and data splits in BCI decoding tasks to prevent inflated accuracies due to EEG temporal autocorrelation.

Via

Access Paper or Ask Questions

Self-supervised speech representation and contextual text embedding for match-mismatch classification with EEG recording

Feb 01, 2024

Bo Wang, Xiran Xu, Zechen Zhang, Haolin Zhu, YuJie Yan, Xihong Wu, Jing Chen

Figure 1 for Self-supervised speech representation and contextual text embedding for match-mismatch classification with EEG recording

Figure 2 for Self-supervised speech representation and contextual text embedding for match-mismatch classification with EEG recording

Abstract:Relating speech to EEG holds considerable importance but is challenging. In this study, a deep convolutional network was employed to extract spatiotemporal features from EEG data. Self-supervised speech representation and contextual text embedding were used as speech features. Contrastive learning was used to relate EEG features to speech features. The experimental results demonstrate the benefits of using self-supervised speech representation and contextual text embedding. Through feature fusion and model ensemble, an accuracy of 60.29% was achieved, and the performance was ranked as No.2 in Task 1 of the Auditory EEG Challenge (ICASSP 2024). The code to implement our work is available on Github: https://github.com/bobwangPKU/EEG-Stimulus-Match-Mismatch.

* 2 pages, 2 figures, accepted by ICASSP 2024

Via

Access Paper or Ask Questions

ConvConcatNet: a deep convolutional neural network to reconstruct mel spectrogram from the EEG

Jan 10, 2024

Xiran Xu, Bo Wang, Yujie Yan, Haolin Zhu, Zechen Zhang, Xihong Wu, Jing Chen

Abstract:To investigate the processing of speech in the brain, simple linear models are commonly used to establish a relationship between brain signals and speech features. However, these linear models are ill-equipped to model a highly dynamic and complex non-linear system like the brain. Although non-linear methods with neural networks have been developed recently, reconstructing unseen stimuli from unseen subjects' EEG is still a highly challenging task. This work presents a novel method, ConvConcatNet, to reconstruct mel-specgrams from EEG, in which the deep convolution neural network and extensive concatenation operation were combined. With our ConvConcatNet model, the Pearson correlation between the reconstructed and the target mel-spectrogram can achieve 0.0420, which was ranked as No.1 in the Task 2 of the Auditory EEG Challenge. The codes and models to implement our work will be available on Github: https://github.com/xuxiran/ConvConcatNet

* 2 pages, 1 figure, 2 tables

Via

Access Paper or Ask Questions