Abstract:Singlish, a creole language rooted in English, is a key focus of linguistic research in multilingual and multicultural contexts. However, its spoken form remains underexplored, limiting insights into its linguistic structure and applications. To address this gap, we standardize and annotate the largest spoken Singlish corpus, introducing the Multitask National Speech Corpus (MNSC). The corpus supports diverse tasks, including Automatic Speech Recognition (ASR), Spoken Question Answering (SQA), Spoken Dialogue Summarization (SDS), and Paralinguistic Question Answering (PQA). We release standardized splits and a human-verified test set to facilitate further research. Additionally, we propose SingAudioLLM, a multi-task multimodal model leveraging multimodal large language models to handle these tasks concurrently. Experiments demonstrate our model's adaptability to the Singlish context: it achieves state-of-the-art performance, outperforming other AudioLLMs and cascaded solutions by 10-30%.
Abstract:Supervised deep-learning (SDL) techniques with paired training datasets have been widely studied for X-ray computed tomography (CT) image reconstruction. However, because paired training datasets are difficult to obtain in clinical routine, SDL methods remain far from common use in clinical practice. In recent years, self-supervised deep-learning (SSDL) techniques have shown great potential for CT image reconstruction. In this work, we propose a self-supervised cross-task mutual learning (SS-CTML) framework for CT image reconstruction. Specifically, a sparse-view sinogram and a limited-view sinogram are first extracted from a full-view sinogram, yielding three individual reconstruction tasks: full-view CT (FVCT), sparse-view CT (SVCT), and limited-view CT (LVCT) reconstruction. Three neural networks are then constructed, one for each task. Since the ultimate goal of all three tasks is to reconstruct high-quality CT images, we construct a set of cross-task mutual learning objectives through which the three networks can be optimized in a self-supervised manner by learning from one another. Clinical datasets are adopted to evaluate the effectiveness of the proposed framework. Experimental results demonstrate that the SS-CTML framework achieves promising CT image reconstruction performance in terms of both quantitative and qualitative measurements.
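To make the data setup concrete, here is a minimal sketch of how the two sub-view sinograms can be derived from a full-view scan. It assumes the sinogram is stored as a NumPy array of shape (n_views, n_detectors); the subsampling factor and angular fraction are illustrative choices, not the paper's settings:

```python
import numpy as np

def derive_subview_sinograms(full_view, sparse_factor=4, limited_fraction=0.5):
    """Derive sparse-view and limited-view sinograms from a full-view scan.

    full_view: array of shape (n_views, n_detectors).
    sparse_factor: keep every k-th projection angle (illustrative value).
    limited_fraction: keep only the first fraction of the angular range.
    """
    sparse_view = full_view[::sparse_factor]             # uniform angular subsampling
    n_limited = int(full_view.shape[0] * limited_fraction)
    limited_view = full_view[:n_limited]                 # truncated angular range
    return sparse_view, limited_view

# Example: a 720-view sinogram yields 180 sparse views and 360 limited views.
sino = np.random.rand(720, 512).astype(np.float32)
sv, lv = derive_subview_sinograms(sino)
print(sv.shape, lv.shape)  # (180, 512) (360, 512)
```

Each of the three arrays then feeds its own reconstruction network, and the mutual learning objectives couple the three outputs.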
Abstract:We introduce MERaLiON-AudioLLM (Multimodal Empathetic Reasoning and Learning in One Network), the first speech-text model tailored for Singapore's multilingual and multicultural landscape. Developed under the National Large Language Models Funding Initiative, Singapore, MERaLiON-AudioLLM integrates advanced speech and text processing to address the diverse linguistic nuances of local accents and dialects, enhancing accessibility and usability in complex, multilingual environments. Our results demonstrate improvements in both speech recognition and task-specific understanding, positioning MERaLiON-AudioLLM as a pioneering solution for region-specific AI applications. We envision this release setting a precedent for future models designed to address localised linguistic and cultural contexts in a global framework.
Abstract:Current vision-language models may incorporate single-dimensional spatial cues, such as depth, object boundary, and basic spatial directions (e.g. left, right, front, back), yet often lack the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications. To address this gap, we develop SPHERE (Spatial Perception and Hierarchical Evaluation of REasoning), a hierarchical evaluation framework with a new human-annotated dataset to pinpoint model strengths and weaknesses, advancing from single-skill tasks to multi-skill tasks, and ultimately to complex reasoning tasks that require the integration of multiple spatial and visual cues with logical reasoning. Benchmark evaluation of state-of-the-art open-source models reveals significant shortcomings, especially in the abilities to understand distance and proximity, to reason from both allocentric and egocentric viewpoints, and to perform complex reasoning in a physical context. This work underscores the need for more advanced approaches to spatial understanding and reasoning, paving the way for improvements in vision-language models and their alignment with human-like spatial capabilities. The dataset will be open-sourced upon publication.
Abstract:Particle flow (PFL) is an effective method for overcoming particle degeneracy, the main limitation of particle filtering. In PFL, particles are migrated towards regions of high likelihood based on the solution of a partial differential equation. Recently proposed stochastic PFL introduces a diffusion term in the ordinary differential equation (ODE) that describes particle motion. This diffusion term reduces the stiffness of the ODE and makes it possible to perform PFL with fewer numerical integration steps than traditional deterministic PFL. In this work, we introduce a general approach to importance sampling (IS) based on stochastic PFL. Our method makes it possible to evaluate a "flow-induced" proposal probability density function (PDF) after the parameters of a Gaussian mixture model (GMM) have been migrated by stochastic PFL. Compared to conventional stochastic PFL, the resulting processing step is asymptotically optimal. Within our method, the diffusion matrix that describes the diffusion term of the ODE can be optimized to improve the trade-off between accuracy and computational complexity. Simulation results for a highly nonlinear 3-D source localization scenario showcase reduced stiffness of the ODE and improved estimation accuracy compared to state-of-the-art deterministic and stochastic PFL.
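To make the drift-plus-diffusion idea concrete, the toy sketch below migrates particles from a scalar Gaussian prior to the posterior, where the exact answer is known in closed form. It uses a Langevin-style annealed flow as a stand-in for the paper's asymptotically optimal stochastic flow; the diffusion strength q and the step count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: Gaussian prior N(m0, P) and linear-Gaussian likelihood N(z; x, R),
# so the exact posterior mean is available for comparison.
m0, P, R, z = 0.0, 2.0, 0.5, 1.5
q = 1.0                                   # diffusion strength (illustrative)
n_particles, n_steps = 2000, 100
lam = np.linspace(0.0, 1.0, n_steps + 1)  # homotopy parameter: prior -> posterior

x = rng.normal(m0, np.sqrt(P), n_particles)  # particles drawn from the prior

def grad_log_p(x, lam_k):
    """Gradient of log p_lam, with p_lam proportional to prior * likelihood^lam."""
    return -(x - m0) / P + lam_k * (z - x) / R

# Euler-Maruyama integration of the drift-plus-diffusion flow.
for k in range(n_steps):
    dlam = lam[k + 1] - lam[k]
    x = (x + q * grad_log_p(x, lam[k + 1]) * dlam
           + np.sqrt(2.0 * q * dlam) * rng.standard_normal(n_particles))

exact_mean = (m0 / P + z / R) / (1.0 / P + 1.0 / R)
print(f"particle mean {x.mean():.3f} vs exact posterior mean {exact_mean:.3f}")
```

The diffusion term is what allows coarse steps in lam without the stiffness problems of a purely deterministic flow; in the paper this role is played by an optimized diffusion matrix rather than the scalar q used here.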
Abstract:Obstructive Sleep Apnea-Hypopnea Syndrome (OSAHS) is a sleep-related breathing disorder associated with significant morbidity and mortality worldwide. The gold standard for OSAHS diagnosis, polysomnography (PSG), faces challenges in widespread adoption due to its high cost and complexity. Recently, radar has shown potential for detecting sleep apnea-hypopnea events (SAE), offering the advantages of low cost and non-contact monitoring. However, existing studies, especially those using deep learning, employ a segment-based classification approach for SAE detection, which makes estimating the number of events difficult. Additionally, radar-based SAE detection is susceptible to interference from body movements and the environment. Oxygen saturation (SpO2) offers valuable information about OSAHS, but it also has limitations and cannot be used alone for diagnosis. In this study, we propose ROSA, a method that uses a millimeter-wave radar and a pulse oximeter to detect SAE. It fuses information from both sensors and directly predicts the temporal localization of SAE. Experimental results demonstrate a high degree of consistency (ICC=0.9864) between the AHI from ROSA and that from PSG. This study presents an effective, low-load approach to the diagnosis of OSAHS.
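The difference between segment classification and direct temporal localization is easiest to see in code. The minimal sketch below converts a per-second SAE probability track into event intervals and an AHI estimate; the 0.5 threshold, the 10-second minimum duration (from standard scoring conventions), and the synthetic input are all illustrative, not details of ROSA:

```python
import numpy as np

def events_from_probs(probs, threshold=0.5, min_duration_s=10):
    """Convert per-second SAE probabilities into (start, end) event intervals.

    Apnea-hypopnea events are conventionally required to last >= 10 s;
    the 0.5 threshold here is illustrative.
    """
    active = probs >= threshold
    events, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            if t - start >= min_duration_s:
                events.append((start, t))
            start = None
    if start is not None and len(active) - start >= min_duration_s:
        events.append((start, len(active)))
    return events

probs = np.clip(np.sin(np.arange(3600) / 40.0), 0, 1)  # fake 1-hour track
events = events_from_probs(probs)
ahi = len(events) / (len(probs) / 3600.0)               # events per hour
print(len(events), round(ahi, 1))
```

Counting intervals rather than classified segments is what makes the AHI directly comparable with PSG scoring.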
Abstract:Study Objectives: To evaluate the agreement between a millimeter-wave radar-based device and polysomnography (PSG) in the diagnosis of obstructive sleep apnea (OSA) and the classification of sleep stages in children. Methods: 281 children, aged 1 to 18 years, who underwent sleep monitoring between September and November 2023 at the Sleep Center of Beijing Children's Hospital, Capital Medical University, were recruited for the study. All enrolled children underwent simultaneous sleep monitoring by PSG and the millimeter-wave radar-based device, QSA600. QSA600 recordings were automatically analyzed using a deep learning model, while the PSG data were manually scored. Results: The Obstructive Apnea-Hypopnea Index (OAHI) obtained from QSA600 and PSG demonstrates a high level of agreement, with an intraclass correlation coefficient of 0.945 (95% CI: 0.93 to 0.96). Bland-Altman analysis indicates that the mean difference in OAHI between QSA600 and PSG is -0.10 events/h (95% CI: -11.15 to 10.96). Evaluated through cross-validation, the deep learning model showed good sensitivity (81.8%, 84.3%, and 89.7%) and specificity (90.5%, 95.3%, and 97.1%) for diagnosing children with OAHI>1, OAHI>5, and OAHI>10; the corresponding areas under the receiver operating characteristic curve are 0.923, 0.955, and 0.988. For sleep stage classification, the model achieved Kappa coefficients of 0.854, 0.781, and 0.734, with corresponding overall accuracies of 95.0%, 84.8%, and 79.7% for Wake-Sleep, Wake-REM-Light-Deep, and Wake-REM-N1-N2-N3 classification, respectively. Conclusions: QSA600 demonstrates high agreement with PSG in diagnosing OSA and performing sleep staging in children. The device is portable, low-load, and suitable for follow-up and long-term pediatric sleep assessment.
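For readers reproducing the agreement statistics, here is a short sketch of the Bland-Altman bias and 95% limits of agreement reported above. The data are synthetic stand-ins, not the study's measurements:

```python
import numpy as np

rng = np.random.default_rng(1)
oahi_psg = rng.gamma(shape=2.0, scale=3.0, size=281)   # synthetic PSG OAHI values
oahi_qsa = oahi_psg + rng.normal(0, 2.0, size=281)     # synthetic device OAHI values

diff = oahi_qsa - oahi_psg
bias = diff.mean()                                      # Bland-Altman mean difference
loa = 1.96 * diff.std(ddof=1)                           # half-width of 95% limits
print(f"bias {bias:.2f} events/h, LoA [{bias - loa:.2f}, {bias + loa:.2f}]")

# The ICC can be computed with a package such as pingouin (pg.intraclass_corr),
# if available; it is omitted here to keep the sketch NumPy-only.
```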
Abstract:The rapid advancements in large language models (LLMs) have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently fine-tuned on specific audio tasks. However, the pre-trained audio encoder has constrained capacity to capture features for new tasks and datasets. To address this, we propose incorporating mixtures of "weak" encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders that are selectively activated based on the audio input, enhancing feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
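A schematic PyTorch sketch of the MoWE idea, not the paper's exact architecture: a base encoder's features are supplemented by a router-gated pool of lightweight encoders. All dimensions, module shapes, and the top-k routing rule are assumptions, and the experts are evaluated densely for clarity:

```python
import torch
import torch.nn as nn

class MoWE(nn.Module):
    """Schematic mixture of weak encoders: a base encoder's features are
    supplemented by a pool of lightweight encoders chosen by a router."""

    def __init__(self, base_encoder, dim=256, n_weak=4, top_k=2):
        super().__init__()
        self.base = base_encoder   # stands in for a pre-trained audio encoder
        self.weak = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                          nn.Linear(dim // 4, dim))
            for _ in range(n_weak)
        ])
        self.router = nn.Linear(dim, n_weak)
        self.top_k = top_k

    def forward(self, x):                                  # x: (batch, time, dim)
        h = self.base(x)
        gates = self.router(h.mean(dim=1)).softmax(-1)     # route on pooled features
        topv, topi = gates.topk(self.top_k, dim=-1)
        mask = torch.zeros_like(gates).scatter_(1, topi, topv)
        # Dense expert evaluation for clarity; a real system would dispatch sparsely.
        weak_out = torch.stack([w(h) for w in self.weak], dim=1)  # (B, n_weak, T, D)
        return h + (mask[:, :, None, None] * weak_out).sum(dim=1)

model = MoWE(base_encoder=nn.Identity())
print(model(torch.randn(2, 50, 256)).shape)  # torch.Size([2, 50, 256])
```

Because each weak encoder is small and only a few are activated per input, the added capacity scales with task diversity rather than with total model size.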
Abstract:Localization using time-difference of arrival (TDOA) has myriad applications, e.g., in passive surveillance systems and marine mammal research. In this paper, we present a Bayesian estimation method that can localize an unknown number of static sources in 3-D from TDOA measurements. The proposed localization algorithm, based on particle flow (PFL), can overcome the challenges posed by the highly nonlinear TDOA measurement model, data association (DA) uncertainty, and uncertainty in the number of sources to be localized. Different PFL strategies are compared within a unified belief propagation (BP) framework on a challenging multisensor source localization problem. In particular, we consider PFL-based approximation of beliefs using one or multiple Gaussian kernels, with parameters computed via deterministic and stochastic flow processes. Our numerical results demonstrate that the proposed method correctly determines the number of sources and provides accurate location estimates. The stochastic flow achieves greater accuracy than the deterministic flow when using the same number of particles.
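The nonlinearity of the TDOA measurement model is easy to see in a few lines. The sketch below evaluates the model for a single source and recovers it with a coarse grid search; the acoustic propagation speed, sensor geometry, and the grid search itself (a crude stand-in for the PFL/BP inference described above) are all illustrative:

```python
import numpy as np

C = 343.0  # propagation speed in m/s (illustrative acoustic setting)

def tdoa(source, sensors, ref=0):
    """TDOA of a 3-D `source` at each sensor relative to sensor `ref`."""
    d = np.linalg.norm(sensors - source, axis=1)   # range to each sensor
    return (d - d[ref]) / C                         # range differences -> delays

sensors = np.array([[0., 0., 0.], [10., 0., 0.], [0., 10., 0.], [0., 0., 10.]])
true_src = np.array([3., 4., 2.])
z = tdoa(true_src, sensors) + 1e-6 * np.random.default_rng(2).standard_normal(4)

# Coarse grid search over candidate locations.
grid = np.stack(np.meshgrid(*[np.linspace(0, 10, 41)] * 3), axis=-1).reshape(-1, 3)
errs = [np.sum((tdoa(g, sensors) - z) ** 2) for g in grid]
print(grid[int(np.argmin(errs))])  # close to true_src
```

The range-difference form of the model is what makes the likelihood surface highly nonlinear in the source position, which is precisely the regime where particle flow pays off over simpler proposal densities.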