Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xuesong Chen

SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving

May 22, 2025

Xuesong Chen, Linjiang Huang, Tao Ma, Rongyao Fang, Shaoshuai Shi, Hongsheng Li

Abstract:The integration of Vision-Language Models (VLMs) into autonomous driving systems has shown promise in addressing key challenges such as learning complexity, interpretability, and common-sense reasoning. However, existing approaches often struggle with efficient integration and realtime decision-making due to computational demands. In this paper, we introduce SOLVE, an innovative framework that synergizes VLMs with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between VLM and E2E components. We propose a Trajectory Chain-of-Thought (T-CoT) paradigm, which progressively refines trajectory predictions, reducing uncertainty and improving accuracy. By employing a temporal decoupling strategy, SOLVE achieves efficient cooperation by aligning high-quality VLM outputs with E2E real-time performance. Evaluated on the nuScenes dataset, our method demonstrates significant improvements in trajectory prediction accuracy, paving the way for more robust and reliable autonomous driving systems.

* Accepted by CVPR 2025

Via

Access Paper or Ask Questions

GaussianPainter: Painting Point Cloud into 3D Gaussians with Normal Guidance

Dec 23, 2024

Jingqiu Zhou, Lue Fan, Xuesong Chen, Linjiang Huang, Si Liu, Hongsheng Li

Figure 1 for GaussianPainter: Painting Point Cloud into 3D Gaussians with Normal Guidance

Figure 2 for GaussianPainter: Painting Point Cloud into 3D Gaussians with Normal Guidance

Figure 3 for GaussianPainter: Painting Point Cloud into 3D Gaussians with Normal Guidance

Figure 4 for GaussianPainter: Painting Point Cloud into 3D Gaussians with Normal Guidance

Abstract:In this paper, we present GaussianPainter, the first method to paint a point cloud into 3D Gaussians given a reference image. GaussianPainter introduces an innovative feed-forward approach to overcome the limitations of time-consuming test-time optimization in 3D Gaussian splatting. Our method addresses a critical challenge in the field: the non-uniqueness problem inherent in the large parameter space of 3D Gaussian splatting. This space, encompassing rotation, anisotropic scales, and spherical harmonic coefficients, introduces the challenge of rendering similar images from substantially different Gaussian fields. As a result, feed-forward networks face instability when attempting to directly predict high-quality Gaussian fields, struggling to converge on consistent parameters for a given output. To address this issue, we propose to estimate a surface normal for each point to determine its Gaussian rotation. This strategy enables the network to effectively predict the remaining Gaussian parameters in the constrained space. We further enhance our approach with an appearance injection module, incorporating reference image appearance into Gaussian fields via a multiscale triplane representation. Our method successfully balances efficiency and fidelity in 3D Gaussian generation, achieving high-quality, diverse, and robust 3D content creation from point clouds in a single forward pass.

* To appear in AAAI 2025

Via

Access Paper or Ask Questions

SleepNetZero: Zero-Burden Zero-Shot Reliable Sleep Staging With Neural Networks Based on Ballistocardiograms

Oct 30, 2024

Shuzhen Li, Yuxin Chen, Xuesong Chen, Ruiyang Gao, Yupeng Zhang, Chao Yu, Yunfei Li, Ziyi Ye, Weijun Huang, Hongliang Yi(+2 more)

Abstract:Sleep monitoring plays a crucial role in maintaining good health, with sleep staging serving as an essential metric in the monitoring process. Traditional methods, utilizing medical sensors like EEG and ECG, can be effective but often present challenges such as unnatural user experience, complex deployment, and high costs. Ballistocardiography~(BCG), a type of piezoelectric sensor signal, offers a non-invasive, user-friendly, and easily deployable alternative for long-term home monitoring. However, reliable BCG-based sleep staging is challenging due to the limited sleep monitoring data available for BCG. A restricted training dataset prevents the model from generalization across populations. Additionally, transferring to BCG faces difficulty ensuring model robustness when migrating from other data sources. To address these issues, we introduce SleepNetZero, a zero-shot learning based approach for sleep staging. To tackle the generalization challenge, we propose a series of BCG feature extraction methods that align BCG components with corresponding respiratory, cardiac, and movement channels in PSG. This allows models to be trained on large-scale PSG datasets that are diverse in population. For the migration challenge, we employ data augmentation techniques, significantly enhancing generalizability. We conducted extensive training and testing on large datasets~(12393 records from 9637 different subjects), achieving an accuracy of 0.803 and a Cohen's Kappa of 0.718. ZeroSleepNet was also deployed in real prototype~(monitoring pads) and tested in actual hospital settings~(265 users), demonstrating an accuracy of 0.697 and a Cohen's Kappa of 0.589. To the best of our knowledge, this work represents the first known reliable BCG-based sleep staging effort and marks a significant step towards in-home health monitoring.

* 25 pages

Via

Access Paper or Ask Questions

Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models

May 07, 2024

Tianze Xu, Jiajun Li, Xuesong Chen, Xinrui Yao, Shuchang Liu

Figure 1 for Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models

Figure 2 for Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models

Figure 3 for Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models

Figure 4 for Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models

Abstract:In recent years, AI-Generated Content (AIGC) has witnessed rapid advancements, facilitating the generation of music, images, and other forms of artistic expression across various industries. However, researches on general multi-modal music generation model remain scarce. To fill this gap, we propose a multi-modal music generation framework Mozart's Touch. It could generate aligned music with the cross-modality inputs, such as images, videos and text. Mozart's Touch is composed of three main components: Multi-modal Captioning Module, Large Language Model (LLM) Understanding & Bridging Module, and Music Generation Module. Unlike traditional approaches, Mozart's Touch requires no training or fine-tuning pre-trained models, offering efficiency and transparency through clear, interpretable prompts. We also introduce "LLM-Bridge" method to resolve the heterogeneous representation problems between descriptive texts of different modalities. We conduct a series of objective and subjective evaluations on the proposed model, and results indicate that our model surpasses the performance of current state-of-the-art models. Our codes and examples is availble at: https://github.com/WangTooNaive/MozartsTouch

* 7 pages, 2 figures, submitted to ACM MM 2024

Via

Access Paper or Ask Questions

TrajectoryFormer: 3D Object Tracking Transformer with Predictive Trajectory Hypotheses

Jun 09, 2023

Xuesong Chen, Shaoshuai Shi, Chao Zhang, Benjin Zhu, Qiang Wang, Ka Chun Cheung, Simon See, Hongsheng Li

Figure 1 for TrajectoryFormer: 3D Object Tracking Transformer with Predictive Trajectory Hypotheses

Figure 2 for TrajectoryFormer: 3D Object Tracking Transformer with Predictive Trajectory Hypotheses

Figure 3 for TrajectoryFormer: 3D Object Tracking Transformer with Predictive Trajectory Hypotheses

Figure 4 for TrajectoryFormer: 3D Object Tracking Transformer with Predictive Trajectory Hypotheses

Abstract:3D multi-object tracking (MOT) is vital for many applications including autonomous driving vehicles and service robots. With the commonly used tracking-by-detection paradigm, 3D MOT has made important progress in recent years. However, these methods only use the detection boxes of the current frame to obtain trajectory-box association results, which makes it impossible for the tracker to recover objects missed by the detector. In this paper, we present TrajectoryFormer, a novel point-cloud-based 3D MOT framework. To recover the missed object by detector, we generates multiple trajectory hypotheses with hybrid candidate boxes, including temporally predicted boxes and current-frame detection boxes, for trajectory-box association. The predicted boxes can propagate object's history trajectory information to the current frame and thus the network can tolerate short-term miss detection of the tracked objects. We combine long-term object motion feature and short-term object appearance feature to create per-hypothesis feature embedding, which reduces the computational overhead for spatial-temporal encoding. Additionally, we introduce a Global-Local Interaction Module to conduct information interaction among all hypotheses and models their spatial relations, leading to accurate estimation of hypotheses. Our TrajectoryFormer achieves state-of-the-art performance on the Waymo 3D MOT benchmarks.

* 10 pages, 3 figures

Via

Access Paper or Ask Questions

Brain Topography Adaptive Network for Satisfaction Modeling in Interactive Information Access System

Aug 17, 2022

Ziyi Ye, Xiaohui Xie, Yiqun Liu, Zhihong Wang, Xuesong Chen, Min Zhang, Shaoping Ma

Figure 1 for Brain Topography Adaptive Network for Satisfaction Modeling in Interactive Information Access System

Figure 2 for Brain Topography Adaptive Network for Satisfaction Modeling in Interactive Information Access System

Figure 3 for Brain Topography Adaptive Network for Satisfaction Modeling in Interactive Information Access System

Figure 4 for Brain Topography Adaptive Network for Satisfaction Modeling in Interactive Information Access System

Abstract:With the growth of information on the Web, most users heavily rely on information access systems (e.g., search engines, recommender systems, etc.) in their daily lives. During this procedure, modeling users' satisfaction status plays an essential part in improving their experiences with the systems. In this paper, we aim to explore the benefits of using Electroencephalography (EEG) signals for satisfaction modeling in interactive information access system design. Different from existing EEG classification tasks, the arisen of satisfaction involves multiple brain functions, such as arousal, prototypicality, and appraisals, which are related to different brain topographical areas. Thus modeling user satisfaction raises great challenges to existing solutions. To address this challenge, we propose BTA, a Brain Topography Adaptive network with a multi-centrality encoding module and a spatial attention mechanism module to capture cognitive connectivities in different spatial distances. We explore the effectiveness of BTA for satisfaction modeling in two popular information access scenarios, i.e., search and recommendation. Extensive experiments on two real-world datasets verify the effectiveness of introducing brain topography adaptive strategy in satisfaction modeling. Furthermore, we also conduct search result re-ranking task and video rating prediction task based on the satisfaction inferred from brain signals on search and recommendation scenarios, respectively. Experimental results show that brain signals extracted with BTA help improve the performance of interactive information access systems significantly.

* Accepted by Multimedia 2022 (MM'22) as a full paper

Via

Access Paper or Ask Questions

MPPNet: Multi-Frame Feature Intertwining with Proxy Points for 3D Temporal Object Detection

May 12, 2022

Xuesong Chen, Shaoshuai Shi, Benjin Zhu, Ka Chun Cheung, Hang Xu, Hongsheng Li

Figure 1 for MPPNet: Multi-Frame Feature Intertwining with Proxy Points for 3D Temporal Object Detection

Figure 2 for MPPNet: Multi-Frame Feature Intertwining with Proxy Points for 3D Temporal Object Detection

Figure 3 for MPPNet: Multi-Frame Feature Intertwining with Proxy Points for 3D Temporal Object Detection

Figure 4 for MPPNet: Multi-Frame Feature Intertwining with Proxy Points for 3D Temporal Object Detection

Abstract:Accurate and reliable 3D detection is vital for many applications including autonomous driving vehicles and service robots. In this paper, we present a flexible and high-performance 3D detection framework, named MPPNet, for 3D temporal object detection with point cloud sequences. We propose a novel three-hierarchy framework with proxy points for multi-frame feature encoding and interactions to achieve better detection. The three hierarchies conduct per-frame feature encoding, short-clip feature fusion, and whole-sequence feature aggregation, respectively. To enable processing long-sequence point clouds with reasonable computational resources, intra-group feature mixing and inter-group feature attention are proposed to form the second and third feature encoding hierarchies, which are recurrently applied for aggregating multi-frame trajectory features. The proxy points not only act as consistent object representations for each frame, but also serve as the courier to facilitate feature interaction between frames. The experiments on largeWaymo Open dataset show that our approach outperforms state-of-the-art methods with large margins when applied to both short (e.g., 4-frame) and long (e.g., 16-frame) point cloud sequences. Specifically, MPPNet achieves 74.21%, 74.62% and 73.31% for vehicle, pedestrian and cyclist classes on the LEVEL 2 mAPH metric with 16-frame input.

* 17 pages, 2 figures

Via

Access Paper or Ask Questions

Web Search via an Efficient and Effective Brain-Machine Interface

Oct 15, 2021

Xuesong Chen, Ziyi Ye, Xiaohui Xie, Yiqun Liu, Weihang Su, Shuqi Zhu, Min Zhang, Shaoping Ma

Figure 1 for Web Search via an Efficient and Effective Brain-Machine Interface

Figure 2 for Web Search via an Efficient and Effective Brain-Machine Interface

Figure 3 for Web Search via an Efficient and Effective Brain-Machine Interface

Abstract:While search technologies have evolved to be robust and ubiquitous, the fundamental interaction paradigm has remained relatively stable for decades. With the maturity of the Brain-Machine Interface, we build an efficient and effective communication system between human beings and search engines based on electroencephalogram(EEG) signals, called Brain-Machine Search Interface(BMSI) system. The BMSI system provides functions including query reformulation and search result interaction. In our system, users can perform search tasks without having to use the mouse and keyboard. Therefore, it is useful for application scenarios in which hand-based interactions are infeasible, e.g, for users with severe neuromuscular disorders. Besides, based on brain signals decoding, our system can provide abundant and valuable user-side context information(e.g., real-time satisfaction feedback, extensive context information, and a clearer description of information needs) to the search engine, which is hard to capture in the previous paradigm. In our implementation, the system can decode user satisfaction from brain signals in real-time during the interaction process and re-rank the search results list based on user satisfaction feedback. The demo video is available at http://www.thuir.cn/group/YQLiu/datasets/BMSISystem.mp4.

Via

Access Paper or Ask Questions

Why Don't You Click: Neural Correlates of Non-Click Behaviors in Web Search

Sep 22, 2021

Ziyi Ye, Xiaohui Xie, Yiqun Liu, Zhihong Wang, Xuancheng Li, Jiaji Li, Xuesong Chen, Min Zhang, Shaoping Ma

Figure 1 for Why Don't You Click: Neural Correlates of Non-Click Behaviors in Web Search

Figure 2 for Why Don't You Click: Neural Correlates of Non-Click Behaviors in Web Search

Figure 3 for Why Don't You Click: Neural Correlates of Non-Click Behaviors in Web Search

Figure 4 for Why Don't You Click: Neural Correlates of Non-Click Behaviors in Web Search

Abstract:Web search heavily relies on click-through behavior as an essential feedback signal for performance improvement and evaluation. Traditionally, click is usually treated as a positive implicit feedback signal of relevance or usefulness, while non-click (especially non-click after examination) is regarded as a signal of irrelevance or uselessness. However, there are many cases where users do not click on any search results but still satisfy their information need with the contents of the results shown on the Search Engine Result Page (SERP). This raises the problem of measuring result usefulness and modeling user satisfaction in "Zero-click" search scenarios. Previous works have solved this issue by (1) detecting user satisfaction for abandoned SERP with context information and (2) considering result-level click necessity with external assessors' annotations. However, few works have investigated the reason behind non-click behavior and estimated the usefulness of non-click results. A challenge for this research question is how to collect valuable feedback for non-click results. With neuroimaging technologies, we design a lab-based user study and reveal differences in brain signals while examining non-click search results with different usefulness levels. The findings in significant brain regions and electroencephalogram~(EEG) spectrum also suggest that the process of usefulness judgment might involve similar cognitive functions of relevance perception and satisfaction decoding. Inspired by these findings, we conduct supervised learning tasks to estimate the usefulness of non-click results with brain signals and conventional information (i.e., content and context factors). Results show that it is feasible to utilize brain signals to improve usefulness estimation performance and enhancing human-computer interactions in "Zero-click" search scenarios.

Via

Access Paper or Ask Questions

Understanding Human Reading Comprehension with Brain Signals

Aug 04, 2021

Ziyi Ye, Xiaohui Xie, Yiqun Liu, Zhihong Wang, Xuesong Chen, Min Zhang, Shaoping Ma

Figure 1 for Understanding Human Reading Comprehension with Brain Signals

Figure 2 for Understanding Human Reading Comprehension with Brain Signals

Figure 3 for Understanding Human Reading Comprehension with Brain Signals

Figure 4 for Understanding Human Reading Comprehension with Brain Signals

Abstract:Reading comprehension is a complex cognitive process involving many human brain activities. Plenty of works have studied the reading patterns and attention allocation mechanisms in the reading process. However, little is known about what happens in human brain during reading comprehension and how we can utilize this information as implicit feedback to facilitate information acquisition performance. With the advances in brain imaging techniques such as EEG, it is possible to collect high-precision brain signals in almost real time. With neuroimaging techniques, we carefully design a lab-based user study to investigate brain activities during reading comprehension. Our findings show that neural responses vary with different types of contents, i.e., contents that can satisfy users' information needs and contents that cannot. We suggest that various cognitive activities, e.g., cognitive loading, semantic-thematic understanding, and inferential processing, at the micro-time scale during reading comprehension underpin these neural responses. Inspired by these detectable differences in cognitive activities, we construct supervised learning models based on EEG features for two reading comprehension tasks: answer sentence classification and answer extraction. Results show that it is feasible to improve their performance with brain signals. These findings imply that brain signals are valuable feedback for enhancing human-computer interactions during reading comprehension.

Via

Access Paper or Ask Questions