Haotian Shi

V2X-REALM: Vision-Language Model-Based Robust End-to-End Cooperative Autonomous Driving with Adaptive Long-Tail Modeling

Jun 26, 2025

Junwei You, Pei Li, Zhuoyu Jiang, Zilin Huang, Rui Gan, Haotian Shi, Bin Ran

Abstract:Ensuring robust planning and decision-making under rare, diverse, and visually degraded long-tail scenarios remains a fundamental challenge for autonomous driving in urban environments. This issue becomes more critical in cooperative settings, where vehicles and infrastructure jointly perceive and reason across complex environments. To address this challenge, we propose V2X-REALM, a vision-language model (VLM)-based framework with adaptive multimodal learning for robust cooperative autonomous driving under long-tail scenarios. V2X-REALM introduces three core innovations: (i) a prompt-driven long-tail scenario generation and evaluation pipeline that leverages foundation models to synthesize realistic long-tail conditions such as snow and fog across vehicle- and infrastructure-side views, enriching training diversity efficiently; (ii) a gated multi-scenario adaptive attention module that modulates the visual stream using scenario priors to recalibrate ambiguous or corrupted features; and (iii) a multi-task scenario-aware contrastive learning objective that improves multimodal alignment and promotes cross-scenario feature separability. Extensive experiments demonstrate that V2X-REALM significantly outperforms existing baselines in robustness, semantic reasoning, safety, and planning accuracy under complex, challenging driving conditions, advancing the scalability of end-to-end cooperative autonomous driving.

Towards Robust Multi-UAV Collaboration: MARL with Noise-Resilient Communication and Attention Mechanisms

Mar 04, 2025

Zilin Zhao, Chishui Chen, Haotian Shi, Jiale Chen, Xuanlin Yue, Zhejian Yang, Yang Liu

Abstract:Efficient path planning for unmanned aerial vehicles (UAVs) is crucial in remote sensing and information collection. As task scales expand, the cooperative deployment of multiple UAVs significantly improves information collection efficiency. However, collaborative communication and decision-making for multiple UAVs remain major challenges in path planning, especially in noisy environments. To efficiently accomplish complex information collection tasks in 3D space and address robust communication issues, we propose a multi-agent reinforcement learning (MARL) framework for UAV path planning based on the Counterfactual Multi-Agent Policy Gradients (COMA) algorithm. The framework incorporates an attention mechanism-based UAV communication protocol and a training-deployment system, significantly improving communication robustness and individual decision-making capabilities in noisy conditions. Experiments conducted on both synthetic and real-world datasets demonstrate that our method outperforms existing algorithms in terms of path planning efficiency and robustness, especially in noisy environments, achieving a 78% improvement in entropy reduction.

Online Adaptive Platoon Control for Connected and Automated Vehicles via Physics Enhanced Residual Learning

Dec 30, 2024

Peng Zhang, Heye Huang, Hang Zhou, Haotian Shi, Keke Long, Xiaopeng Li

Abstract:This paper introduces a physics enhanced residual learning (PERL) framework for connected and automated vehicle (CAV) platoon control, addressing the dynamics and unpredictability inherent to platoon systems. The framework first develops a physics-based controller to model vehicle dynamics, using driving speed as input to optimize safety and efficiency. Then the residual controller, based on neural network (NN) learning, enriches the prior knowledge of the physical model and corrects residuals caused by vehicle dynamics. By integrating the physical model with data-driven online learning, the PERL framework retains the interpretability and transparency of physics-based models and enhances the adaptability and precision of data-driven learning, achieving significant improvements in computational efficiency and control accuracy in dynamic scenarios. Simulation and robot car platform tests demonstrate that PERL significantly outperforms pure physical and learning models, reducing average cumulative absolute position and speed errors by up to 58.5% and 40.1% (physical model) and 58.4% and 47.7% (NN model). The reduced-scale robot car platform tests further validate the adaptive PERL framework's superior accuracy and rapid convergence under dynamic disturbances, reducing position and speed cumulative errors by 72.73% and 99.05% (physical model) and 64.71% and 72.58% (NN model). PERL enhances platoon control performance through online parameter updates when external disturbances are detected. Results demonstrate the advanced framework's exceptional accuracy and rapid convergence capabilities, proving its effectiveness in maintaining platoon stability under diverse conditions.

* 25 pages, 12 figures
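
The PERL abstract above describes a physics-based controller whose output is corrected by a learned residual term that is updated online. A toy sketch of that split follows. It is not the paper's controller: the linear feedback law, the gains `kp` and `kd`, the constant disturbance, and the one-parameter learner standing in for the neural network are all assumptions chosen for brevity.

```python
def physics_accel(gap_error, speed_error, kp=0.5, kd=0.8):
    """Hypothetical physics-based law: linear feedback on gap and speed errors."""
    return kp * gap_error + kd * speed_error


class ResidualLearner:
    """One-parameter online learner (LMS update) standing in for the NN residual model."""

    def __init__(self, lr=0.1):
        self.bias = 0.0
        self.lr = lr

    def predict(self):
        return self.bias

    def update(self, observed_residual):
        # Move the estimate toward the part the physics model failed to explain.
        self.bias += self.lr * (observed_residual - self.bias)


def run(steps=200, disturbance=-0.3):
    learner = ResidualLearner()
    errors = []
    for _ in range(steps):
        gap_error, speed_error = 1.0, 0.2        # fixed toy inputs
        a_phys = physics_accel(gap_error, speed_error)
        a_true = a_phys + disturbance            # unmodeled dynamics (assumed constant)
        a_perl = a_phys + learner.predict()      # physics output plus learned correction
        errors.append(abs(a_true - a_perl))
        learner.update(a_true - a_phys)          # online update from the observed residual
    return errors


errors = run()
print(f"first-step error {errors[0]:.3f}, last-step error {errors[-1]:.2e}")
```

The physics term stays interpretable because it is never modified; only the small correction is learned, which is the property the abstract attributes to the residual design.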

FollowGen: A Scaled Noise Conditional Diffusion Model for Car-Following Trajectory Prediction

Nov 23, 2024

Junwei You, Rui Gan, Weizhe Tang, Zilin Huang, Jiaxi Liu, Zhuoyu Jiang, Haotian Shi, Keshu Wu, Keke Long, Sicheng Fu (+2 more)

Abstract:Vehicle trajectory prediction is crucial for advancing autonomous driving and advanced driver assistance systems (ADAS). Although deep learning-based approaches - especially those utilizing transformer-based and generative models - have markedly improved prediction accuracy by capturing complex, non-linear patterns in vehicle dynamics and traffic interactions, they frequently overlook detailed car-following behaviors and the inter-vehicle interactions critical for real-world driving applications, particularly in fully autonomous or mixed traffic scenarios. To address the issue, this study introduces a scaled noise conditional diffusion model for car-following trajectory prediction, which integrates detailed inter-vehicular interactions and car-following dynamics into a generative framework, improving both the accuracy and plausibility of predicted trajectories. The model utilizes a novel pipeline to capture historical vehicle dynamics by scaling noise with encoded historical features within the diffusion process. Particularly, it employs a cross-attention-based transformer architecture to model intricate inter-vehicle dependencies, effectively guiding the denoising process and enhancing prediction accuracy. Experimental results on diverse real-world driving scenarios demonstrate the state-of-the-art performance and robustness of the proposed method.

* arXiv admin note: text overlap with arXiv:2406.11941

Hypergraph-based Motion Generation with Multi-modal Interaction Relational Reasoning

Sep 18, 2024

Keshu Wu, Yang Zhou, Haotian Shi, Dominique Lord, Bin Ran, Xinyue Ye

Abstract:The intricate nature of real-world driving environments, characterized by dynamic and diverse interactions among multiple vehicles and their possible future states, presents considerable challenges in accurately predicting the motion states of vehicles and handling the uncertainty inherent in the predictions. Addressing these challenges requires comprehensive modeling and reasoning to capture the implicit relations among vehicles and the corresponding diverse behaviors. This research introduces an integrated framework for autonomous vehicles (AVs) motion prediction to address these complexities, utilizing a novel Relational Hypergraph Interaction-informed Neural mOtion generator (RHINO). RHINO leverages hypergraph-based relational reasoning by integrating a multi-scale hypergraph neural network to model group-wise interactions among multiple vehicles and their multi-modal driving behaviors, thereby enhancing motion prediction accuracy and reliability. Experimental validation using real-world datasets demonstrates the superior performance of this framework in improving predictive accuracy and fostering socially aware automated driving in dynamic traffic scenarios.

V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models

Aug 17, 2024

Junwei You, Haotian Shi, Zhuoyu Jiang, Zilin Huang, Rui Gan, Keshu Wu, Xi Cheng, Xiaopeng Li, Bin Ran

Abstract:Advancements in autonomous driving have increasingly focused on end-to-end (E2E) systems that manage the full spectrum of driving tasks, from environmental perception to vehicle navigation and control. This paper introduces V2X-VLM, an innovative E2E vehicle-infrastructure cooperative autonomous driving (VICAD) framework with large vision-language models (VLMs). V2X-VLM is designed to enhance situational awareness, decision-making, and ultimate trajectory planning by integrating data from vehicle-mounted cameras, infrastructure sensors, and textual information. The strength of the comprehensive multimodal data fusion of the VLM enables precise and safe E2E trajectory planning in complex and dynamic driving scenarios. Validation on the DAIR-V2X dataset demonstrates that V2X-VLM outperforms existing state-of-the-art methods in cooperative autonomous driving.

VLM-MPC: Vision Language Foundation Model (VLM)-Guided Model Predictive Controller (MPC) for Autonomous Driving

Aug 09, 2024

Keke Long, Haotian Shi, Jiaxi Liu, Xiaopeng Li

Abstract:Motivated by the emergent reasoning capabilities of Vision Language Models (VLMs) and their potential to improve the comprehensibility of autonomous driving systems, this paper introduces a closed-loop autonomous driving controller called VLM-MPC, which combines a VLM for high-level decision-making and a Model Predictive Controller (MPC) for low-level vehicle control. The proposed VLM-MPC system is structurally divided into two asynchronous components: an upper-level VLM and a lower-level MPC. The upper-level VLM generates driving parameters for lower-level control based on front camera images, ego vehicle state, traffic environment conditions, and reference memory. The lower-level MPC controls the vehicle in real time using these parameters, considering engine lag and providing state feedback to the entire system. Experiments based on the nuScenes dataset validated the effectiveness of the proposed VLM-MPC system across various scenarios (e.g., night, rain, intersections). Results showed that the VLM-MPC system consistently outperformed baseline models in terms of safety and driving comfort. By comparing behaviors under different weather conditions and scenarios, we demonstrated the VLM's ability to understand the environment and make reasonable inferences.

Physically Analyzable AI-Based Nonlinear Platoon Dynamics Modeling During Traffic Oscillation: A Koopman Approach

Jun 20, 2024

Kexin Tian, Haotian Shi, Yang Zhou, Sixu Li

Abstract:Given the complexity and nonlinearity inherent in traffic dynamics within vehicular platoons, there is a critical need for a modeling methodology that achieves high accuracy while remaining physically analyzable. Currently, there are two predominant approaches: the physics model-based approach and the Artificial Intelligence (AI)-based approach. Because physics-based models usually lack sufficient modeling accuracy and may suffer from function mismatches, while purely AI-based methods lack analyzability, this paper proposes an AI-based Koopman approach that models the unknown nonlinear platoon dynamics by harnessing the power of AI while maintaining physical analyzability, with a particular focus on periods of traffic oscillation. Specifically, this research first employs a deep learning framework to generate the embedding function that lifts the original space into the embedding space. Given the descriptiveness of the embedding space, the platoon dynamics can be expressed as a linear dynamical system grounded in Koopman theory. On that basis, routine linear dynamical system analysis can be conducted on the learned linear traffic dynamics in the embedding space. In this way, the physical interpretability and analyzability of model-based methods are combined with the higher precision of data-driven approaches. Comparative experiments have been conducted with existing modeling approaches, which suggest our method's superiority in accuracy. Additionally, a phase plane analysis is performed, further evidencing our approach's effectiveness in replicating the complex dynamic patterns. Moreover, the proposed methodology is shown to be capable of analyzing stability, attesting to its physical analyzability.
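
The Koopman idea in this abstract, lifting a nonlinear state into an embedding space where the dynamics become linear and can be analyzed with standard tools, can be shown on a textbook toy system. The sketch below is not a platoon model: the system, the parameters `LAM` and `MU`, and the hand-written lifting function are assumptions, whereas the paper learns the embedding with a deep network.

```python
LAM, MU = 0.9, 0.5  # assumed toy parameters, both inside the unit circle


def nonlinear_step(x1, x2):
    """Toy nonlinear discrete-time system (illustrative only)."""
    return LAM * x1, MU * x2 + (LAM**2 - MU) * x1**2


def lift(x1, x2):
    """Hand-written embedding; a learned network would play this role."""
    return [x1, x2, x1**2]


# Linear operator acting on the lifted state [x1, x2, x1^2].
K = [
    [LAM, 0.0, 0.0],
    [0.0, MU, LAM**2 - MU],
    [0.0, 0.0, LAM**2],
]


def linear_step(z):
    return [sum(K[i][j] * z[j] for j in range(3)) for i in range(3)]


def rollout(x1, x2, steps):
    """Advance both models and record the largest gap between them."""
    z = lift(x1, x2)
    max_gap = 0.0
    for _ in range(steps):
        x1, x2 = nonlinear_step(x1, x2)
        z = linear_step(z)
        max_gap = max(max_gap, abs(z[0] - x1), abs(z[1] - x2))
    return max_gap, (x1, x2)


max_gap, final_state = rollout(1.0, -2.0, steps=50)

# K is upper triangular, so its eigenvalues are its diagonal entries.
eigenvalues = [K[i][i] for i in range(3)]
stable = all(abs(e) < 1.0 for e in eigenvalues)
print(max_gap, final_state, stable)
```

Because the lifted model is linear, stability follows from the eigenvalues of `K` alone, which is the kind of routine linear analysis the abstract refers to.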

Crossfusor: A Cross-Attention Transformer Enhanced Conditional Diffusion Model for Car-Following Trajectory Prediction

Jun 17, 2024

Junwei You, Haotian Shi, Keshu Wu, Keke Long, Sicheng Fu, Sikai Chen, Bin Ran

Abstract:Vehicle trajectory prediction is crucial for advancing autonomous driving and advanced driver assistance systems (ADAS), enhancing road safety and traffic efficiency. While traditional methods have laid foundational work, modern deep learning techniques, particularly transformer-based models and generative approaches, have significantly improved prediction accuracy by capturing complex and non-linear patterns in vehicle motion and traffic interactions. However, these models often overlook the detailed car-following behaviors and inter-vehicle interactions essential for real-world driving scenarios. This study introduces a Cross-Attention Transformer Enhanced Conditional Diffusion Model (Crossfusor) specifically designed for car-following trajectory prediction. Crossfusor integrates detailed inter-vehicular interactions and car-following dynamics into a robust diffusion framework, improving both the accuracy and realism of predicted trajectories. The model leverages a novel temporal feature encoding framework combining GRU, location-based attention mechanisms, and Fourier embedding to capture historical vehicle dynamics. It employs noise scaled by these encoded historical features in the forward diffusion process, and uses a cross-attention transformer to model intricate inter-vehicle dependencies in the reverse denoising process. Experimental results on the NGSIM dataset demonstrate that Crossfusor outperforms state-of-the-art models, particularly in long-term predictions, showcasing its potential for enhancing the predictive capabilities of autonomous driving systems.
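
The forward diffusion step with noise scaled by encoded historical features can be sketched as follows. This is an illustration under stated assumptions, not the Crossfusor implementation: the linear beta schedule, the function names, and the per-element `scale` vector (standing in for whatever the history encoder produces) are all hypothetical, and the exact way the paper applies the scaling may differ.

```python
import math
import random


def alpha_bar(t, num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_i) for i < t under an assumed linear schedule."""
    prod = 1.0
    for i in range(t):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def forward_sample(x0, t, scale, rng):
    """Sample x_t from x_0, multiplying the Gaussian noise element-wise by `scale`."""
    ab = alpha_bar(t)
    return [
        math.sqrt(ab) * x + math.sqrt(1.0 - ab) * s * rng.gauss(0.0, 1.0)
        for x, s in zip(x0, scale)
    ]


rng = random.Random(0)
x0 = [10.0, 12.5, 15.2, 18.1]      # toy future positions of the following vehicle
scale = [0.5, 1.0, 1.5, 2.0]       # hypothetical history-derived noise scales
x_t = forward_sample(x0, t=50, scale=scale, rng=rng)
print(x_t)
```

With this form, elements given a larger scale are perturbed more strongly at the same diffusion step, so the history encoding shapes how quickly each part of the trajectory is noised.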

Optimizing Bus Travel: A Novel Approach to Feature Mining with P-KMEANS and P-LDA Algorithms

Dec 04, 2023

Hongjie Liu, Haotian Shi, Sicheng Fu, Tengfei Yuan, Xinhuan Zhang, Hongzhe Xu, Bin Ran

Abstract:Customizing services for bus travel can bolster its attractiveness, optimize usage, alleviate traffic congestion, and diminish carbon emissions. This potential is realized by harnessing recent advancements in positioning communication facilities, the Internet of Things, and artificial intelligence for feature mining in public transportation. However, the inherent complexities of disorganized and unstructured public transportation data introduce substantial challenges to travel feature extraction. This study presents a bus travel feature extraction method rooted in Point of Interest (POI) data, employing enhanced P-KMEANS and P-LDA algorithms to overcome these limitations. While the KMEANS algorithm adeptly segments passenger travel paths into distinct clusters, its outcomes can be influenced by the initial K value. On the other hand, Latent Dirichlet Allocation (LDA) excels at feature identification and probabilistic interpretations yet encounters difficulties with feature intermingling and nuanced sub-feature interactions. Incorporating the POI dimension enhances our understanding of travel behavior, aligning it more closely with passenger attributes and facilitating easier data analysis. By incorporating POI data, our refined P-KMEANS and P-LDA algorithms grant a holistic insight into travel behaviors and attributes, effectively mitigating the limitations above. Consequently, this POI-centric algorithm effectively amalgamates diverse POI attributes, delineates varied travel contexts, and imparts probabilistic metrics to feature properties. Our method successfully mines the diverse aspects of bus travel, such as age, occupation, gender, sports, cost, safety, and personality traits. It effectively calculates relationships between individual travel behaviors and assigns explanatory and evaluative probabilities to POI labels, thereby enhancing bus travel optimization.
