Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jinyu Miao

AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving

May 21, 2025

Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao(+12 more)

Abstract:Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. To overcome this, we introduce \textbf{AgentThink}, a pioneering unified framework that, for the first time, integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink's core innovations include: \textbf{(i) Structured Data Generation}, by establishing an autonomous driving tool library to automatically construct structured, self-verified reasoning data explicitly incorporating tool usage for diverse driving scenarios; \textbf{(ii) A Two-stage Training Pipeline}, employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and \textbf{(iii) Agent-style Tool-Usage Evaluation}, introducing a novel multi-tool assessment protocol to rigorously evaluate the model's tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate AgentThink significantly boosts overall reasoning scores by \textbf{53.91\%} and enhances answer accuracy by \textbf{33.54\%}, while markedly improving reasoning quality and consistency. Furthermore, ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks underscore its powerful capabilities. These findings highlight a promising trajectory for developing trustworthy and tool-aware autonomous driving models.

* 18 pages, 8 figures

Via

Access Paper or Ask Questions

FASIONAD++ : Integrating High-Level Instruction and Information Bottleneck in FAt-Slow fusION Systems for Enhanced Safety in Autonomous Driving with Adaptive Feedback

Mar 11, 2025

Kangan Qian, Ziang Luo, Sicong Jiang, Zilin Huang, Jinyu Miao, Zhikun Ma, Tianze Zhu, Jiayin Li, Yangfan He, Zheng Fu(+9 more)

Abstract:Ensuring safe, comfortable, and efficient planning is crucial for autonomous driving systems. While end-to-end models trained on large datasets perform well in standard driving scenarios, they struggle with complex low-frequency events. Recent Large Language Models (LLMs) and Vision Language Models (VLMs) advancements offer enhanced reasoning but suffer from computational inefficiency. Inspired by the dual-process cognitive model "Thinking, Fast and Slow", we propose $\textbf{FASIONAD}$ -- a novel dual-system framework that synergizes a fast end-to-end planner with a VLM-based reasoning module. The fast system leverages end-to-end learning to achieve real-time trajectory generation in common scenarios, while the slow system activates through uncertainty estimation to perform contextual analysis and complex scenario resolution. Our architecture introduces three key innovations: (1) A dynamic switching mechanism enabling slow system intervention based on real-time uncertainty assessment; (2) An information bottleneck with high-level plan feedback that optimizes the slow system's guidance capability; (3) A bidirectional knowledge exchange where visual prompts enhance the slow system's reasoning while its feedback refines the fast planner's decision-making. To strengthen VLM reasoning, we develop a question-answering mechanism coupled with reward-instruct training strategy. In open-loop experiments, FASIONAD achieves a $6.7\%$ reduction in average $L2$ trajectory error and $28.1\%$ lower collision rate.

* 8 pages, 4 figures

Via

Access Paper or Ask Questions

LEGO-Motion: Learning-Enhanced Grids with Occupancy Instance Modeling for Class-Agnostic Motion Prediction

Mar 10, 2025

Kangan Qian, Jinyu Miao, Ziang Luo, Zheng Fu, and Jinchen Li, Yining Shi, Yunlong Wang, Kun Jiang, Mengmeng Yang, Diange Yang

Abstract:Accurate and reliable spatial and motion information plays a pivotal role in autonomous driving systems. However, object-level perception models struggle with handling open scenario categories and lack precise intrinsic geometry. On the other hand, occupancy-based class-agnostic methods excel in representing scenes but fail to ensure physics consistency and ignore the importance of interactions between traffic participants, hindering the model's ability to learn accurate and reliable motion. In this paper, we introduce a novel occupancy-instance modeling framework for class-agnostic motion prediction tasks, named LEGO-Motion, which incorporates instance features into Bird's Eye View (BEV) space. Our model comprises (1) a BEV encoder, (2) an Interaction-Augmented Instance Encoder, and (3) an Instance-Enhanced BEV Encoder, improving both interaction relationships and physics consistency within the model, thereby ensuring a more accurate and robust understanding of the environment. Extensive experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches. Furthermore, the effectiveness of our framework is validated on the advanced FMCW LiDAR benchmark, showcasing its practical applicability and generalization capabilities. The code will be made publicly available to facilitate further research.

* 8 pages, 4 figures

Via

Access Paper or Ask Questions

Residual Learning towards High-fidelity Vehicle Dynamics Modeling with Transformer

Feb 17, 2025

Jinyu Miao, Rujun Yan, Bowei Zhang, Tuopu Wen, Kun Jiang, Mengmeng Yang, Jin Huang, Zhihua Zhong, Diange Yang

Abstract:The vehicle dynamics model serves as a vital component of autonomous driving systems, as it describes the temporal changes in vehicle state. In a long period, researchers have made significant endeavors to accurately model vehicle dynamics. Traditional physics-based methods employ mathematical formulae to model vehicle dynamics, but they are unable to adequately describe complex vehicle systems due to the simplifications they entail. Recent advancements in deep learning-based methods have addressed this limitation by directly regressing vehicle dynamics. However, the performance and generalization capabilities still require further enhancement. In this letter, we address these problems by proposing a vehicle dynamics correction system that leverages deep neural networks to correct the state residuals of a physical model instead of directly estimating the states. This system greatly reduces the difficulty of network learning and thus improves the estimation accuracy of vehicle dynamics. Furthermore, we have developed a novel Transformer-based dynamics residual correction network, DyTR. This network implicitly represents state residuals as high-dimensional queries, and iteratively updates the estimated residuals by interacting with dynamics state features. The experiments in simulations demonstrate the proposed system works much better than physics model, and our proposed DyTR model achieves the best performances on dynamics state residual correction task, reducing the state prediction errors of a simple 3 DoF vehicle model by an average of 92.3% and 59.9% in two dataset, respectively.

* 8 pages, 4 figures, 5 tables

Via

Access Paper or Ask Questions

Autonomous Driving in Unstructured Environments: How Far Have We Come?

Oct 10, 2024

Chen Min, Shubin Si, Xu Wang, Hanzhang Xue, Weizhong Jiang, Yang Liu, Juan Wang, Qingtian Zhu, Qi Zhu, Lun Luo(+27 more)

Figure 1 for Autonomous Driving in Unstructured Environments: How Far Have We Come?

Figure 2 for Autonomous Driving in Unstructured Environments: How Far Have We Come?

Figure 3 for Autonomous Driving in Unstructured Environments: How Far Have We Come?

Figure 4 for Autonomous Driving in Unstructured Environments: How Far Have We Come?

Abstract:Research on autonomous driving in unstructured outdoor environments is less advanced than in structured urban settings due to challenges like environmental diversities and scene complexity. These environments-such as rural areas and rugged terrains-pose unique obstacles that are not common in structured urban areas. Despite these difficulties, autonomous driving in unstructured outdoor environments is crucial for applications in agriculture, mining, and military operations. Our survey reviews over 250 papers for autonomous driving in unstructured outdoor environments, covering offline mapping, pose estimation, environmental perception, path planning, end-to-end autonomous driving, datasets, and relevant challenges. We also discuss emerging trends and future research directions. This review aims to consolidate knowledge and encourage further research for autonomous driving in unstructured environments. To support ongoing work, we maintain an active repository with up-to-date literature and open-source projects at: https://github.com/chaytonmin/Survey-Autonomous-Driving-in-Unstructured-Environments.

* Survey paper; 38 pages

Via

Access Paper or Ask Questions

A Survey on Monocular Re-Localization: From the Perspective of Scene Map Representation

Nov 27, 2023

Jinyu Miao, Kun Jiang, Tuopu Wen, Yunlong Wang, Peijing Jia, Xuhe Zhao, Zhongyang Xiao, Jin Huang, Zhihua Zhong, Diange Yang

Abstract:Monocular Re-Localization (MRL) is a critical component in numerous autonomous applications, which estimates 6 degree-of-freedom poses with regards to the scene map based on a single monocular image. In recent decades, significant progress has been made in the development of MRL techniques. Numerous landmark algorithms have accomplished extraordinary success in terms of localization accuracy and robustness against visual interference. In MRL research, scene maps are represented in various forms, and they determine how MRL methods work and even how MRL methods perform. However, to the best of our knowledge, existing surveys do not provide systematic reviews of MRL from the respective of map. This survey fills the gap by comprehensively reviewing MRL methods employing monocular cameras as main sensors, promoting further research. 1) We commence by delving into the problem definition of MRL and exploring current challenges, while also comparing ours with with previous published surveys. 2) MRL methods are then categorized into five classes according to the representation forms of utilized map, i.e., geo-tagged frames, visual landmarks, point clouds, and vectorized semantic map, and we review the milestone MRL works of each category. 3) To quantitatively and fairly compare MRL methods with various map, we also review some public datasets and provide the performances of some typical MRL methods. The strengths and weakness of different types of MRL methods are analyzed. 4) We finally introduce some topics of interest in this field and give personal opinions. This survey can serve as a valuable referenced materials for newcomers and researchers interested in MRL, and a continuously updated summary of this survey, including reviewed papers and datasets, is publicly available to the community at: https://github.com/jinyummiao/map-in-mono-reloc.

* 108 pages, 9 tables, 17 figures, under review

Via

Access Paper or Ask Questions

Poses as Queries: Image-to-LiDAR Map Localization with Transformers

May 07, 2023

Jinyu Miao, Kun Jiang, Yunlong Wang, Tuopu Wen, Zhongyang Xiao, Zheng Fu, Mengmeng Yang, Maolin Liu, Diange Yang

Figure 1 for Poses as Queries: Image-to-LiDAR Map Localization with Transformers

Figure 2 for Poses as Queries: Image-to-LiDAR Map Localization with Transformers

Figure 3 for Poses as Queries: Image-to-LiDAR Map Localization with Transformers

Figure 4 for Poses as Queries: Image-to-LiDAR Map Localization with Transformers

Abstract:High-precision vehicle localization with commercial setups is a crucial technique for high-level autonomous driving tasks. Localization with a monocular camera in LiDAR map is a newly emerged approach that achieves promising balance between cost and accuracy, but estimating pose by finding correspondences between such cross-modal sensor data is challenging, thereby damaging the localization accuracy. In this paper, we address the problem by proposing a novel Transformer-based neural network to register 2D images into 3D LiDAR map in an end-to-end manner. Poses are implicitly represented as high-dimensional feature vectors called pose queries and can be iteratively updated by interacting with the retrieved relevant information from cross-model features using attention mechanism in a proposed POse Estimator Transformer (POET) module. Moreover, we apply a multiple hypotheses aggregation method that estimates the final poses by performing parallel optimization on multiple randomly initialized pose queries to reduce the network uncertainty. Comprehensive analysis and experimental results on public benchmark conclude that the proposed image-to-LiDAR map localization network could achieve state-of-the-art performances in challenging cross-modal localization tasks.

* 8 pages, 3 figures, 4 tables

Via

Access Paper or Ask Questions

Real-time Local Feature with Global Visual Information Enhancement

Nov 20, 2022

Jinyu Miao, Haosong Yue, Zhong Liu, Xingming Wu, Zaojun Fang, Guilin Yang

Figure 1 for Real-time Local Feature with Global Visual Information Enhancement

Figure 2 for Real-time Local Feature with Global Visual Information Enhancement

Figure 3 for Real-time Local Feature with Global Visual Information Enhancement

Figure 4 for Real-time Local Feature with Global Visual Information Enhancement

Abstract:Local feature provides compact and invariant image representation for various visual tasks. Current deep learning-based local feature algorithms always utilize convolution neural network (CNN) architecture with limited receptive field. Besides, even with high-performance GPU devices, the computational efficiency of local features cannot be satisfactory. In this paper, we tackle such problems by proposing a CNN-based local feature algorithm. The proposed method introduces a global enhancement module to fuse global visual clues in a light-weight network, and then optimizes the network by novel deep reinforcement learning scheme from the perspective of local feature matching task. Experiments on the public benchmarks demonstrate that the proposal can achieve considerable robustness against visual interference and meanwhile run in real time.

* 6 pages, 5 figures, 2 tables. Accepted by ICIEA 2022

Via

Access Paper or Ask Questions

ALIKE: Accurate and Lightweight Keypoint Detection and Descriptor Extraction

Dec 06, 2021

Xiaoming Zhao, Xingming Wu, Jinyu Miao, Weihai Chen, Peter C. Y. Chen, Zhengguo Li

Figure 1 for ALIKE: Accurate and Lightweight Keypoint Detection and Descriptor Extraction

Figure 2 for ALIKE: Accurate and Lightweight Keypoint Detection and Descriptor Extraction

Figure 3 for ALIKE: Accurate and Lightweight Keypoint Detection and Descriptor Extraction

Figure 4 for ALIKE: Accurate and Lightweight Keypoint Detection and Descriptor Extraction

Abstract:Existing methods detect the keypoints in a non-differentiable way, therefore they can not directly optimize the position of keypoints through back-propagation. To address this issue, we present a differentiable keypoint detection module, which outputs accurate sub-pixel keypoints. The reprojection loss is then proposed to directly optimize these sub-pixel keypoints, and the dispersity peak loss is presented for accurate keypoints regularization. We also extract the descriptors in a sub-pixel way, and they are trained with the stable neural reprojection error loss. Moreover, a lightweight network is designed for keypoint detection and descriptor extraction, which can run at 95 frames per second for 640x480 images on a commercial GPU. On homography estimation, camera pose estimation, and visual (re-)localization tasks, the proposed method achieves equivalent performance with the state-of-the-art approaches, while greatly reduces the inference time.

* 10 pages, 10 figures

Via

Access Paper or Ask Questions

Automatic Vocabulary and Graph Verification for Accurate Loop Closure Detection

Jul 30, 2021

Haosong Yue, Jinyu Miao, Weihai Chen, Wei Wang, Fanghong Guo, Zhengguo Li

Figure 1 for Automatic Vocabulary and Graph Verification for Accurate Loop Closure Detection

Figure 2 for Automatic Vocabulary and Graph Verification for Accurate Loop Closure Detection

Figure 3 for Automatic Vocabulary and Graph Verification for Accurate Loop Closure Detection

Figure 4 for Automatic Vocabulary and Graph Verification for Accurate Loop Closure Detection

Abstract:Localizing pre-visited places during long-term simultaneous localization and mapping, i.e. loop closure detection (LCD), is a crucial technique to correct accumulated inconsistencies. As one of the most effective and efficient solutions, Bag-of-Words (BoW) builds a visual vocabulary to associate features and then detect loops. Most existing approaches that build vocabularies off-line determine scales of the vocabulary by trial-and-error, which often results in unreasonable feature association. Moreover, the accuracy of the algorithm usually declines due to perceptual aliasing, as the BoW-based method ignores the positions of visual features. To overcome these disadvantages, we propose a natural convergence criterion based on the comparison between the radii of nodes and the drifts of feature descriptors, which is then utilized to build the optimal vocabulary automatically. Furthermore, we present a novel topological graph verification method for validating candidate loops so that geometrical positions of the words can be involved with a negligible increase in complexity, which can significantly improve the accuracy of LCD. Experiments on various public datasets and comparisons against several state-of-the-art algorithms verify the performance of our proposed approach.

* 11 pages, 9 figures

Via

Access Paper or Ask Questions