Abstract:Trajectory optimization in multi-vehicle scenarios faces challenges due to its non-linear, non-convex properties and sensitivity to initial values, making interactions between vehicles difficult to control. In this paper, inspired by topological planning, we propose a differentiable local homotopy invariant metric to model the interactions. By incorporating this topological metric as a constraint into multi-vehicle trajectory optimization, our framework is capable of generating multiple interactive trajectories from the same initial values, achieving controllable interactions as well as supporting user-designed interaction patterns. Extensive experiments demonstrate its superior optimality and efficiency over existing methods. We will release open-source code to advance relative research.
Abstract:Achieving large-scale aerial swarms is challenging due to the inherent contradictions in balancing computational efficiency and scalability. This paper introduces Primitive-Swarm, an ultra-lightweight and scalable planner designed specifically for large-scale autonomous aerial swarms. The proposed approach adopts a decentralized and asynchronous replanning strategy. Within it is a novel motion primitive library consisting of time-optimal and dynamically feasible trajectories. They are generated utlizing a novel time-optimial path parameterization algorithm based on reachability analysis (TOPP-RA). Then, a rapid collision checking mechanism is developed by associating the motion primitives with the discrete surrounding space according to conflicts. By considering both spatial and temporal conflicts, the mechanism handles robot-obstacle and robot-robot collisions simultaneously. Then, during a replanning process, each robot selects the safe and minimum cost trajectory from the library based on user-defined requirements. Both the time-optimal motion primitive library and the occupancy information are computed offline, turning a time-consuming optimization problem into a linear-complexity selection problem. This enables the planner to comprehensively explore the non-convex, discontinuous 3-D safe space filled with numerous obstacles and robots, effectively identifying the best hidden path. Benchmark comparisons demonstrate that our method achieves the shortest flight time and traveled distance with a computation time of less than 1 ms in dense environments. Super large-scale swarm simulations, involving up to 1000 robots, running in real-time, verify the scalability of our method. Real-world experiments validate the feasibility and robustness of our approach. The code will be released to foster community collaboration.
Abstract:It is a significant requirement for a quadrotor trajectory planner to simultaneously guarantee trajectory quality and system lightweight. Many researchers focus on this problem, but there's still a gap between their performance and our common wish. In this paper, we propose an ultra lightweight quadrotor planner with time-optimal primitives. Firstly, a novel motion primitive library is proposed to generate time-optimal and dynamical feasible trajectories offline. Secondly, we propose a fast collision checking method with a deterministic time consumption, independent of the sampling resolution of the primitives. Finally, we select the minimum cost trajectory to execute among the safe primitives based on user-defined requirements. The propsed transformation relation between the local trajectories ensures the smoothness of the global trajectory. The planner reduces unnecessary online computing power consumption as much as possible, while ensuring a high-quality trajectory. Benchmark comparisons show that our method can generate the shortest flight time and distance of trajectory with the lowest computation overload. Challenging real-world experiments validate the robustness of our method.
Abstract:VINGS-Mono is a monocular (inertial) Gaussian Splatting (GS) SLAM framework designed for large scenes. The framework comprises four main components: VIO Front End, 2D Gaussian Map, NVS Loop Closure, and Dynamic Eraser. In the VIO Front End, RGB frames are processed through dense bundle adjustment and uncertainty estimation to extract scene geometry and poses. Based on this output, the mapping module incrementally constructs and maintains a 2D Gaussian map. Key components of the 2D Gaussian Map include a Sample-based Rasterizer, Score Manager, and Pose Refinement, which collectively improve mapping speed and localization accuracy. This enables the SLAM system to handle large-scale urban environments with up to 50 million Gaussian ellipsoids. To ensure global consistency in large-scale scenes, we design a Loop Closure module, which innovatively leverages the Novel View Synthesis (NVS) capabilities of Gaussian Splatting for loop closure detection and correction of the Gaussian map. Additionally, we propose a Dynamic Eraser to address the inevitable presence of dynamic objects in real-world outdoor scenes. Extensive evaluations in indoor and outdoor environments demonstrate that our approach achieves localization performance on par with Visual-Inertial Odometry while surpassing recent GS/NeRF SLAM methods. It also significantly outperforms all existing methods in terms of mapping and rendering quality. Furthermore, we developed a mobile app and verified that our framework can generate high-quality Gaussian maps in real time using only a smartphone camera and a low-frequency IMU sensor. To the best of our knowledge, VINGS-Mono is the first monocular Gaussian SLAM method capable of operating in outdoor environments and supporting kilometer-scale large scenes.
Abstract:Unmanned aerial vehicle object detection (UAV-OD) has been widely used in various scenarios. However, most existing UAV-OD algorithms rely on manually designed components, which require extensive tuning. End-to-end models that do not depend on such manually designed components are mainly designed for natural images, which are less effective for UAV imagery. To address such challenges, this paper proposes an efficient detection transformer (DETR) framework tailored for UAV imagery, i.e., UAV-DETR. The framework includes a multi-scale feature fusion with frequency enhancement module, which captures both spatial and frequency information at different scales. In addition, a frequency-focused down-sampling module is presented to retain critical spatial details during down-sampling. A semantic alignment and calibration module is developed to align and fuse features from different fusion paths. Experimental results demonstrate the effectiveness and generalization of our approach across various UAV imagery datasets. On the VisDrone dataset, our method improves AP by 3.1\% and $\text{AP}_{50}$ by 4.2\% over the baseline. Similar enhancements are observed on the UAVVaste dataset. The project page: https://github.com/ValiantDiligent/UAV-DETR
Abstract:Planning safe trajectories for autonomous vehicles is essential for operational safety but remains extremely challenging due to the complex interactions among traffic participants. Recent autonomous driving frameworks have focused on improving prediction accuracy to explicitly model these interactions. However, some methods overlook the significant influence of the ego vehicle's planning on the possible trajectories of other agents, which can alter prediction accuracy and lead to unsafe planning decisions. In this paper, we propose a novel motion Planning approach by Simulation with learning-based parallel scenario prediction (PS). PS deduces predictions iteratively based on Monte Carlo Tree Search (MCTS), jointly inferring scenarios that cooperate with the ego vehicle's planning set. Our method simulates possible scenes and calculates their costs after the ego vehicle executes potential actions. To balance and prune unreasonable actions and scenarios, we adopt MCTS as the foundation to explore possible future interactions encoded within the prediction network. Moreover, the query-centric trajectory prediction streamlines our scene generation, enabling a sophisticated framework that captures the mutual influence between other agents' predictions and the ego vehicle's planning. We evaluate our framework on the Argoverse 2 dataset, and the results demonstrate that our approach effectively achieves parallel ego vehicle planning.
Abstract:Convolutional neural networks (CNNs) and vision transformers (ViTs) have become essential in computer vision for local and global feature extraction. However, aggregating these architectures in existing methods often results in inefficiencies. To address this, the CNN-Transformer Aggregation Network (CTA-Net) was developed. CTA-Net combines CNNs and ViTs, with transformers capturing long-range dependencies and CNNs extracting localized features. This integration enables efficient processing of detailed local and broader contextual information. CTA-Net introduces the Light Weight Multi-Scale Feature Fusion Multi-Head Self-Attention (LMF-MHSA) module for effective multi-scale feature integration with reduced parameters. Additionally, the Reverse Reconstruction CNN-Variants (RRCV) module enhances the embedding of CNNs within the transformer architecture. Extensive experiments on small-scale datasets with fewer than 100,000 samples show that CTA-Net achieves superior performance (TOP-1 Acc 86.76\%), fewer parameters (20.32M), and greater efficiency (FLOPs 2.83B), making it a highly efficient and lightweight solution for visual tasks on small-scale datasets (fewer than 100,000).
Abstract:Occlusion-aware decision-making is essential in autonomous driving due to the high uncertainty of various occlusions. Recent occlusion-aware decision-making methods encounter issues such as high computational complexity, scenario scalability challenges, or reliance on limited expert data. Benefiting from automatically generating data by exploration randomization, we uncover that reinforcement learning (RL) may show promise in occlusion-aware decision-making. However, previous occlusion-aware RL faces challenges in expanding to various dynamic and static occlusion scenarios, low learning efficiency, and lack of predictive ability. To address these issues, we introduce Pad-AI, a self-reinforcing framework to learn occlusion-aware decision-making through active perception. Pad-AI utilizes vectorized representation to represent occluded environments efficiently and learns over the semantic motion primitives to focus on high-level active perception exploration. Furthermore, Pad-AI integrates prediction and RL within a unified framework to provide risk-aware learning and security guarantees. Our framework was tested in challenging scenarios under both dynamic and static occlusions and demonstrated efficient and general perception-aware exploration performance to other strong baselines in closed-loop evaluations.
Abstract:In complex missions such as search and rescue,robots must make intelligent decisions in unknown environments, relying on their ability to perceive and understand their surroundings. High-quality and real-time reconstruction enhances situational awareness and is crucial for intelligent robotics. Traditional methods often struggle with poor scene representation or are too slow for real-time use. Inspired by the efficacy of 3D Gaussian Splatting (3DGS), we propose a hierarchical planning framework for fast and high-fidelity active reconstruction. Our method evaluates completion and quality gain to adaptively guide reconstruction, integrating global and local planning for efficiency. Experiments in simulated and real-world environments show our approach outperforms existing real-time methods.
Abstract:The rise of multi-modal large language models(MLLMs) has spurred their applications in autonomous driving. Recent MLLM-based methods perform action by learning a direct mapping from perception to action, neglecting the dynamics of the world and the relations between action and world dynamics. In contrast, human beings possess world model that enables them to simulate the future states based on 3D internal visual representation and plan actions accordingly. To this end, we propose OccLLaMA, an occupancy-language-action generative world model, which uses semantic occupancy as a general visual representation and unifies vision-language-action(VLA) modalities through an autoregressive model. Specifically, we introduce a novel VQVAE-like scene tokenizer to efficiently discretize and reconstruct semantic occupancy scenes, considering its sparsity and classes imbalance. Then, we build a unified multi-modal vocabulary for vision, language and action. Furthermore, we enhance LLM, specifically LLaMA, to perform the next token/scene prediction on the unified vocabulary to complete multiple tasks in autonomous driving. Extensive experiments demonstrate that OccLLaMA achieves competitive performance across multiple tasks, including 4D occupancy forecasting, motion planning, and visual question answering, showcasing its potential as a foundation model in autonomous driving.