Space Center, Skolkovo Institute of Science and Technology
Abstract: This paper presents CONTHER, a novel reinforcement learning algorithm designed to train robotic agents efficiently and rapidly for goal-oriented manipulation and obstacle avoidance. The algorithm uses a modified replay buffer, inspired by Hindsight Experience Replay (HER), to artificially populate experience with successful trajectories, effectively addressing sparse-reward scenarios and eliminating the need to manually collect expert demonstrations. The algorithm also employs a Transformer-based architecture to incorporate the context of previous states, allowing the agent to perform a deeper analysis and make decisions in a manner more akin to human learning. The effectiveness of the built-in replay buffer, which acts as an "internal demonstrator", is twofold: it accelerates learning and allows the algorithm to adapt to different tasks. Empirical results confirm the algorithm's superiority by an average of 38.46% over the other methods considered, and by 28.21% over the strongest baseline, showing higher success rates and faster convergence in the point-reaching task. Because control is performed directly through the robot's joints, the algorithm can be adapted to a real robot system and extended to obstacle avoidance; accordingly, it was also tested on tasks requiring following a complex dynamic trajectory and avoiding obstacles. The design of the algorithm ensures applicability to a wide range of goal-oriented tasks, making it an easily integrated solution for real-world robotics applications.
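The abstract describes a HER-inspired replay buffer that relabels stored transitions with goals the agent actually reached, acting as an "internal demonstrator". Below is a minimal sketch of that general hindsight-relabeling idea, not the CONTHER implementation; the transition layout, the "future" goal-sampling strategy, and the reward function are illustrative assumptions.

```python
import numpy as np

def her_relabel(episode, reward_fn, k_future=4, rng=np.random.default_rng()):
    """Augment an episode with hindsight goals (generic HER 'future' strategy).

    episode: list of dicts with keys 'obs', 'action', 'next_obs',
             'achieved_goal' (goal state reached after the action), 'desired_goal'.
    reward_fn(achieved_goal, desired_goal) -> float, e.g. 0.0 if reached else -1.0.
    Returns the original transitions plus relabeled copies whose desired goal
    is a goal actually achieved later in the same episode.
    """
    relabeled = []
    T = len(episode)
    for t, tr in enumerate(episode):
        relabeled.append(tr)  # keep the original transition
        # sample up to k goals achieved at this or later timesteps of the episode
        future_idx = rng.integers(t, T, size=min(k_future, T - t))
        for idx in future_idx:
            new_goal = episode[idx]['achieved_goal']
            new_tr = dict(tr)
            new_tr['desired_goal'] = new_goal
            new_tr['reward'] = reward_fn(tr['achieved_goal'], new_goal)
            relabeled.append(new_tr)
    return relabeled
```

Because the relabeled copies are rewarded against goals that were in fact reached, the buffer always contains "successful" trajectories even when the original task reward is sparse, which is the effect the abstract attributes to its internal demonstrator.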
Abstract: We introduce AttentionSwarm, a novel benchmark designed to evaluate safe and efficient swarm control across three challenging environments: a landing environment with obstacles, a competitive drone game setting, and a dynamic drone racing scenario. Central to our approach is the Attention Model Based Control Barrier Function (CBF) framework, which integrates attention mechanisms with safety-critical control theory to enable real-time collision avoidance and trajectory optimization. The framework dynamically prioritizes critical obstacles and agents in the swarm's vicinity using attention weights, while CBFs formally guarantee safety by enforcing collision-free constraints. The safe attention net algorithm was developed and evaluated on a swarm of Crazyflie 2.1 micro quadrotors, tested indoors with a Vicon motion capture system for precise localization and control. Experimental results show that our system achieves a landing accuracy of 3.02 cm with a mean time of 23 s and collision-free landings in the dynamic landing environment, a 100% success rate with collision-free navigation in the drone game environment, and a 95% success rate with collision-free navigation in the dynamic multi-agent drone racing environment, underscoring its effectiveness and robustness in real-world scenarios. This work offers a promising foundation for applications in dynamic environments where safety and speed are paramount.
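For background on the formal guarantee mentioned above, the standard control barrier function condition (a textbook formulation, not necessarily the exact constraint used in AttentionSwarm) for control-affine dynamics $\dot{x} = f(x) + g(x)u$ and a safe set $\{x : h(x) \ge 0\}$ is:

```latex
% Standard CBF condition and safety filter (textbook form, assumed notation):
% the controller chooses u so the barrier h never decreases faster than -alpha(h).
L_f h(x) + L_g h(x)\,u \;\ge\; -\alpha\big(h(x)\big),
\qquad
u^{*}(x) \;=\; \arg\min_{u \in \mathcal{U}} \;\|u - u_{\mathrm{nom}}(x)\|^{2}
\;\;\text{s.t.}\;\; L_f h(x) + L_g h(x)\,u \ge -\alpha\big(h(x)\big),
```

where $\alpha$ is an extended class-$\mathcal{K}$ function and $u_{\mathrm{nom}}$ is the nominal control (here, presumably the attention-based policy). Enforcing this constraint at every step renders the safe set forward invariant, which is the collision-free guarantee the abstract refers to.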
Abstract: This paper tackles decentralized continuous task allocation in heterogeneous multi-agent systems. We present HIPPO-MAT, a novel framework that integrates a graph neural network (GNN) with a GraphSAGE architecture, computing independent embeddings on each agent, with Independent Proximal Policy Optimization (IPPO) for multi-agent deep reinforcement learning. In our system, unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) share aggregated observation data via communication channels while independently processing these inputs to generate enriched state embeddings. This design enables dynamic, cost-optimal, conflict-aware task allocation in a 3D grid environment without centralized coordination. A modified A* path planner is incorporated for efficient routing and collision avoidance. Simulation experiments demonstrate scalability with up to 30 agents, and preliminary real-world validation on JetBot ROS AI Robots, each running its model on a Jetson Nano and communicating via the ESP-NOW protocol on ESP32-S3 modules, confirms the practical viability of the approach, which incorporates simultaneous localization and mapping (SLAM). Experimental results show that our method achieves a high 92.5% conflict-free success rate, with only a 16.49% performance gap relative to the centralized Hungarian method, while outperforming a decentralized heuristic baseline based on a greedy approach. Additionally, the framework remains scalable to 30 agents, with an allocation processing time of 0.32 simulation steps, and is robust in responding to dynamically generated tasks.
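The abstract states that each agent computes its own embedding with a GraphSAGE-style GNN over shared observations. The sketch below shows the generic GraphSAGE mean-aggregation update in plain NumPy; the layer sizes, feature contents, and single-layer depth are illustrative assumptions, not the HIPPO-MAT architecture.

```python
import numpy as np

def graphsage_layer(H, adj, W_self, W_neigh):
    """One GraphSAGE layer with mean aggregation.

    H:        (N, d_in) node features (one row per agent/task node).
    adj:      (N, N) binary adjacency from the communication graph.
    W_self:   (d_in, d_out) weight applied to the node's own features.
    W_neigh:  (d_in, d_out) weight applied to the mean of neighbor features.
    """
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)      # avoid divide-by-zero
    neigh_mean = (adj @ H) / deg                           # mean over neighbors
    H_new = H @ W_self + neigh_mean @ W_neigh              # combine self + neighborhood
    H_new = np.maximum(H_new, 0.0)                         # ReLU
    # l2-normalize rows, as in the original GraphSAGE formulation
    return H_new / np.linalg.norm(H_new, axis=1, keepdims=True).clip(min=1e-8)

# toy usage: 4 agents, 8-dim observations, 16-dim embeddings
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))
adj = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], dtype=float)
emb = graphsage_layer(H, adj, rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
```

Each agent would run such an update locally on the observations it receives, so no central node ever needs the full graph, which is what makes the allocation decentralized.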
Abstract: RaceVLA presents an innovative approach to autonomous racing-drone navigation by leveraging Visual-Language-Action (VLA) models to emulate human-like behavior. This research explores the integration of advanced algorithms that enable drones to adapt their navigation strategies based on real-time environmental feedback, mimicking the decision-making processes of human pilots. The model, fine-tuned on a collected racing-drone dataset, demonstrates strong generalization despite the complexity of drone racing environments. RaceVLA outperforms OpenVLA in motion (75.0 vs 60.0) and semantic generalization (45.5 vs 36.3), benefiting from the dynamic camera and simplified motion tasks. However, visual (79.6 vs 87.0) and physical (50.0 vs 76.7) generalization were slightly reduced due to the challenges of maneuvering in dynamic environments with varying object sizes. RaceVLA also outperforms RT-2 across all axes: visual (79.6 vs 52.0), motion (75.0 vs 55.0), physical (50.0 vs 26.7), and semantic (45.5 vs 38.8), demonstrating its robustness for real-time adjustments in complex environments. Experiments revealed an average velocity of 1.04 m/s, with a maximum speed of 2.02 m/s, and consistent maneuverability, demonstrating RaceVLA's ability to handle high-speed scenarios effectively. These findings highlight the potential of RaceVLA for high-performance navigation in competitive racing contexts. The RaceVLA codebase, pretrained weights, and dataset are available at https://racevla.github.io/
Abstract: Swarm robotics plays a crucial role in enabling autonomous operations in dynamic and unpredictable environments. However, a major challenge remains: ensuring safe and efficient navigation in environments filled with both dynamic living obstacles (e.g., humans) and dynamic inanimate obstacles (e.g., moving non-living objects). In this paper, we propose ImpedanceGPT, a novel system that combines a Vision-Language Model (VLM) with retrieval-augmented generation (RAG) to enable real-time reasoning for adaptive navigation of mini-drone swarms in complex environments. The key innovation of ImpedanceGPT lies in the integration of VLM and RAG, which provides the drones with enhanced semantic understanding of their surroundings. This enables the system to dynamically adjust impedance control parameters in response to obstacle types and environmental conditions. Our approach not only ensures safe and precise navigation but also improves coordination between drones in the swarm. Experimental evaluations demonstrate the effectiveness of the system. The VLM-RAG framework achieved an obstacle detection and retrieval accuracy of 80% under optimal lighting. In static environments, drones navigated around inanimate obstacles at 1.4 m/s but slowed to 0.7 m/s and increased separation around humans. In dynamic environments, speed was adjusted to 1.0 m/s near hard obstacles and reduced to 0.6 m/s with greater deflection to safely avoid moving humans.
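The abstract says the VLM-RAG module adjusts impedance control parameters per obstacle type. Below is a minimal sketch of a standard second-order impedance law with stiffness and damping looked up per obstacle class; the specific gains and class names are illustrative assumptions, not ImpedanceGPT's tuned values.

```python
import numpy as np

# Illustrative (assumed) impedance profiles keyed by obstacle class:
# softer, more compliant behavior near humans, stiffer tracking near inanimate obstacles.
IMPEDANCE_PROFILES = {
    "human":     {"m": 1.0, "d": 8.0, "k": 10.0},
    "inanimate": {"m": 1.0, "d": 4.0, "k": 40.0},
}

def impedance_step(x, v, x_ref, f_ext, obstacle_class, dt=0.02):
    """One Euler step of m*x'' + d*x' + k*(x - x_ref) = f_ext
    (reference treated as quasi-static); returns updated position and velocity."""
    p = IMPEDANCE_PROFILES[obstacle_class]
    acc = (f_ext - p["d"] * v - p["k"] * (x - x_ref)) / p["m"]
    v = v + acc * dt
    x = x + v * dt
    return x, v

# toy usage: a repulsive force pushes the drone off its reference near a human
x, v = np.zeros(2), np.zeros(2)
x, v = impedance_step(x, v, x_ref=np.zeros(2),
                      f_ext=np.array([0.5, 0.0]), obstacle_class="human")
```

Swapping the profile at runtime based on the VLM's obstacle classification is the mechanism by which speed and separation would change around humans versus hard obstacles.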
Abstract: The UAV-VLPA* (Visual-Language-Planning-and-Action) system represents a cutting-edge advancement in aerial robotics, designed to enhance communication and operational efficiency for unmanned aerial vehicles (UAVs). By integrating advanced planning capabilities, the system addresses the Traveling Salesman Problem (TSP) to optimize flight paths, reducing total trajectory length by 18.5% compared to traditional methods. Additionally, the incorporation of the A* algorithm enables robust obstacle avoidance, ensuring safe and efficient navigation in complex environments. The system combines satellite imagery processing with a Visual Language Model (VLM) and GPT's natural language processing capabilities, allowing users to generate detailed flight plans through simple text commands. This seamless fusion of visual and linguistic analysis enables precise decision-making and mission planning, making UAV-VLPA* a transformative tool for modern aerial operations. With its operational efficiency, navigational safety, and user-friendly functionality, UAV-VLPA* sets a new standard in autonomous aerial robotics, paving the way for future innovations in the field.
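The abstract credits the A* algorithm for obstacle avoidance along the TSP-ordered waypoints. For reference, a compact generic grid A* (4-connected, Manhattan heuristic) is sketched below; the grid representation and unit step costs are assumptions and not the UAV-VLPA* planner itself.

```python
import heapq

def astar(grid, start, goal):
    """A* on a 2D occupancy grid (0 = free, 1 = obstacle), 4-connected moves,
    Manhattan-distance heuristic. Returns a list of cells from start to goal, or None."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    open_set = [(h(start), 0, start, None)]   # (f, g, node, parent)
    came_from, g_cost = {}, {start: 0}
    while open_set:
        _, g, cur, parent = heapq.heappop(open_set)
        if cur in came_from:                   # already expanded with a better cost
            continue
        came_from[cur] = parent
        if cur == goal:                        # reconstruct path back to start
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0
                    and g + 1 < g_cost.get(nxt, float("inf"))):
                g_cost[nxt] = g + 1
                heapq.heappush(open_set, (g + 1 + h(nxt), g + 1, nxt, cur))
    return None
```

In a pipeline like the one described, such a planner would be run between consecutive waypoints of the TSP tour to keep each leg collision-free.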
Abstract: Emergency search and rescue (SAR) operations often require rapid and precise target identification in complex environments where traditional manual drone control is inefficient. To address these scenarios, this research develops a rapid SAR system, UAV-VLRR (Vision-Language-Rapid-Response). The system consists of two components: 1) a multimodal system that harnesses a Visual Language Model (VLM) and the natural language processing capabilities of ChatGPT-4o (LLM) for scene interpretation; and 2) a nonlinear model predictive controller (NMPC) with built-in obstacle avoidance that lets the drone fly rapidly according to the output of the multimodal system. This work aims to improve response times in emergency SAR operations by giving the operator a more intuitive and natural way to plan the SAR mission, while allowing the drone to carry out that mission rapidly and safely. In testing, our approach was faster on average by 33.75% compared with an off-the-shelf autopilot and by 54.6% compared with a human pilot. Video of UAV-VLRR: https://youtu.be/KJqQGKKt1xY
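For context, a generic receding-horizon NMPC formulation with obstacle-avoidance constraints (a standard textbook form, not necessarily the exact cost and constraints used in UAV-VLRR) is:

```latex
% Generic NMPC with obstacle constraints (standard form, assumed notation):
\min_{u_{0:N-1}} \;\sum_{k=0}^{N-1} \Big( \|x_k - x_k^{\mathrm{ref}}\|_Q^2 + \|u_k\|_R^2 \Big)
 + \|x_N - x_N^{\mathrm{ref}}\|_P^2
\quad \text{s.t.} \quad
x_{k+1} = f(x_k, u_k), \;\;
u_k \in \mathcal{U}, \;\;
\|p(x_k) - p_{\mathrm{obs},j}\| \ge r_{\mathrm{safe}} \;\; \forall j, k,
```

where $p(x_k)$ is the drone position, $p_{\mathrm{obs},j}$ the $j$-th obstacle position, and $r_{\mathrm{safe}}$ a safety radius; the first optimal control $u_0^{*}$ is applied and the problem is re-solved at every step, which is what allows rapid flight toward the VLM-proposed target while avoiding obstacles.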
Abstract: This paper introduces CognitiveDrone, a novel Vision-Language-Action (VLA) model tailored for complex Unmanned Aerial Vehicle (UAV) tasks that demand advanced cognitive abilities. Trained on a dataset of over 8,000 simulated flight trajectories across three key categories (Human Recognition, Symbol Understanding, and Reasoning), the model generates real-time 4D action commands from first-person visual inputs and textual instructions. To further enhance performance in intricate scenarios, we propose CognitiveDrone-R1, which integrates an additional Vision-Language Model (VLM) reasoning module to simplify task directives prior to high-frequency control. Experimental evaluations using our open-source benchmark, CognitiveDroneBench, reveal that while a racing-oriented model (RaceVLA) achieves an overall success rate of 31.3%, the base CognitiveDrone model reaches 59.6%, and CognitiveDrone-R1 attains 77.2%. These results demonstrate improvements of up to 30% in critical cognitive tasks, underscoring the effectiveness of incorporating advanced reasoning capabilities into UAV control systems. Our contributions include the development of a state-of-the-art VLA model for UAV control and the introduction of the first dedicated benchmark for assessing cognitive tasks in drone operations. The complete repository is available at cognitivedrone.github.io
Abstract: In autonomous driving, dynamic environments and corner cases pose significant challenges to the robustness of the ego vehicle's decision-making. To address these challenges, starting from the state-action mapping representation of the end-to-end autonomous driving paradigm, we introduce a novel pipeline, VDT-Auto. Leveraging the state-understanding capability of a Visual Language Model (VLM) together with diffusion Transformer-based action generation, VDT-Auto parses the environment geometrically and contextually to condition the diffusion process. Geometrically, we use a bird's-eye-view (BEV) encoder to extract feature grids from the surrounding images. Contextually, the structured output of our fine-tuned VLM is processed into textual embeddings and noisy paths. During diffusion, the noise added in the forward process is sampled from the noisy path output of the fine-tuned VLM, while the extracted BEV feature grids and embedded texts condition the reverse process of our diffusion Transformers. VDT-Auto achieves an average L2 error of 0.52 m and an average collision rate of 21% in the nuScenes open-loop planning evaluation. Moreover, a real-world demonstration exhibited the prominent generalizability of VDT-Auto. The code and dataset will be released after acceptance.
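The abstract notes that the forward-diffusion noise is sampled from the fine-tuned VLM's noisy path output rather than from a pure Gaussian. The snippet below sketches that idea on top of the standard DDPM forward equation x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, taking eps as the normalized residual between the VLM path and the ground-truth path; the normalization and noise schedule are assumptions, not VDT-Auto's exact procedure.

```python
import torch

def vlm_conditioned_forward_diffusion(gt_path, vlm_noisy_path, alpha_bar_t):
    """Forward-diffuse a ground-truth path using noise derived from the VLM's
    noisy path proposal instead of i.i.d. Gaussian noise.

    gt_path, vlm_noisy_path: (T, 2) waypoint tensors.
    alpha_bar_t: scalar cumulative noise-schedule coefficient in (0, 1).
    """
    eps = vlm_noisy_path - gt_path                      # residual as the noise direction
    eps = eps / (eps.std() + 1e-6)                      # roughly unit variance (assumed)
    x_t = torch.sqrt(torch.as_tensor(alpha_bar_t)) * gt_path \
        + torch.sqrt(torch.as_tensor(1.0 - alpha_bar_t)) * eps
    return x_t, eps                                     # eps becomes the denoiser's target

# toy usage
gt = torch.zeros(8, 2)
vlm = gt + 0.3 * torch.randn(8, 2)
x_t, eps = vlm_conditioned_forward_diffusion(gt, vlm, alpha_bar_t=0.5)
```

Training the denoiser against this VLM-derived noise biases the reverse process toward corrections of plausible VLM proposals, while the BEV features and text embeddings enter as conditioning of that reverse process, as described above.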
Abstract: We propose a new concept, Evolution 6.0, which represents the evolution of robotics driven by Generative AI. When a robot lacks the tools necessary to accomplish a task requested by a human, it autonomously designs the required instruments and learns how to use them to achieve the goal. Evolution 6.0 is an autonomous robotic system powered by Vision-Language Models (VLMs), Vision-Language-Action (VLA) models, and text-to-3D generative models for tool design and task execution. The system comprises two key modules: the Tool Generation Module, which fabricates task-specific tools from visual and textual data, and the Action Generation Module, which converts natural language instructions into robotic actions. It integrates QwenVLM for environmental understanding, OpenVLA for task execution, and Llama-Mesh for 3D tool generation. Evaluation results demonstrate a 90% success rate for tool generation with a 10-second inference time, while action generation achieves 83.5% in physical and visual generalization, 70% in motion generalization, and 37% in semantic generalization. Future improvements will focus on bimanual manipulation, expanded task capabilities, and enhanced environmental interpretation to improve real-world adaptability.