Abstract: This paper presents OmniRace, an approach to controlling a racing drone through six-degree-of-freedom (6-DoF) hand pose estimation and gesture recognition. To our knowledge, it is the first technology that allows low-level control of high-speed drones using gestures. OmniRace employs a gesture interface based on computer vision and a deep neural network to estimate the 6-DoF hand pose. The machine learning algorithm robustly interprets human gestures, allowing users to control drone motion intuitively. Real-time control of a racing drone demonstrates the effectiveness of the system, validating its potential to revolutionize drone racing and other applications. Experiments conducted in the Gazebo simulation environment revealed that OmniRace allows users to complete the UAV race track significantly faster (by 25.1%) and shortens the test drone path from 102.9 m to 83.7 m. Users preferred the gesture interface for attractiveness (UEQ score of 1.57), hedonic quality (UEQ score of 1.56), and lower perceived temporal demand (NASA-TLX score of 32.0), while noting the high efficiency (UEQ score of 0.75) and low physical demand (NASA-TLX score of 19.0) of the baseline remote controller. The deep neural network attains an average accuracy of 99.75% on both normalized and raw datasets. OmniRace can potentially change the way humans interact with and navigate racing drones in dynamic and complex environments. The source code is available at https://github.com/SerValera/OmniRace.git.
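The abstract does not specify how an estimated hand pose is converted into drone commands, so the following is only a minimal sketch of one plausible mapping, not the authors' implementation: a 6-DoF hand pose (position plus roll/pitch/yaw) is offset against a neutral pose, scaled, dead-zone filtered, and clipped to normalized setpoints. The gains, dead-zone threshold, axis conventions, and channel assignment are all illustrative assumptions.

```python
# Hypothetical sketch: mapping a 6-DoF hand pose to normalized drone
# setpoints. Gains, thresholds, and channel mapping are assumptions,
# not taken from the OmniRace paper.
import numpy as np

DEAD_ZONE = 0.05   # assumed threshold to ignore small hand jitter
GAIN = 2.0         # assumed hand-displacement-to-command gain

def hand_pose_to_setpoints(position, rpy, neutral_position, neutral_rpy):
    """Convert a 6-DoF hand pose (xyz in metres, roll/pitch/yaw in
    radians) into normalized drone commands in [-1, 1]."""
    delta_p = np.asarray(position) - np.asarray(neutral_position)
    delta_r = np.asarray(rpy) - np.asarray(neutral_rpy)
    raw = np.concatenate([delta_p, delta_r]) * GAIN
    raw[np.abs(raw) < DEAD_ZONE] = 0.0   # dead-zone filtering
    cmd = np.clip(raw, -1.0, 1.0)        # saturate commands
    # Assumed channel assignment: translation drives pitch/roll/throttle,
    # hand yaw drives yaw rate; hand roll and pitch are unused here.
    return {"pitch": cmd[0], "roll": cmd[1],
            "throttle": cmd[2], "yaw_rate": cmd[5]}

# Example: hand moved 10 cm forward and rotated 0.2 rad in yaw.
print(hand_pose_to_setpoints([0.10, 0.0, 0.0], [0.0, 0.0, 0.2],
                             [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]))
```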
Abstract: This paper introduces CognitiveDog, a pioneering quadruped robot with a Large Multi-modal Model (LMM) that is capable not only of communicating with humans verbally but also of physically interacting with the environment through object manipulation. The system was realized on a Unitree Go1 robot-dog equipped with a custom gripper and demonstrated autonomous decision-making capabilities, independently determining the most appropriate actions and interactions with various objects to fulfill user-defined tasks. These tasks do not necessarily include direct instructions, challenging the robot to comprehend and execute them based on natural language input and environmental cues. The paper delves into the intricacies of this system, dataset characteristics, and the software architecture. Key to this development is the robot's proficiency in navigating space using Visual-SLAM, effectively manipulating and transporting objects, and providing insightful natural language commentary during task execution. Experimental results highlight the robot's advanced task comprehension and adaptability, underscoring its potential in real-world applications. The dataset used to fine-tune the robot-dog behavior-generation model is available at huggingface.co/datasets/ArtemLykov/CognitiveDog_dataset
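The abstract does not describe how the LMM's output is grounded into robot skills; one common pattern, sketched below purely as an assumption, is to have the behavior-generation model emit one primitive action per line (e.g. "go_to(table)") and to parse and dispatch those primitives onto the robot's skill set. The primitive names and plan format here are hypothetical, not the CognitiveDog interface.

```python
# Hypothetical sketch: parsing an LMM-generated plan into (skill, argument)
# steps. The primitive set and line format are assumptions for illustration.
import re

PRIMITIVES = {"go_to", "take", "put", "say"}   # assumed skill set

def parse_plan(plan_text: str):
    """Parse output like 'go_to(table)\ntake(bottle)' into (skill, arg) steps."""
    steps = []
    for line in plan_text.strip().splitlines():
        m = re.fullmatch(r"(\w+)\((.*)\)", line.strip())
        if not m or m.group(1) not in PRIMITIVES:
            raise ValueError(f"unknown primitive: {line!r}")
        steps.append((m.group(1), m.group(2)))
    return steps

# Example plan for an indirect request such as "I am thirsty".
print(parse_plan("go_to(table)\ntake(bottle)\ngo_to(user)\nsay(Here is your water)"))
```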