Abstract: Large multimodal models exhibit remarkable intelligence, yet their embodied cognitive abilities during motion in open-ended urban 3D space remain largely unexplored. We introduce a benchmark to evaluate whether video large language models (Video-LLMs) can naturally process continuous first-person visual observations like humans, enabling recall, perception, reasoning, and navigation. We manually controlled drones to collect 3D embodied motion video data from real-world cities and simulated environments, yielding 1.5k video clips. We then design a pipeline to generate 5.2k multiple-choice questions. Evaluations of 17 widely used Video-LLMs reveal current limitations in urban embodied cognition. Correlation analysis provides insight into the relationships between tasks: causal reasoning correlates strongly with recall, perception, and navigation, while counterfactual and associative reasoning abilities correlate more weakly with the other tasks. We also validate the potential for sim-to-real transfer in urban embodiment through fine-tuning.
Abstract: Self-assembly enables multi-robot systems to merge diverse capabilities and accomplish tasks beyond the reach of individual robots. Incorporating varied docking mechanism layouts (DMLs) can enhance robot versatility or reduce costs. However, assembling multiple heterogeneous robots with diverse DMLs remains an open problem. This paper addresses it by introducing CuBoat, an omnidirectional unmanned surface vehicle (USV). Each of CuBoat's four sides can be equipped with or without a docking system, allowing it to emulate heterogeneous robots. We implement a multi-robot system composed of multiple CuBoats. To enhance maneuverability, a linear active disturbance rejection control (LADRC) scheme is proposed. Additionally, we present a generalized parallel self-assembly planning algorithm for efficient assembly among CuBoats with different DMLs. Validation is conducted through simulation in 2 scenarios across 4 distinct maps, demonstrating the performance of the self-assembly planning algorithm. Moreover, trajectory tracking tests confirm the effectiveness of the LADRC controller. Self-assembly experiments on 5 maps with different target structures affirm the algorithm's feasibility and generality. This study advances robotic self-assembly, enabling multi-robot systems to collaboratively tackle complex tasks beyond the capabilities of individual robots.
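To make the LADRC idea concrete, the sketch below simulates a generic second-order plant (y'' = f + b0·u, with f an unknown lumped disturbance) controlled by a linear extended state observer (ESO) plus PD feedback, using the standard bandwidth parameterization. This is a minimal illustration of LADRC in general; the plant, gains, and disturbance here are illustrative assumptions, not the USV model or tuning from the paper.

```python
import math

def simulate_ladrc(t_end=5.0, dt=0.001, r=1.0, b0=1.0, wo=20.0, wc=5.0):
    """Simulate LADRC on a toy second-order plant y'' = f(t) + b0*u.

    The third-order linear ESO estimates position (z1), velocity (z2),
    and the lumped disturbance f (z3); the control law cancels z3 and
    applies PD feedback on z1, z2. Gains follow the usual bandwidth
    parameterization: observer bandwidth wo, controller bandwidth wc.
    """
    b1, b2, b3 = 3 * wo, 3 * wo**2, wo**3   # ESO gains
    kp, kd = wc**2, 2 * wc                  # PD gains

    y, ydot = 0.0, 0.0          # plant states
    z1, z2, z3 = 0.0, 0.0, 0.0  # observer states
    u, t = 0.0, 0.0
    while t < t_end:
        # Unknown disturbance (unmeasured by the controller)
        f = 0.5 * math.sin(2.0 * t) - 0.2 * ydot
        # Plant integration (forward Euler)
        yddot = f + b0 * u
        ydot += yddot * dt
        y += ydot * dt
        # ESO update, driven by the estimation error e = y - z1
        e = y - z1
        z1 += (z2 + b1 * e) * dt
        z2 += (z3 + b2 * e + b0 * u) * dt
        z3 += (b3 * e) * dt
        # Control: PD on estimated states, cancel estimated disturbance
        u = (kp * (r - z1) - kd * z2 - z3) / b0
        t += dt
    return y
```

With these settings the output settles near the setpoint r = 1.0 despite the sinusoidal disturbance; raising `wo` speeds disturbance estimation at the cost of noise sensitivity, which is the central tuning trade-off in active disturbance rejection control.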