Abstract:Large foundation models trained on large-scale visual-text data can significantly enhance Open Vocabulary Object Detection (OVD) through data generation. However, this may lead to biased synthetic data and overfitting to specific configurations. Directly leveraging the hidden states of Large Language Models (LLMs) can sidestep the biases of manually curated data generation, yet this direction is surprisingly rarely explored. This paper presents a systematic method to enhance visual grounding by utilizing the decoder layers of the LLM of an MLLM. We introduce a zero-initialized cross-attention adapter to enable efficient knowledge transfer from LLMs to object detectors, a new approach called LED (LLM Enhanced Open-Vocabulary Object Detection). We demonstrate that intermediate hidden states from early LLM layers retain strong spatial-semantic correlations that are beneficial to grounding tasks. Experiments show that our adaptation strategy significantly enhances performance on complex free-form text queries while leaving performance on plain categories unchanged. With our adaptation, Qwen2-0.5B with Swin-T as the vision encoder improves GroundingDINO by 2.33% on Omnilabel, at the cost of 8.7% more GFLOPs. Qwen2-0.5B with a larger vision encoder can further boost the performance by 6.22%. We further validate our design by ablating adapter architectures, LLM sizes, and the choice of layers to adapt.
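As an illustration of the zero-initialized cross-attention adapter mentioned above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that detector features act as queries, LLM hidden states act as keys and values, and that the zero initialization is applied to the output projection. The abstract does not specify which parameter is zero-initialized, so the class name, dimensions, and that choice are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ZeroInitCrossAttentionAdapter:
    """Single-head cross-attention with a zero-initialized output projection.

    Hypothetical sketch: queries come from detector features, keys and values
    from LLM hidden states. Names and shapes are illustrative assumptions.
    """

    def __init__(self, det_dim, llm_dim, attn_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w_q = rng.normal(0.0, 0.02, (det_dim, attn_dim))
        self.w_k = rng.normal(0.0, 0.02, (llm_dim, attn_dim))
        self.w_v = rng.normal(0.0, 0.02, (llm_dim, attn_dim))
        # Assumption: the zero init sits on the output projection, so the
        # adapter contributes nothing until training moves it away from zero.
        self.w_o = np.zeros((attn_dim, det_dim))

    def __call__(self, det_feats, llm_hidden):
        q = det_feats @ self.w_q                      # (N, attn_dim)
        k = llm_hidden @ self.w_k                     # (T, attn_dim)
        v = llm_hidden @ self.w_v                     # (T, attn_dim)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, T)
        # Residual connection: at initialization the output equals det_feats.
        return det_feats + (attn @ v) @ self.w_o
```

With this initialization the detector's behavior is unchanged when the adapter is first attached, which is the usual motivation for zero-initialized adapters.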
Abstract:This paper addresses the multi-robot pursuit problem for an unknown target, encompassing both target state estimation and pursuit control. First, in state estimation, we focus on using only bearing information, as it is readily available from vision sensors and effective for small, distant targets. Challenges such as instability due to the nonlinearity of bearing measurements and singularities in the two-angle representation are addressed through a proposed uniform bearing-only information filter. This filter integrates multiple 3D bearing measurements, provides a concise formulation, and enhances stability and resilience to target loss caused by limited field of view (FoV). Second, in target pursuit control within complex environments, where challenges such as heterogeneity and limited FoV arise, conventional methods like differential games or Voronoi partitioning often prove inadequate. To address these limitations, we propose a novel multiagent reinforcement learning (MARL) framework, enabling multiple heterogeneous vehicles to search, localize, and follow a target while effectively handling those challenges. Third, to bridge the sim-to-real gap, we propose two key techniques: incorporating adjustable low-level control gains in training to replicate the dynamics of real-world autonomous ground vehicles (AGVs), and introducing spectral-normalized RL algorithms to enhance policy smoothness and robustness. Finally, we demonstrate the successful zero-shot transfer of the MARL controllers to AGVs, validating the effectiveness and practical feasibility of our approach. The accompanying video is available at https://youtu.be/HO7FJyZiJ3E.
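To make the idea of fusing several 3D bearing measurements in information form concrete, here is a minimal sketch of a standard pseudo-linear bearing-only triangulation. It is not the paper's filter. It assumes each bearing is a 3D unit vector g from a sensor at position s toward the target p, so that (I - g g^T)(p - s) = 0, and it accumulates an information matrix and vector over the measurements. The function names are hypothetical, and there is no process model or noise weighting.

```python
import numpy as np


def bearing_projector(g):
    """Orthogonal projector onto the plane perpendicular to bearing g."""
    g = np.asarray(g, dtype=float)
    g = g / np.linalg.norm(g)
    return np.eye(3) - np.outer(g, g)


def fuse_bearings(sensor_positions, bearings):
    """Least-squares target position from several 3D bearings.

    Each measurement contributes P to the information matrix and P @ s to the
    information vector, where P is the projector of that bearing. The matrix
    is invertible as long as the bearings are not all parallel.
    """
    info_matrix = np.zeros((3, 3))
    info_vector = np.zeros(3)
    for s, g in zip(sensor_positions, bearings):
        P = bearing_projector(g)
        info_matrix += P
        info_vector += P @ np.asarray(s, dtype=float)
    return np.linalg.solve(info_matrix, info_vector)
```

Because the bearing enters as a unit vector rather than as azimuth and elevation angles, this formulation has no angle singularities. A sensor that loses the target simply contributes no term to the sums.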
Abstract:Vision-based drone-to-drone detection has attracted increasing attention due to its importance in numerous tasks such as vision-based swarming, aerial see-and-avoid, and malicious drone detection. However, existing methods often encounter failures when the background is complex or the target is tiny. This paper proposes a novel end-to-end framework that accurately identifies small drones in complex environments using motion guidance. It starts by creating a motion difference map to capture the motion characteristics of tiny drones. Next, this motion difference map is combined with an RGB image using a bimodal fusion module, allowing for adaptive feature learning of the drone. Finally, the fused feature map is processed through an enhanced backbone and detection head based on the YOLOv5 framework to achieve accurate detection results. To validate our method, we propose a new dataset, named ARD100, which comprises 100 videos (202,467 frames) covering various challenging conditions and has the smallest average object size among existing drone detection datasets. Extensive experiments on the ARD100 and NPS-Drones datasets show that our proposed detector performs exceptionally well under challenging conditions and surpasses state-of-the-art algorithms across various metrics. We publicly release the code and the ARD100 dataset at https://github.com/Irisky123/YOLOMG.
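A motion difference map of the kind described above can be sketched, in its simplest form, as an absolute difference between two grayscale frames. This is only an illustrative assumption about the general idea. The abstract does not give the actual construction, and a moving camera would additionally need frame alignment, which is omitted here. The function name and the `threshold` parameter are hypothetical.

```python
import numpy as np


def motion_difference_map(prev_gray, curr_gray, threshold=0.0):
    """Absolute per-pixel difference between two aligned grayscale frames.

    Assumes the frames are already registered to each other (no camera
    ego-motion compensation is performed in this sketch).
    """
    prev = np.asarray(prev_gray, dtype=np.float32)
    curr = np.asarray(curr_gray, dtype=np.float32)
    if prev.shape != curr.shape:
        raise ValueError("frames must have the same shape")
    diff = np.abs(curr - prev)
    # Suppress small differences that are likely sensor noise.
    diff[diff < threshold] = 0.0
    return diff
```

The resulting single-channel map could then be passed alongside the RGB image to a fusion module, as the abstract describes.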
Abstract:In this paper, we propose a framework, collective behavioral cloning (CBC), to learn the underlying interaction mechanism and control policy of a swarm system. Given the trajectory data of a swarm system, we propose a graph variational autoencoder (GVAE) to learn the local interaction graph. Based on the interaction graph and swarm trajectory, we use behavioral cloning to learn the control policy of the swarm system. To demonstrate the practicality of CBC, we deploy it on a real-world decentralized vision-based robot swarm system. A visual attention network is trained based on the learned interaction graph for online neighbor selection. Experimental results show that our method outperforms previous approaches in predicting both the interaction graph and swarm actions with higher accuracy. This work offers a promising approach for understanding interaction mechanisms and swarm dynamics in future swarm robotics research. Code and data are available.
Abstract:The capture of flying MAVs (micro aerial vehicles) has garnered increasing research attention due to its intriguing challenges and promising applications. Despite recent advancements, a key limitation of existing work is that capture strategies are often relatively simple and constrained by platform performance. This paper addresses control strategies capable of capturing high-maneuverability targets. The unique challenge of achieving target capture under unstable conditions distinguishes this task from traditional pursuit-evasion and guidance problems. In this study, we transition from larger MAV platforms to a specially designed, compact capture MAV equipped with a custom launching device while maintaining high maneuverability. We explore both time-optimal planning (TOP) and reinforcement learning (RL) methods. Simulations demonstrate that TOP offers highly maneuverable and shorter trajectories, while RL excels in real-time adaptability and stability. Moreover, the RL method has been tested in real-world scenarios, successfully achieving target capture even in unstable states.
Abstract:MAV-capturing-MAV (MCM) is one of the few effective methods for physically countering misused or malicious MAVs. This paper presents a vision-based cooperative MCM system, where multiple pursuer MAVs equipped with onboard vision systems detect, localize, and pursue a target MAV. To enhance robustness, a distributed state estimation and control framework enables the pursuer MAVs to autonomously coordinate their actions. Pursuer trajectories are optimized using Model Predictive Control (MPC) and executed via a low-level SO(3) controller, ensuring smooth and stable pursuit. Once the capture conditions are satisfied, the pursuer MAVs automatically deploy a flying net to intercept the target. These capture conditions are determined based on the predicted motion of the net. To enable real-time decision-making, we propose a lightweight computational method to approximate the net motion, avoiding the prohibitive cost of solving the full net dynamics. The effectiveness of the proposed system is validated through simulations and real-world experiments. In real-world tests, our approach successfully captures a moving target traveling at 4 meters per second with an acceleration of 1 meter per second squared, achieving a success rate of 64.7 percent.
Abstract:Although acrobatic flight control has been studied extensively, one key limitation of the existing methods is that they are usually restricted to specific maneuver tasks and cannot change flight pattern parameters online. In this work, we propose a target-and-command-oriented reinforcement learning (TACO) framework, which can handle different maneuver tasks in a unified way and allows online parameter changes. Additionally, we propose a spectral normalization method with input-output rescaling to enhance the policy's temporal and spatial smoothness, independence, and symmetry, thereby overcoming the sim-to-real gap. We validate the TACO approach through extensive simulation and real-world experiments, demonstrating its capability to achieve high-speed circular flights and continuous multi-flips.
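For readers unfamiliar with spectral normalization, the sketch below shows the generic technique: the largest singular value of a weight matrix is estimated by power iteration, and the matrix is divided by it so that the corresponding linear layer has a Lipschitz constant of about 1. This is the standard method only. The input-output rescaling proposed in the paper is not reproduced, and the function names are hypothetical.

```python
import numpy as np


def spectral_norm_estimate(weight, n_iters=50, seed=0):
    """Estimate the largest singular value of `weight` by power iteration."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=weight.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        v = weight.T @ u
        v /= np.linalg.norm(v)
        u = weight @ v
        u /= np.linalg.norm(u)
    # With u and v converged to the top singular vectors, u^T W v = sigma_max.
    return float(u @ weight @ v)


def spectrally_normalize(weight, n_iters=50):
    """Rescale `weight` so that its spectral norm is approximately 1."""
    return weight / spectral_norm_estimate(weight, n_iters)
```

Bounding the spectral norm of each layer limits how much the policy output can change for a small change in the observation, which is one way to obtain smoother control commands.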
Abstract:High-precision manipulation has always been a developmental goal for aerial manipulators. This paper investigates the kinematic coordinate control problem in aerial manipulators. We propose a predictive kinematic coordinate control method, which includes a learning-based modified kinematic model and a model predictive control (MPC) scheme based on weight allocation. Compared to existing methods, our proposed approach offers several attractive features. First, the kinematic model incorporates closed-loop dynamics characteristics and online residual learning. Compared to methods that do not consider closed-loop dynamics and residuals, our proposed method improves accuracy by 59.6%. Second, we propose an MPC scheme that considers weight allocation and can coordinate the motion strategies of quadcopters and manipulators. Compared to methods that do not consider weight allocation, the proposed method can meet the requirements of more tasks. The proposed approach is verified through complex trajectory tracking and moving target tracking experiments. The results validate the effectiveness of the proposed method.
Abstract:Vision-based target motion estimation is a fundamental problem in many robotic tasks. The existing methods have the limitation of low observability and, hence, face challenges in tracking highly maneuverable targets. Motivated by the aerial target pursuit task where a target may maneuver in 3D space, this paper studies how to further enhance observability by incorporating the bearing rate information that has not been well explored in the literature. The main contribution of this paper is to propose a new cooperative estimator called STT-R (Spatial-Temporal Triangulation with bearing Rate), which is designed under the framework of distributed recursive least squares. This theoretical result is further verified by numerical simulations and real-world experiments. It is shown that the proposed STT-R algorithm generates more accurate estimates and effectively reduces the lag in velocity estimation, enabling tracking of more maneuverable targets.
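To see why the bearing rate carries velocity information, recall the standard kinematic relation: if r is the target position relative to the observer and g = r / ||r|| is the bearing, then the bearing rate is (I - g g^T) v / ||r||, where v is the relative velocity. The sketch below evaluates this relation so it can be checked against a finite difference. It illustrates the measurement model only, not the STT-R estimator, and the function names are hypothetical.

```python
import numpy as np


def bearing(rel_pos):
    """Unit bearing vector pointing from the observer to the target."""
    rel_pos = np.asarray(rel_pos, dtype=float)
    return rel_pos / np.linalg.norm(rel_pos)


def bearing_rate(rel_pos, rel_vel):
    """Time derivative of the bearing for a given relative position and velocity.

    Only the velocity component perpendicular to the line of sight changes
    the bearing, and its effect is scaled by the inverse of the range.
    """
    g = bearing(rel_pos)
    P = np.eye(3) - np.outer(g, g)
    return P @ np.asarray(rel_vel, dtype=float) / np.linalg.norm(rel_pos)
```

Since the bearing rate depends directly on the relative velocity, measuring it gives the estimator velocity information without waiting for it to accumulate from successive bearings, which is consistent with the reduced velocity lag reported above.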
Abstract:Visual prompting has gained popularity as a method for adapting pre-trained models to specific tasks, particularly in the realm of parameter-efficient tuning. However, existing visual prompting techniques often pad the prompt parameters around the image, limiting the interaction between the visual prompts and the original image to a small set of patches while neglecting the inductive bias present in shared information across different patches. In this study, we conduct a thorough preliminary investigation to identify and address these limitations. We propose a novel visual prompt design, introducing Low-Rank matrix multiplication for Visual Prompting (LoR-VP), which enables shared and patch-specific information across rows and columns of image pixels. Extensive experiments across seven network architectures and four datasets demonstrate significant improvements in both performance and efficiency compared to state-of-the-art visual prompting methods, achieving up to 6 times faster training, 18 times fewer visual prompt parameters, and a 3.1% improvement in performance. The code is available at https://github.com/jincan333/LoR-VP.
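The low-rank prompt idea can be sketched as follows: a full-image prompt is formed as the product of two thin matrices per channel and added to the image, so every pixel receives a prompt value while the parameter count grows with H + W rather than H x W. This is a hedged illustration of low-rank matrix multiplication for prompting in general. The shapes, the per-channel factorization, and the function names are assumptions, not details taken from the paper.

```python
import numpy as np


def low_rank_prompt(B, A):
    """Build a (C, H, W) prompt from factors B (C, H, r) and A (C, r, W)."""
    # Batched matrix product over the channel axis.
    return B @ A


def apply_prompt(image, B, A):
    """Add the low-rank prompt to an image of shape (C, H, W)."""
    return image + low_rank_prompt(B, A)


def prompt_parameter_count(B, A):
    """Number of trainable prompt parameters in the two factors."""
    return B.size + A.size
```

For example, with C = 3, H = W = 224, and rank r = 4, the two factors hold 3 * 4 * (224 + 224) = 5,376 parameters, compared with 150,528 for a dense per-pixel prompt of the same size.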