Abstract:Prevailing Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone, similar to how Large Language Models (LLMs) process the text tokens. However, the number of vision tokens increases quadratically as the image resolutions, leading to huge computational costs. In this paper, we consider improving MLLM's efficiency from two scenarios, (I) Reducing computational cost without degrading the performance. (II) Improving the performance with given budgets. We start with our main finding that the ranking of each vision token sorted by attention scores is similar in each layer except the first layer. Based on it, we assume that the number of essential top vision tokens does not increase along layers. Accordingly, for Scenario I, we propose a greedy search algorithm (G-Search) to find the least number of vision tokens to keep at each layer from the shallow to the deep. Interestingly, G-Search is able to reach the optimal reduction strategy based on our assumption. For Scenario II, based on the reduction strategy from G-Search, we design a parametric sigmoid function (P-Sigmoid) to guide the reduction at each layer of the MLLM, whose parameters are optimized by Bayesian Optimization. Extensive experiments demonstrate that our approach can significantly accelerate those popular MLLMs, e.g. LLaVA, and InternVL2 models, by more than $2 \times$ without performance drops. Our approach also far outperforms other token reduction methods when budgets are limited, achieving a better trade-off between efficiency and effectiveness.
Abstract:Aerial manipulators, composed of multirotors and robotic arms, have a structure and function highly reminiscent of avian species. This paper studies the tracking control problem for aerial manipulators. This paper studies the tracking control problem for aerial manipulators. We propose an avian-inspired aerial manipulation system, which includes an avian-inspired robotic arm design, a Recursive Newton-Euler (RNE) method-based nonlinear flight controller, and a coordinated controller with two modes. Compared to existing methods, our proposed approach offers several attractive features. First, the morphological characteristics of avian species are used to determine the size proportion of the multirotor and the robotic arm in the aerial manipulator. Second, the dynamic coupling of the aerial manipulator is addressed by the RNE-based flight controller and a dual-mode coordinated controller. Specifically, under our proposed algorithm, the aerial manipulator can stabilize the end-effector's pose, similar to avian head stabilization. The proposed approach is verified through three numerical experiments. The results show that even when the quadcopter is disturbed by different forces, the position error of the end-effector achieves millimeter-level accuracy, and the attitude error remains within 1 degree. The limitation of this work is not considering aggressive manipulation like that seen in birds. Addressing this through future studies that explore real-world experiments will be a key direction for research.
Abstract:Emerging 3D scene representations, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have demonstrated their effectiveness in Simultaneous Localization and Mapping (SLAM) for photo-realistic rendering, particularly when using high-quality video sequences as input. However, existing methods struggle with motion-blurred frames, which are common in real-world scenarios like low-light or long-exposure conditions. This often results in a significant reduction in both camera localization accuracy and map reconstruction quality. To address this challenge, we propose a dense visual SLAM pipeline (i.e. MBA-SLAM) to handle severe motion-blurred inputs. Our approach integrates an efficient motion blur-aware tracker with either neural radiance fields or Gaussian Splatting based mapper. By accurately modeling the physical image formation process of motion-blurred images, our method simultaneously learns 3D scene representation and estimates the cameras' local trajectory during exposure time, enabling proactive compensation for motion blur caused by camera movement. In our experiments, we demonstrate that MBA-SLAM surpasses previous state-of-the-art methods in both camera localization and map reconstruction, showcasing superior performance across a range of datasets, including synthetic and real datasets featuring sharp images as well as those affected by motion blur, highlighting the versatility and robustness of our approach. Code is available at https://github.com/WU-CVGL/MBA-SLAM.
Abstract:In recent years, there has been a growing interest in the visual detection of micro aerial vehicles (MAVs) due to its importance in numerous applications. However, the existing methods based on either appearance or motion features encounter difficulties when the background is complex or the MAV is too small. In this paper, we propose a novel motion-guided MAV detector that can accurately identify small MAVs in complex and non-planar scenes. This detector first exploits a motion feature enhancement module to capture the motion features of small MAVs. Then it uses multi-object tracking and trajectory filtering to eliminate false positives caused by motion parallax. Finally, an appearance-based classifier and an appearance-based detector that operates on the cropped regions are used to achieve precise detection results. Our proposed method can effectively and efficiently detect extremely small MAVs from dynamic and complex backgrounds because it aggregates pixel-level motion features and eliminates false positives based on the motion and appearance features of MAVs. Experiments on the ARD-MAV dataset demonstrate that the proposed method could achieve high performance in small MAV detection under challenging conditions and outperform other state-of-the-art methods across various metrics
Abstract:We aim to evaluate Large Language Models (LLMs) for embodied decision making. While a significant body of work has been leveraging LLMs for decision making in embodied environments, we still lack a systematic understanding of their performance because they are usually applied in different domains, for different purposes, and built based on different inputs and outputs. Furthermore, existing evaluations tend to rely solely on a final success rate, making it difficult to pinpoint what ability is missing in LLMs and where the problem lies, which in turn blocks embodied agents from leveraging LLMs effectively and selectively. To address these limitations, we propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks and input-output specifications of LLM-based modules. Specifically, it allows us to unify 1) a broad set of embodied decision-making tasks involving both state and temporally extended goals, 2) four commonly-used LLM-based modules for decision making: goal interpretation, subgoal decomposition, action sequencing, and transition modeling, and 3) a collection of fine-grained metrics which break down evaluation into various types of errors, such as hallucination errors, affordance errors, various types of planning errors, etc. Overall, our benchmark offers a comprehensive assessment of LLMs' performance for different subtasks, pinpointing the strengths and weaknesses in LLM-powered embodied AI systems, and providing insights for effective and selective use of LLMs in embodied decision making.
Abstract:Large Language Models (LLMs) have significantly enhanced Information Retrieval (IR) across various modules, such as reranking. Despite impressive performance, current zero-shot relevance ranking with LLMs heavily relies on human prompt engineering. Existing automatic prompt engineering algorithms primarily focus on language modeling and classification tasks, leaving the domain of IR, particularly reranking, underexplored. Directly applying current prompt engineering algorithms to relevance ranking is challenging due to the integration of query and long passage pairs in the input, where the ranking complexity surpasses classification tasks. To reduce human effort and unlock the potential of prompt optimization in reranking, we introduce a novel automatic prompt engineering algorithm named APEER. APEER iteratively generates refined prompts through feedback and preference optimization. Extensive experiments with four LLMs and ten datasets demonstrate the substantial performance improvement of APEER over existing state-of-the-art (SoTA) manual prompts. Furthermore, we find that the prompts generated by APEER exhibit better transferability across diverse tasks and LLMs. Code is available at https://github.com/jincan333/APEER.
Abstract:Large language model (LLM) agents have demonstrated impressive capability in utilizing external tools and knowledge to boost accuracy and reduce hallucinations. However, developing the prompting techniques that make LLM agents able to effectively use external tools and knowledge is a heuristic and laborious task. Here, we introduce AvaTaR, a novel and automatic framework that optimizes an LLM agent to effectively use the provided tools and improve its performance on a given task/domain. During optimization, we design a comparator module to iteratively provide insightful and holistic prompts to the LLM agent via reasoning between positive and negative examples sampled from training data. We demonstrate AvaTaR on four complex multimodal retrieval datasets featuring textual, visual, and relational information. We find AvaTaR consistently outperforms state-of-the-art approaches across all four challenging tasks and exhibits strong generalization ability when applied to novel cases, achieving an average relative improvement of 14% on the Hit@1 metric. Code and dataset are available at https://github.com/zou-group/avatar.
Abstract:Answering real-world user queries, such as product search, often requires accurate retrieval of information from semi-structured knowledge bases or databases that involve blend of unstructured (e.g., textual descriptions of products) and structured (e.g., entity relations of products) information. However, previous works have mostly studied textual and relational retrieval tasks as separate topics. To address the gap, we develop STARK, a large-scale Semi-structure retrieval benchmark on Textual and Relational Knowledge Bases. We design a novel pipeline to synthesize natural and realistic user queries that integrate diverse relational information and complex textual properties, as well as their ground-truth answers. Moreover, we rigorously conduct human evaluation to validate the quality of our benchmark, which covers a variety of practical applications, including product recommendations, academic paper searches, and precision medicine inquiries. Our benchmark serves as a comprehensive testbed for evaluating the performance of retrieval systems, with an emphasis on retrieval approaches driven by large language models (LLMs). Our experiments suggest that the STARK datasets present significant challenges to the current retrieval and LLM systems, indicating the demand for building more capable retrieval systems that can handle both textual and relational aspects.
Abstract:Autonomous vehicle (AV) systems rely on robust perception models as a cornerstone of safety assurance. However, objects encountered on the road exhibit a long-tailed distribution, with rare or unseen categories posing challenges to a deployed perception model. This necessitates an expensive process of continuously curating and annotating data with significant human effort. We propose to leverage recent advances in vision-language and large language models to design an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios. This process operates iteratively, allowing for continuous self-improvement of the model. We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
Abstract:Visual detection of Micro Air Vehicles (MAVs) has attracted increasing attention in recent years due to its important application in various tasks. The existing methods for MAV detection assume that the training set and testing set have the same distribution. As a result, when deployed in new domains, the detectors would have a significant performance degradation due to domain discrepancy. In this paper, we study the problem of cross-domain MAV detection. The contributions of this paper are threefold. 1) We propose a Multi-MAV-Multi-Domain (M3D) dataset consisting of both simulation and realistic images. Compared to other existing datasets, the proposed one is more comprehensive in the sense that it covers rich scenes, diverse MAV types, and various viewing angles. A new benchmark for cross-domain MAV detection is proposed based on the proposed dataset. 2) We propose a Noise Suppression Network (NSN) based on the framework of pseudo-labeling and a large-to-small training procedure. To reduce the challenging pseudo-label noises, two novel modules are designed in this network. The first is a prior-based curriculum learning module for allocating adaptive thresholds for pseudo labels with different difficulties. The second is a masked copy-paste augmentation module for pasting truly-labeled MAVs on unlabeled target images and thus decreasing pseudo-label noises. 3) Extensive experimental results verify the superior performance of the proposed method compared to the state-of-the-art ones. In particular, it achieves mAP of 46.9%(+5.8%), 50.5%(+3.7%), and 61.5%(+11.3%) on the tasks of simulation-to-real adaptation, cross-scene adaptation, and cross-camera adaptation, respectively.