Abstract:In recent years, the integration of large language models (LLMs) has revolutionized the field of robotics, enabling robots to communicate, understand, and reason with human-like proficiency. This paper explores the multifaceted impact of LLMs on robotics, addressing key challenges and opportunities for leveraging these models across various domains. By categorizing and analyzing LLM applications within core robotics elements -- communication, perception, planning, and control -- we aim to provide actionable insights for researchers seeking to integrate LLMs into their robotic systems. Our investigation focuses on LLMs developed post-GPT-3.5, primarily in text-based modalities while also considering multimodal approaches for perception and control. We offer comprehensive guidelines and examples for prompt engineering, facilitating beginners' access to LLM-based robotics solutions. Through tutorial-level examples and structured prompt construction, we illustrate how LLM-guided enhancements can be seamlessly integrated into robotics applications. This survey serves as a roadmap for researchers navigating the evolving landscape of LLM-driven robotics, offering a comprehensive overview and practical guidance for harnessing the power of language models in robotics development.
Abstract:We aim to solve the problem of spatially localizing composite instructions referring to space: space grounding. Compared to current instance grounding, space grounding is challenging due to the ill-posedness of identifying locations referred to by discrete expressions and the compositional ambiguity of referring expressions. Therefore, we propose a novel probabilistic space-grounding methodology (LINGO-Space) that accurately identifies a probabilistic distribution of space being referred to and incrementally updates it, given subsequent referring expressions leveraging configurable polar distributions. Our evaluations show that the estimation using polar distributions enables a robot to ground locations successfully through $20$ table-top manipulation benchmark tests. We also show that updating the distribution helps the grounding method accurately narrow the referring space. We finally demonstrate the robustness of the space grounding with simulated manipulation and real quadruped robot navigation tasks. Code and videos are available at https://lingo-space.github.io.
Abstract:We aim to solve the problem of data-driven collision-distance estimation given 3-dimensional (3D) geometries. Conventional algorithms suffer from low accuracy due to their reliance on limited representations, such as point clouds. In contrast, our previous graph-based model, GraphDistNet, achieves high accuracy using edge information but incurs higher message-passing costs with growing graph size, limiting its applicability to 3D geometries. To overcome these challenges, we propose GDN-R, a novel 3D graph-based estimation network. GDN-R employs a layer-wise probabilistic graph-rewiring algorithm leveraging the differentiable Gumbel-top-K relaxation. Our method accurately infers minimum distances through iterative graph rewiring and updating relevant embeddings. The probabilistic rewiring enables fast and robust embedding with respect to unforeseen categories of geometries. Through 41,412 random benchmark tasks with 150 pairs of 3D objects, we show GDN-R outperforms state-of-the-art baseline methods in terms of accuracy and generalizability. We also show that the proposed rewiring improves the update performance while reducing the size of the estimation model. We finally show its batch prediction and auto-differentiation capabilities for trajectory optimization in both simulated and real-world scenarios.
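For readers unfamiliar with the Gumbel-top-K idea mentioned in the abstract above, the following is a minimal sketch of its sampling step only: perturb each log-weight with Gumbel(0, 1) noise and keep the k largest, which draws k distinct indices without replacement. This is not GDN-R's rewiring layer, and it omits the differentiable relaxation; the function name `gumbel_top_k` and its arguments are hypothetical, chosen for illustration.

```python
import math
import random


def gumbel_top_k(logits, k, rng=None):
    """Sample k distinct indices by adding Gumbel(0, 1) noise to each logit.

    Illustrative only: the non-differentiable sampling trick, not the
    relaxed version used inside a trained network.
    """
    rng = rng or random.Random()
    perturbed = []
    for index, logit in enumerate(logits):
        # Clamp away from 0 so that log(u) is always defined.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((logit + gumbel_noise, index))
    # Keep the indices of the k largest perturbed scores.
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]
```

In a graph-rewiring setting, the logits could be read as learned scores for candidate edges, with the selected indices naming the edges kept at a given layer; that mapping is an assumption made here for intuition and is not taken from the abstract.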
Abstract:Spoken language serves as an accessible and efficient interface, enabling non-experts and disabled users to interact with complex assistant robots. However, accurately grounding language utterances poses a significant challenge due to the acoustic variability in speakers' voices and environmental noise. In this work, we propose a novel speech-scene graph grounding network (SGGNet$^2$) that robustly grounds spoken utterances by leveraging the acoustic similarity between correctly recognized and misrecognized words obtained from automatic speech recognition (ASR) systems. To incorporate the acoustic similarity, we extend our previous grounding model, the scene-graph-based grounding network (SGGNet), with the ASR model from NVIDIA NeMo. We accomplish this by feeding the latent vector of speech pronunciations into the BERT-based grounding network within SGGNet. We evaluate the effectiveness of using latent vectors of speech commands in grounding through qualitative and quantitative studies. We also demonstrate the capability of SGGNet$^2$ in a speech-based navigation task using a real quadruped robot, RBQ-3, from Rainbow Robotics.
Abstract:We present the problem of inverse constraint learning (ICL), which recovers constraints from demonstrations to autonomously reproduce constrained skills in new scenarios. However, ICL is ill-posed, leading to inaccurate inference of constraints from demonstrations. To address this issue, we introduce a transferable constraint learning (TCL) algorithm that jointly infers a task-oriented reward and a task-agnostic constraint, enabling the generalization of learned skills. Our method TCL additively decomposes the overall reward into a task reward and its residual as soft constraints, maximizing policy divergence between task- and constraint-oriented policies to obtain a transferable constraint. Evaluating our method and four baselines in three simulated environments, we show TCL outperforms state-of-the-art IRL and ICL algorithms, achieving up to $72\%$ higher task-success rates with accurate decomposition compared to the next best approach in novel scenarios. Further, we demonstrate the robustness of TCL on a real-world robotic tray-carrying task.
Abstract:This paper presents a novel algorithm for robot task and motion planning (TAMP) problems by utilizing a reachability tree. While tree-based algorithms are known for their speed and simplicity in motion planning (MP), they are not well-suited for TAMP problems that involve both abstracted and geometrical state variables. To address this challenge, we propose a hierarchical sampling strategy, which first generates an abstracted task plan using Monte Carlo tree search (MCTS) and then fills in the details with a geometrically feasible motion trajectory. Moreover, we show that the performance of the proposed method can be significantly enhanced by selecting an appropriate reward for MCTS and by using a pre-generated goal state that is guaranteed to be geometrically feasible. A comparative study using TAMP benchmark problems demonstrates the effectiveness of the proposed approach.
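As background for the MCTS step named in the abstract above, here is a small sketch of the standard UCT (UCB1) child-selection rule that MCTS implementations commonly use to balance exploitation and exploration. The abstract does not state which selection rule or reward the authors use, so this is the textbook rule only; `uct_score`, `select_child`, and the exploration constant `c` are hypothetical names and values.

```python
import math


def uct_score(value_sum, visits, parent_visits, c=1.4):
    """UCB1 score of one child: mean value plus an exploration bonus."""
    if visits == 0:
        # Unvisited children are tried first.
        return math.inf
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits, c=1.4):
    """Pick the action with the highest UCT score.

    `children` maps an action name to a (value_sum, visits) pair.
    """
    return max(
        children,
        key=lambda action: uct_score(*children[action], parent_visits, c),
    )
```

For a task-level search, the actions would be abstract operators such as pick or place, with the reward signal reflecting whether a motion plan could later be found; that reading is an assumption for illustration.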
Abstract:Trajectory optimization (TO) aims to find a sequence of valid states while minimizing costs. However, fine-grained validation is often costly because of computationally expensive collision searches, whereas coarse searches reduce the safety of the system by sacrificing solution precision. To resolve these issues, we introduce a new collision-distance estimator, GraphDistNet, that can precisely encode the structural information between two geometries by leveraging edge feature-based convolutional operations, and also efficiently predict a batch of collision distances and gradients through 25,000 random environments with a maximum of 20 unforeseen objects. Further, we show that adopting an attention mechanism enables our method to generalize easily to unforeseen complex geometries for TO. Our evaluation shows GraphDistNet outperforms state-of-the-art baseline methods in both simulated and real-world tasks.
Abstract:We present a task-and-motion planning (TAMP) algorithm robust against a human operator's cooperative or adversarial interventions. Interventions often invalidate the current plan and require replanning on the fly. Replanning can be computationally expensive and often interrupts seamless task execution. We introduce a dynamically reconfigurable planning methodology with behavior tree-based control strategies toward reactive TAMP, which takes advantage of previous plans and incremental graph search during temporal logic-based reactive synthesis. Our algorithm also shows efficient recovery functionalities that minimize the number of replanning steps. Finally, our algorithm produces a robust, efficient, and complete TAMP solution. Our experimental results show the algorithm achieves superior manipulation performance in both simulated and real-world tasks.
Abstract:Eating is an essential activity of daily living (ADL) for staying healthy and living at home independently. Although numerous assistive devices have been introduced, many people with disabilities are still restricted from independent eating due to the devices' physical or perceptual limitations. In this work, we introduce a new meal-assistance system using a general-purpose mobile manipulator, a Willow Garage PR2, which has the potential to serve as a versatile form of assistive technology. Our active feeding framework enables the robot to autonomously deliver food to the user's mouth. In detail, our web-based user interface, visually guided behaviors, and safety tools allow people with severe motor impairments to benefit from the robotic assistance. We evaluated our system with 10 able-bodied participants and 9 people with motor impairments. Both groups of participants successfully ate various foods using the system and reported high rates of success for the system's autonomous behaviors in a laboratory environment. We then performed an in-home evaluation with Henry Evans, a person with quadriplegia, at his house in California, USA. In general, Henry and the other people who operated the system reported that it was comfortable, safe, and easy to use. We discuss lessons learned and design insights from the user evaluations.
Abstract:Robots have the potential to assist people in bed, such as in healthcare settings, yet bedding materials like sheets and blankets can make observation of the human body difficult for robots. A pressure-sensing mat on a bed can provide pressure images that are relatively insensitive to bedding materials. However, prior work on estimating human pose from pressure images has been restricted to 2D pose estimates and flat beds. In this work, we present two convolutional neural networks to estimate the 3D joint positions of a person in a configurable bed from a single pressure image. The first network directly outputs 3D joint positions, while the second outputs a kinematic model that includes estimated joint angles and limb lengths. We evaluated our networks on data from 17 human participants with two bed configurations: supine and seated. Our networks achieved a mean joint position error of 77 mm when tested with data from people outside the training set, outperforming several baselines. We also present a simple mechanical model that provides insight into ambiguity associated with limbs raised off of the pressure mat, and demonstrate that Monte Carlo dropout can be used to estimate pose confidence in these situations. Finally, we provide a demonstration in which a mobile manipulator uses our network's estimated kinematic model to reach a location on a person's body in spite of the person being seated in a bed and covered by a blanket.
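The Monte Carlo dropout idea mentioned in the abstract above can be illustrated with a toy sketch: keep dropout active at prediction time, run many stochastic forward passes, and treat the spread of the outputs as a confidence signal. The example below uses a single linear layer with hypothetical weights rather than the paper's convolutional networks; `dropout_forward`, `mc_dropout_predict`, and every parameter value are assumptions for illustration.

```python
import random
import statistics


def dropout_forward(x, weights, p_drop, rng):
    """One stochastic pass of a linear layer with inverted dropout on inputs."""
    keep = 1.0 - p_drop
    total = 0.0
    for xi, wi in zip(x, weights):
        # Each input survives with probability `keep` and is rescaled so the
        # expected output matches the dropout-free output.
        if rng.random() < keep:
            total += (xi / keep) * wi
    return total


def mc_dropout_predict(x, weights, p_drop=0.2, n_samples=200, seed=0):
    """Return (mean, standard deviation) over repeated dropout passes."""
    rng = random.Random(seed)
    samples = [dropout_forward(x, weights, p_drop, rng) for _ in range(n_samples)]
    return statistics.fmean(samples), statistics.pstdev(samples)
```

A larger standard deviation indicates that the prediction is sensitive to which units are dropped, which is the kind of signal one might use to flag ambiguous poses such as limbs raised off the mat.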