Abstract: As robotic systems become increasingly integrated into complex real-world environments, there is a growing need for approaches that enable robots to understand and act upon natural language instructions without relying on extensive pre-programmed knowledge of their surroundings. This paper presents PLATO, an innovative system that addresses this challenge by leveraging specialized large language model agents to process natural language inputs, understand the environment, predict tool affordances, and generate executable actions for robotic systems. Unlike traditional systems that depend on hard-coded environmental information, PLATO employs a modular architecture of specialized agents to operate without any initial knowledge of the environment. These agents identify objects and their locations within the scene, generate a comprehensive high-level plan, translate this plan into a series of low-level actions, and verify the completion of each step. The system is evaluated on particularly challenging tool-use tasks, which involve handling diverse objects and require long-horizon planning. PLATO's design allows it to adapt to dynamic and unstructured settings, significantly enhancing its flexibility and robustness. By evaluating the system across various complex scenarios, we demonstrate its capability to tackle a diverse range of tasks and offer a novel solution for integrating LLMs with robotic platforms, advancing the state-of-the-art in autonomous robotic task execution. For videos and prompt details, please see our project website: https://sites.google.com/andrew.cmu.edu/plato
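A minimal sketch of the agent loop this abstract describes (scene understanding, high-level planning, low-level action generation, and verification). This is illustrative, not the authors' code: `query_llm` is a placeholder for any chat-completion API, and all prompts, primitives, and helper names are assumptions.

```python
# Hypothetical PLATO-style multi-agent loop; every prompt and helper here is illustrative.

def query_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a call to a large language model client of your choice."""
    raise NotImplementedError("plug in an LLM client here")

def scene_agent(scene_description: str) -> str:
    # Identify objects and their locations from a textual description of the camera view.
    return query_llm("List the objects visible in the scene and their locations.", scene_description)

def planner_agent(task: str, scene: str) -> list[str]:
    # Produce a high-level plan as numbered steps, one per line.
    plan = query_llm("Write a high-level plan as numbered steps.", f"Task: {task}\nScene: {scene}")
    return [line.strip() for line in plan.splitlines() if line.strip()]

def action_agent(step: str, scene: str) -> str:
    # Translate one plan step into a single low-level primitive, e.g. pick(obj) / place(x, y).
    return query_llm("Emit one robot primitive for the given step.", f"Step: {step}\nScene: {scene}")

def verifier_agent(step: str, new_scene: str) -> bool:
    # Check whether the step's postcondition holds in the updated scene.
    answer = query_llm("Answer yes or no: was this step completed?", f"Step: {step}\nScene: {new_scene}")
    return "yes" in answer.lower()

def run(task: str, observe_scene, execute_primitive) -> None:
    scene = scene_agent(observe_scene())
    for step in planner_agent(task, scene):
        execute_primitive(action_agent(step, scene))   # dispatch to the robot controller
        scene = scene_agent(observe_scene())
        if not verifier_agent(step, scene):
            break                                      # a fuller system would replan or retry here
```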
Abstract: When humans create sculptures, we are able to reason about how we need to geometrically alter the clay to reach our target shape. We are not computing point-wise similarity metrics or reasoning about the low-level positioning of our tools, but instead determining the higher-level changes that need to be made. In this work, we propose LLM-Craft, a novel pipeline that leverages large language models (LLMs) to iteratively reason about and generate deformation-based crafting action sequences. We simplify and couple the state and action representations to further encourage shape-based reasoning. To the best of our knowledge, LLM-Craft is the first system to successfully leverage LLMs for complex deformable object interactions. Through our experiments, we demonstrate that with the LLM-Craft framework, LLMs are able to successfully reason about the deformation behavior of elasto-plastic objects. Furthermore, we find that LLM-Craft is able to successfully create a set of simple letter shapes. Finally, we explore extending the framework to reach more ambiguous semantic goals, such as "thinner" or "bumpy". For videos please see our website: https://sites.google.com/andrew.cmu.edu/llmcraft.
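An illustrative loop in the spirit of the iterative reasoning described above, not the released code. It assumes the simplified state is a coarse top-down occupancy grid rendered as text, and that each LLM reply is a single grid-aligned squeeze; `query_llm`, `observe_grid`, and `execute_squeeze` are placeholders.

```python
# Hypothetical LLM-Craft-style crafting loop; the grid representation and squeeze
# parameterization are assumptions made for illustration.

import numpy as np

def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def grid_to_text(grid: np.ndarray) -> str:
    # Render a binary occupancy grid as rows of 0/1 characters the LLM can read.
    return "\n".join("".join(str(int(v)) for v in row) for row in grid)

def craft(goal_grid: np.ndarray, observe_grid, execute_squeeze, max_iters: int = 10) -> None:
    for _ in range(max_iters):
        state = observe_grid()                         # current top-down clay occupancy
        if np.array_equal(state, goal_grid):
            break
        prompt = (
            "Current clay occupancy grid:\n" + grid_to_text(state)
            + "\nGoal grid:\n" + grid_to_text(goal_grid)
            + "\nReply with one squeeze as 'row col angle_deg' that moves the clay toward the goal."
        )
        row, col, angle = query_llm(prompt).split()
        execute_squeeze(int(row), int(col), float(angle))   # grasp centered on that grid cell
```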
Abstract: Manipulating deformable objects remains a challenge within robotics due to the difficulties of state estimation, long-horizon planning, and predicting how the object will deform given an interaction. These challenges are most pronounced for 3D deformable objects. We propose SculptDiff, a goal-conditioned diffusion-based imitation learning framework that works with point cloud state observations to directly learn clay sculpting policies for a variety of target shapes. To the best of our knowledge, this is the first real-world method that successfully learns manipulation policies for 3D deformable objects. For sculpting videos and access to our dataset and hardware CAD models, see the project website: https://sites.google.com/andrew.cmu.edu/imitation-sculpting/home
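A minimal PyTorch sketch of what goal-conditioned, point-cloud-based diffusion imitation can look like. The PointNet-style encoder, network sizes, noise schedule, and action dimension are assumptions for illustration and are not the SculptDiff architecture.

```python
# Toy goal-conditioned diffusion-policy training objective; all dimensions are illustrative.

import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    """Shared-MLP + max-pool encoder mapping an (N, 3) point cloud to a global feature."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))
    def forward(self, pts):                         # pts: (B, N, 3)
        return self.mlp(pts).max(dim=1).values      # (B, dim)

class Denoiser(nn.Module):
    """Predicts the noise added to an action, conditioned on state, goal, and timestep."""
    def __init__(self, act_dim: int = 7, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(act_dim + 2 * dim + 1, 256), nn.ReLU(),
                                 nn.Linear(256, act_dim))
    def forward(self, noisy_act, state_f, goal_f, t):
        return self.net(torch.cat([noisy_act, state_f, goal_f, t], dim=-1))

def ddpm_loss(encoder, denoiser, obs_pts, goal_pts, action, T: int = 100):
    # Standard denoising objective: corrupt the demo action at a random timestep, predict the noise.
    B = action.shape[0]
    t = torch.randint(1, T + 1, (B, 1)).float() / T
    noise = torch.randn_like(action)
    alpha = 1.0 - t                                  # toy linear schedule, for illustration only
    noisy = alpha.sqrt() * action + (1 - alpha).sqrt() * noise
    pred = denoiser(noisy, encoder(obs_pts), encoder(goal_pts), t)
    return nn.functional.mse_loss(pred, noise)
```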
Abstract: With the growing emphasis on the development and integration of service robots within household environments, we will need to endow robots with the ability to reliably pour a variety of liquids. However, liquid handling and pouring is a challenging task due to the complex dynamics and varying properties of different liquids, the exacting precision required to prevent spills and ensure accurate pouring, and the necessity for robots to adapt seamlessly to a multitude of containers in real-world scenarios. In response to these challenges, we propose a novel autonomous robotic pipeline that empowers robots to execute precision pouring tasks, encompassing both carbonated and non-carbonated liquids, as well as opaque and transparent liquids, into a variety of transparent containers. Our proposed approach maximizes the potential of RGB input alone, achieving zero-shot capability by harnessing existing pre-trained vision segmentation models. This eliminates the need for additional data collection, manual image annotations, or extensive training. Furthermore, our work integrates ChatGPT to facilitate seamless interaction between individuals without prior robotics expertise and our pouring pipeline; this integration enables users to effortlessly request and execute pouring actions. Our experiments demonstrate the pipeline's capability to successfully pour a diverse range of carbonated and non-carbonated beverages into containers of varying sizes, relying solely on visual input.
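A hedged sketch of the zero-shot container-detection step only, using an off-the-shelf COCO-pretrained Mask R-CNN from torchvision as a stand-in for the pre-trained segmentation models mentioned above; the specific model, classes of interest, and threshold are assumptions.

```python
# Illustrative zero-shot container segmentation from a single RGB frame (stand-in model).

import torch
from torchvision.io import read_image
from torchvision.models.detection import maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights

weights = MaskRCNN_ResNet50_FPN_Weights.DEFAULT
model = maskrcnn_resnet50_fpn(weights=weights).eval()
categories = weights.meta["categories"]

def segment_containers(image_path: str, score_thresh: float = 0.7):
    """Return binary masks for cup / wine glass / bottle instances in an RGB image."""
    img = read_image(image_path)                     # (3, H, W) uint8
    batch = [weights.transforms()(img)]
    with torch.no_grad():
        out = model(batch)[0]
    keep = [i for i, (lbl, score) in enumerate(zip(out["labels"], out["scores"]))
            if categories[int(lbl)] in {"cup", "wine glass", "bottle"} and score > score_thresh]
    return [(out["masks"][i, 0] > 0.5) for i in keep]   # one (H, W) bool mask per container
```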
Abstract: Deformable objects present a unique set of challenges for robotic manipulation, exhibiting high degrees of freedom and severe self-occlusion. State representation for materials that exhibit plastic behavior, like modeling clay or bread dough, is also difficult because they permanently deform under stress and are constantly changing shape. In this work, we investigate each of these challenges using the task of robotic sculpting with a parallel gripper. We propose a system that uses point clouds as the state representation and leverages a pre-trained point cloud reconstruction Transformer to learn a latent dynamics model that predicts material deformations given a grasp action. We design a novel action sampling algorithm that reasons about geometrical differences between point clouds to further improve the efficiency of model-based planners. All data and experiments are conducted entirely in the real world. Our experiments show that the proposed system is able to successfully capture the dynamics of clay and create a variety of simple shapes.
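An illustrative geometry-aware action sampler in the spirit of the one described above, not the authors' algorithm: it assumes grasp candidates should be biased toward regions of the current clay cloud that lie farthest from the goal cloud, since those regions most need to be deformed.

```python
# Hypothetical grasp sampler biased by point-cloud mismatch between current and goal shapes.

import numpy as np
from scipy.spatial import cKDTree

def sample_grasps(current_pts: np.ndarray, goal_pts: np.ndarray, n_samples: int = 8) -> np.ndarray:
    """current_pts, goal_pts: (N, 3) arrays. Returns (n_samples, 3) candidate grasp centers."""
    dists, _ = cKDTree(goal_pts).query(current_pts)        # distance of each current point to the goal cloud
    probs = (dists + 1e-9) / (dists + 1e-9).sum()          # sample proportionally to the mismatch
    idx = np.random.choice(len(current_pts), size=n_samples, p=probs, replace=False)
    return current_pts[idx]
```

In a model-based planner, each sampled candidate would then be scored by rolling it through the learned latent dynamics model and the best-scoring grasp executed.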
Abstract: Inferring liquid properties from vision is a challenging task due to the complex nature of fluids, both in behavior and detection. Nevertheless, the ability to infer their properties directly from visual information is highly valuable for autonomous fluid handling systems, as cameras are readily available. Moreover, predicting fluid properties purely from vision can accelerate the process of fluid characterization, saving considerable time and effort in various experimental environments. In this work, we present a purely vision-based approach to estimate viscosity, leveraging the fact that a fluid's oscillation behavior is directly related to its viscosity. Specifically, we utilize a 3D convolutional autoencoder to learn latent representations of different fluid-oscillating patterns present in videos. We leverage this latent representation to visually infer the fluid category or the dynamic viscosity of a fluid from video.
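A minimal PyTorch sketch of a 3D convolutional autoencoder of the kind described above; the channel sizes, clip shape, and grayscale input are illustrative assumptions, not the paper's exact architecture.

```python
# Toy 3D convolutional video autoencoder; expects clips with T, H, W divisible by 4,
# normalized to [0, 1].

import torch
import torch.nn as nn

class VideoAutoencoder(nn.Module):
    def __init__(self, latent_channels: int = 32):
        super().__init__()
        # Input clips: (B, 1, T, H, W) grayscale video of the oscillating fluid.
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, latent_channels, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(16, 1, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, clip):
        z = self.encoder(clip)            # latent spatiotemporal representation
        return self.decoder(z), z         # reconstruction plus latent for downstream prediction

# The latent z can then be pooled and passed to a small classifier or regressor that
# predicts the fluid category or its dynamic viscosity.
```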
Abstract: Toward the goal of fully autonomous cooking robots, it is important to develop robust systems that can chop a wide variety of objects. Existing approaches focus primarily on the low-level dynamics of the cutting action, which overlooks some of the practical real-world challenges of implementing autonomous cutting systems. In this work we propose an autonomous framework that sequences action primitives for the purpose of chopping fruits and vegetables on a cluttered cutting board. We present a novel technique that leverages the vision foundation models SAM and YOLO to accurately detect, segment, and track fruits and vegetables as they visually change through sequences of chops, finetuning YOLO on a novel dataset of whole and chopped fruits and vegetables. In our experiments, we demonstrate that our simple pipeline is able to reliably chop a variety of fruits and vegetables ranging in size, appearance, and texture, meeting a variety of chopping specifications, including fruit type, number of slices, and types of slices.
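A hedged sketch of the detect-then-segment step: a YOLO model (here assumed to be the finetuned produce detector) proposes boxes and SAM refines each box into a mask. The checkpoint paths and thresholds are placeholders, and the glue code is illustrative rather than the authors' pipeline.

```python
# Illustrative YOLO + SAM detect-and-segment step for produce on a cutting board.

import numpy as np
from ultralytics import YOLO
from segment_anything import sam_model_registry, SamPredictor

yolo = YOLO("produce_finetuned.pt")                          # placeholder finetuned checkpoint
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder SAM checkpoint
predictor = SamPredictor(sam)

def detect_and_segment(rgb: np.ndarray, conf: float = 0.5):
    """rgb: (H, W, 3) uint8 image. Returns a list of (class_name, bool mask) pairs."""
    predictor.set_image(rgb)
    results = []
    for box in yolo(rgb, conf=conf)[0].boxes:
        xyxy = box.xyxy[0].cpu().numpy()                     # detection box as [x1, y1, x2, y2]
        masks, scores, _ = predictor.predict(box=xyxy, multimask_output=False)
        results.append((yolo.names[int(box.cls)], masks[0].astype(bool)))
    return results
```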
Abstract: Across the robotics field, quality demonstrations are an integral part of many control pipelines. However, collecting high-quality demonstration trajectories remains time-consuming and difficult, often resulting in the number of demonstrations being the performance bottleneck. To address this issue, we present a Virtual Reality (VR) teleoperation method that uses an Oculus VR headset to teleoperate a Franka Emika Panda robot. Although other VR teleoperation methods exist, our code is open source, designed for readily available consumer hardware, easy to modify, agnostic to experimental setup, and simple to use.
Abstract: The use of human demonstrations in reinforcement learning has proven to significantly improve agent performance. However, any requirement for a human to manually 'teach' the model is somewhat antithetical to the goals of reinforcement learning. This paper attempts to minimize human involvement in the learning process while still retaining the performance advantages by using a single human example, collected through a simple-to-use virtual reality simulation, to assist with RL training. Our method augments a single demonstration to generate numerous human-like demonstrations that, when combined with Deep Deterministic Policy Gradients and Hindsight Experience Replay (DDPG + HER), significantly improve training time on simple tasks and allow the agent to solve a complex task (block stacking) that DDPG + HER alone cannot solve. The model achieves this significant training advantage using a single human example, requiring less than a minute of human input.
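An illustrative sketch of turning one demonstration into many "human-like" ones. The specific augmentation (shifting the whole motion and jittering waypoints) is an assumption made for illustration, not necessarily the paper's scheme.

```python
# Hypothetical single-demonstration augmentation; the perturbation scheme is an assumption.

import numpy as np

def augment_demo(demo: np.ndarray, n_copies: int = 100, noise_std: float = 0.005,
                 workspace_shift: float = 0.05) -> list[np.ndarray]:
    """demo: (T, 3) end-effector waypoints. Returns n_copies perturbed trajectories."""
    rng = np.random.default_rng(0)
    augmented = []
    for _ in range(n_copies):
        shift = rng.uniform(-workspace_shift, workspace_shift, size=3)   # relocate the whole motion
        jitter = rng.normal(0.0, noise_std, size=demo.shape)             # small per-waypoint noise
        augmented.append(demo + shift + jitter)
    return augmented

# Each augmented trajectory can be replayed to produce transitions that seed the replay
# buffer for DDPG + HER before the agent's own exploration begins.
```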
Abstract: Learning control policies with large action spaces is a challenging problem in the field of reinforcement learning due to inefficiencies in exploration. In this work, we introduce a Deep Reinforcement Learning (DRL) algorithm called Multi-Action Networks (MAN) Learning that addresses the challenge of large discrete action spaces. We propose separating the action space into two components, creating a Value Neural Network for each sub-action. Then, MAN uses temporal-difference learning to train the networks synchronously, which is simpler than training a single network with a large action output directly. To evaluate the proposed method, we test MAN on a block stacking task, and then extend MAN to handle 12 games from the Atari Arcade Learning Environment, each with an 18-action space. Our results indicate that MAN learns faster than both Deep Q-Learning and Double Deep Q-Learning, implying our method is a better-performing synchronous temporal-difference algorithm than those currently available for large action spaces.
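A minimal two-head sketch of the action-factorization idea in PyTorch. How the two sub-action values are combined into a joint bootstrap target (summation here) and the 3 x 6 factorization of an 18-way action space are assumptions for illustration, not the published MAN update.

```python
# Toy two-network value factorization with a shared TD bootstrap target (assumed combination rule).

import torch
import torch.nn as nn

class SubActionNet(nn.Module):
    def __init__(self, obs_dim: int, n_sub_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_sub_actions))
    def forward(self, obs):
        return self.net(obs)                      # one value estimate per sub-action

obs_dim, n_a, n_b, gamma = 8, 3, 6, 0.99          # e.g. 18 joint actions factored as 3 x 6
q_a, q_b = SubActionNet(obs_dim, n_a), SubActionNet(obs_dim, n_b)

def td_targets(reward, next_obs, done):
    # Both networks bootstrap from the same scalar target built from their greedy sub-actions.
    with torch.no_grad():
        next_value = q_a(next_obs).max(-1).values + q_b(next_obs).max(-1).values
    return reward + gamma * (1.0 - done) * next_value
```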