Peter Pastor

Gemini Robotics: Bringing AI into the Physical World

Mar 25, 2025

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl(+108 more)

Abstract:Recent advancements in large multimodal models have led to the emergence of remarkable generalist capabilities in digital domains, yet their translation to physical agents such as robots remains a significant challenge. This report introduces a new family of AI models purposefully designed for robotics and built upon the foundation of Gemini 2.0. We present Gemini Robotics, an advanced Vision-Language-Action (VLA) generalist model capable of directly controlling robots. Gemini Robotics executes smooth and reactive movements to tackle a wide range of complex manipulation tasks while also being robust to variations in object types and positions, handling unseen environments as well as following diverse, open vocabulary instructions. We show that with additional fine-tuning, Gemini Robotics can be specialized to new capabilities including solving long-horizon, highly dexterous tasks, learning new short-horizon tasks from as few as 100 demonstrations and adapting to completely novel robot embodiments. This is made possible because Gemini Robotics builds on top of the Gemini Robotics-ER model, the second model we introduce in this work. Gemini Robotics-ER (Embodied Reasoning) extends Gemini's multimodal reasoning capabilities into the physical world, with enhanced spatial and temporal understanding. This enables capabilities relevant to robotics including object detection, pointing, trajectory and grasp prediction, as well as multi-view correspondence and 3D bounding box predictions. We show how this novel combination can support a variety of robotics applications. We also discuss and address important safety considerations related to this new class of robotics foundation models. The Gemini Robotics family marks a substantial step towards developing general-purpose robots that realize AI's potential in the physical world.

Deep RL at Scale: Sorting Waste in Office Buildings with a Fleet of Mobile Manipulators

May 05, 2023

Alexander Herzog, Kanishka Rao, Karol Hausman, Yao Lu, Paul Wohlhart, Mengyuan Yan, Jessica Lin, Montserrat Gonzalez Arenas, Ted Xiao, Daniel Kappler(+30 more)

Abstract:We describe a system for deep reinforcement learning of robotic manipulation skills applied to a large-scale real-world task: sorting recyclables and trash in office buildings. Real-world deployment of deep RL policies requires not only effective training algorithms, but the ability to bootstrap real-world training and enable broad generalization. To this end, our system combines scalable deep RL from real-world data with bootstrapping from training in simulation, and incorporates auxiliary inputs from existing computer vision systems as a way to boost generalization to novel objects, while retaining the benefits of end-to-end training. We analyze the tradeoffs of different design decisions in our system, and present a large-scale empirical validation that includes training on real-world data gathered over the course of 24 months of experimentation, across a fleet of 23 robots in three office buildings, with a total training set of 9527 hours of robotic experience. Our final validation also consists of 4800 evaluation trials across 240 waste station configurations, in order to evaluate in detail the impact of the design decisions in our system, the scaling effects of including more real-world data, and the performance of the method on novel objects. The project's website and videos can be found at http://rl-at-scale.github.io.

* Published at Robotics: Science and Systems 2023

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Apr 04, 2022

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog(+33 more)

Abstract:Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding by means of pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model's "hands and eyes," while the language model supplies high-level semantic knowledge about the task. We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show the need for real-world grounding and that this approach is capable of completing long-horizon, abstract, natural language instructions on a mobile manipulator. The project's website and the video can be found at https://say-can.github.io/

* See website at https://say-can.github.io/
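The grounding idea in this abstract can be illustrated with a minimal sketch. It assumes each candidate skill has two scores: a language-model score for how useful the skill is for the instruction, and a value-function score for how likely the skill is to succeed in the current scene. Multiplying the two is one simple way to combine them; the skill names, the numbers, and the `select_skill` helper are all hypothetical and are not taken from the project's code.

```python
from typing import Dict


def select_skill(llm_scores: Dict[str, float], affordances: Dict[str, float]) -> str:
    """Return the skill with the highest combined score.

    llm_scores: how relevant the language model thinks each skill is (assumed given).
    affordances: value-function estimate that each skill can succeed right now.
    """
    combined = {
        skill: llm_scores[skill] * affordances.get(skill, 0.0)
        for skill in llm_scores
    }
    return max(combined, key=combined.get)


# Hypothetical scores for the instruction "clean up the spill".
llm_scores = {
    "pick up the sponge": 0.6,  # the language model's favourite
    "find a sponge": 0.3,
    "go to the table": 0.1,
}
# No sponge is visible yet, so picking one up is unlikely to succeed.
affordances = {
    "pick up the sponge": 0.1,
    "find a sponge": 0.9,
    "go to the table": 0.8,
}

best = select_skill(llm_scores, affordances)
print(best)
```

In this toy scene the language model alone would choose "pick up the sponge", but the low affordance score pushes the combined choice toward a skill the robot can actually execute.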

How to Train Your Robot with Deep Reinforcement Learning; Lessons We've Learned

Feb 04, 2021

Julian Ibarz, Jie Tan, Chelsea Finn, Mrinal Kalakrishnan, Peter Pastor, Sergey Levine

Abstract:Deep reinforcement learning (RL) has emerged as a promising approach for autonomously acquiring complex behaviors from low-level sensor observations. Although a large portion of deep RL research has focused on applications in video games and simulated control, which does not connect with the constraints of learning in real environments, deep RL has also demonstrated promise in enabling physical robots to learn complex skills in the real world. At the same time, real-world robotics provides an appealing domain for evaluating such algorithms, as it connects directly to how humans learn: as an embodied agent in the real world. Learning to perceive and move in the real world presents numerous challenges, some of which are easier to address than others, and some of which are often not considered in RL research that focuses only on simulated domains. In this review article, we present a number of case studies involving robotic deep RL. Building off of these case studies, we discuss commonly perceived challenges in deep RL and how they have been addressed in these works. We also provide an overview of other outstanding challenges, many of which are unique to the real-world robotics setting and are not often the focus of mainstream RL research. Our goal is to provide a resource both for roboticists and machine learning researchers who are interested in furthering the progress of deep RL in the real world.

* International Journal of Robotics Research (IJRR), February 2021

Quantile QT-Opt for Risk-Aware Vision-Based Robotic Grasping

Oct 01, 2019

Cristian Bodnar, Adrian Li, Karol Hausman, Peter Pastor, Mrinal Kalakrishnan

Abstract:The distributional perspective on reinforcement learning (RL) has given rise to a series of successful Q-learning algorithms, resulting in state-of-the-art performance in arcade game environments. However, it has not yet been analyzed how these findings from a discrete setting translate to complex practical applications characterized by noisy, high dimensional and continuous state-action spaces. In this work, we propose Quantile QT-Opt (Q2-Opt), a distributional variant of the recently introduced distributed Q-learning algorithm for continuous domains, and examine its behaviour in a series of simulated and real vision-based robotic grasping tasks. The absence of an actor in Q2-Opt allows us to directly draw a parallel to the previous discrete experiments in the literature without the additional complexities induced by an actor-critic architecture. We demonstrate that Q2-Opt achieves a superior vision-based object grasping success rate, while also being more sample efficient. The distributional formulation also allows us to experiment with various risk-distortion metrics that give us an indication of how robots can concretely manage risk in practice using a Deep RL control policy. As an additional contribution, we perform experiments on offline datasets and compare them with the latest findings from discrete settings. Surprisingly, we find that there is a discrepancy between our results and the previous batch RL findings from the literature obtained on arcade game environments.

* Under review at ICRA 2020
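To make the risk-distortion idea concrete, here is a small sketch of how a distributional critic's quantile estimates could be turned into a risk-averse action choice. It assumes the critic returns a list of equally weighted return quantiles per action, and it uses conditional value at risk (CVaR, the mean of the worst quantiles) as one common distortion. The action names, the quantile values, and the `cvar` helper are hypothetical and do not reproduce the paper's implementation.

```python
from typing import Dict, List


def cvar(quantiles: List[float], alpha: float) -> float:
    """Mean of the lowest `alpha` fraction of equally weighted quantile values."""
    ordered = sorted(quantiles)
    k = max(1, int(len(ordered) * alpha))  # always keep at least one quantile
    return sum(ordered[:k]) / k


def pick_action(quantiles_per_action: Dict[str, List[float]], alpha: float) -> str:
    """Choose the action whose CVaR at level `alpha` is largest.

    alpha = 1.0 averages every quantile, i.e. the ordinary risk-neutral mean.
    """
    return max(quantiles_per_action, key=lambda a: cvar(quantiles_per_action[a], alpha))


# Hypothetical return quantiles for two candidate grasps.
quantiles_per_action = {
    "safe": [0.60, 0.65, 0.70, 0.75],   # mean 0.675, narrow spread
    "risky": [0.00, 0.80, 1.00, 1.00],  # mean 0.70, but a bad worst case
}

neutral = pick_action(quantiles_per_action, alpha=1.0)
averse = pick_action(quantiles_per_action, alpha=0.5)
print(neutral, averse)
```

With these numbers the risk-neutral choice follows the higher mean, while the risk-averse choice looks only at the lower half of each distribution and prefers the grasp with the better worst case.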

Learning Probabilistic Multi-Modal Actor Models for Vision-Based Robotic Grasping

Apr 15, 2019

Mengyuan Yan, Adrian Li, Mrinal Kalakrishnan, Peter Pastor

Abstract:Many previous works approach vision-based robotic grasping by training a value network that evaluates grasp proposals. These approaches require an optimization process at run-time to infer the best action from the value network. As a result, the inference time grows exponentially as the dimension of action space increases. We propose an alternative method, by directly training a neural density model to approximate the conditional distribution of successful grasp poses from the input images. We construct a neural network that combines Gaussian mixture and normalizing flows, which is able to represent multi-modal, complex probability distributions. We demonstrate both in simulation and on a real robot that the proposed actor model achieves similar performance compared to the value network using the Cross-Entropy Method (CEM) for inference, on top-down grasping with a 4-dimensional action space. Our actor model reduces the inference time by a factor of 3 compared to the state-of-the-art CEM method. We believe that actor models will play an important role when scaling up these approaches to higher dimensional action spaces.

* The 2019 International Conference on Robotics and Automation (ICRA)
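The run-time optimization that the abstract contrasts against can be sketched as follows. This is a generic Cross-Entropy Method loop over a 4-dimensional action, written under stated assumptions: a toy quadratic `toy_value` stands in for a learned value network, and the sample counts, iteration count, and `TARGET` action are illustrative values rather than settings from the paper.

```python
import random
import statistics
from typing import Callable, List, Sequence

# Hypothetical "best" action used only to build the toy value function.
TARGET = [0.3, -0.2, 0.5, 0.1]


def toy_value(action: Sequence[float]) -> float:
    """Stand-in for a value network: higher is better, peak at TARGET."""
    return -sum((a - t) ** 2 for a, t in zip(action, TARGET))


def cem_maximize(
    value_fn: Callable[[Sequence[float]], float],
    dim: int,
    iterations: int = 10,
    samples: int = 64,
    elites: int = 8,
    seed: int = 0,
) -> List[float]:
    """Search for a high-value action by repeatedly refitting a Gaussian."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [1.0] * dim
    for _ in range(iterations):
        # Sample candidate actions from the current diagonal Gaussian.
        candidates = [
            [rng.gauss(mean[d], std[d]) for d in range(dim)]
            for _ in range(samples)
        ]
        # Keep the highest-scoring candidates and refit the Gaussian to them.
        candidates.sort(key=value_fn, reverse=True)
        top = candidates[:elites]
        mean = [statistics.fmean(c[d] for c in top) for d in range(dim)]
        std = [max(statistics.pstdev([c[d] for c in top]), 1e-3) for d in range(dim)]
    return mean


best_action = cem_maximize(toy_value, dim=4)
print([round(a, 2) for a in best_action])
```

Every call evaluates the value function `iterations * samples` times (640 here), which is the per-decision cost that a directly sampled actor model avoids.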

QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation

Nov 28, 2018

Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke(+1 more)

Abstract:In this paper, we study the problem of learning vision-based dynamic manipulation skills using a scalable reinforcement learning approach. We study this problem in the context of grasping, a longstanding challenge in robotic manipulation. In contrast to static learning behaviors that choose a grasp point and then execute the desired grasp, our method enables closed-loop vision-based control, whereby the robot continuously updates its grasp strategy based on the most recent observations to optimize long-horizon grasp success. To that end, we introduce QT-Opt, a scalable self-supervised vision-based reinforcement learning framework that can leverage over 580k real-world grasp attempts to train a deep neural network Q-function with over 1.2M parameters to perform closed-loop, real-world grasping that generalizes to 96% grasp success on unseen objects. Aside from attaining a very high success rate, our method exhibits behaviors that are quite distinct from more standard grasping systems: using only RGB vision-based perception from an over-the-shoulder camera, our method automatically learns regrasping strategies, probes objects to find the most effective grasps, learns to reposition objects and perform other non-prehensile pre-grasp manipulations, and responds dynamically to disturbances and perturbations.

* CoRL 2018 camera ready. 23 pages, 14 figures

End-to-End Learning of Semantic Grasping

Nov 09, 2017

Eric Jang, Sudheendra Vijayanarasimhan, Peter Pastor, Julian Ibarz, Sergey Levine

Abstract:We consider the task of semantic robotic grasping, in which a robot picks up an object of a user-specified class using only monocular images. Inspired by the two-stream hypothesis of visual reasoning, we present a semantic grasping framework that learns object detection, classification, and grasp planning in an end-to-end fashion. A "ventral stream" recognizes object class while a "dorsal stream" simultaneously interprets the geometric relationships necessary to execute successful grasps. We leverage the autonomous data collection capabilities of robots to obtain a large self-supervised dataset for training the dorsal stream, and use semi-supervised label propagation to train the ventral stream with only a modest amount of human supervision. We experimentally show that our approach improves upon grasping systems whose components are not learned end-to-end, including a baseline method that uses bounding box detection. Furthermore, we show that jointly training our model with auxiliary data consisting of non-semantic grasping data, as well as semantically labeled images without grasp actions, has the potential to substantially improve semantic grasping performance.

* 14 pages

Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping

Sep 25, 2017

Konstantinos Bousmalis, Alex Irpan, Paul Wohlhart, Yunfei Bai, Matthew Kelcey, Mrinal Kalakrishnan, Laura Downs, Julian Ibarz, Peter Pastor, Kurt Konolige(+2 more)

Abstract:Instrumenting and collecting annotated visual grasping datasets to train modern machine learning algorithms can be extremely time-consuming and expensive. An appealing alternative is to use off-the-shelf simulators to render synthetic data for which ground-truth annotations are generated automatically. Unfortunately, models trained purely on simulated data often fail to generalize to the real world. We study how randomized simulated environments and domain adaptation methods can be extended to train a grasping system to grasp novel objects from raw monocular RGB images. We extensively evaluate our approaches with a total of more than 25,000 physical test grasps, studying a range of simulation conditions and domain adaptation methods, including a novel extension of pixel-level domain adaptation that we term the GraspGAN. We show that, by using synthetic data and domain adaptation, we are able to reduce the number of real-world samples needed to achieve a given level of performance by up to 50 times, using only randomly generated simulated objects. We also show that by using only unlabeled real-world data and our GraspGAN methodology, we obtain real-world grasping performance without any real-world labels that is similar to that achieved with 939,777 labeled real-world samples.

* 9 pages, 5 figures, 3 tables

Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection

Aug 28, 2016

Sergey Levine, Peter Pastor, Alex Krizhevsky, Deirdre Quillen

Abstract:We describe a learning-based approach to hand-eye coordination for robotic grasping from monocular images. To learn hand-eye coordination for grasping, we trained a large convolutional neural network to predict the probability that task-space motion of the gripper will result in successful grasps, using only monocular camera images and independently of camera calibration or the current robot pose. This requires the network to observe the spatial relationship between the gripper and objects in the scene, thus learning hand-eye coordination. We then use this network to servo the gripper in real time to achieve successful grasps. To train our network, we collected over 800,000 grasp attempts over the course of two months, using between 6 and 14 robotic manipulators at any given time, with differences in camera placement and hardware. Our experimental evaluation demonstrates that our method achieves effective real-time control, can successfully grasp novel objects, and corrects mistakes by continuous servoing.

* This is an extended version of "Learning Hand-Eye Coordination for Robotic Grasping with Large-Scale Data Collection," ISER 2016. Draft modified to correct typo in Algorithm 1 and add a link to the publicly available dataset
