Edward Johns

One-Shot Dual-Arm Imitation Learning

Mar 10, 2025

Yilong Wang, Edward Johns

Abstract:We introduce One-Shot Dual-Arm Imitation Learning (ODIL), which enables dual-arm robots to learn precise and coordinated everyday tasks from just a single demonstration of the task. ODIL uses a new three-stage visual servoing (3-VS) method for precise alignment between the end-effector and target object, after which replay of the demonstration trajectory is sufficient to perform the task. This is achieved without requiring prior task or object knowledge, or additional data collection and training following the single demonstration. Furthermore, we propose a new dual-arm coordination paradigm for learning dual-arm tasks from a single demonstration. ODIL was tested on a real-world dual-arm robot, demonstrating state-of-the-art performance across six precise and coordinated tasks in both 4-DoF and 6-DoF settings, and showing robustness in the presence of distractor objects and partial occlusions. Videos are available at: https://www.robot-learning.uk/one-shot-dual-arm.

* Accepted at ICRA 2025. Project Webpage: https://www.robot-learning.uk/one-shot-dual-arm

Instant Policy: In-Context Imitation Learning via Graph Diffusion

Nov 19, 2024

Vitalis Vosylius, Edward Johns

Abstract:Following the impressive capabilities of in-context learning with large transformers, In-Context Imitation Learning (ICIL) is a promising opportunity for robotics. We introduce Instant Policy, which learns new tasks instantly (without further training) from just one or two demonstrations, achieving ICIL through two key components. First, we introduce inductive biases through a graph representation and model ICIL as a graph generation problem with a learned diffusion process, enabling structured reasoning over demonstrations, observations, and actions. Second, we show that such a model can be trained using pseudo-demonstrations - arbitrary trajectories generated in simulation - as a virtually infinite pool of training data. Simulated and real experiments show that Instant Policy enables rapid learning of various everyday robot tasks. We also show how it can serve as a foundation for cross-embodiment and zero-shot transfer to language-defined tasks. Code and videos are available at https://www.robot-learning.uk/instant-policy.

* Code and videos are available on our project webpage at https://www.robot-learning.uk/instant-policy

MILES: Making Imitation Learning Easy with Self-Supervision

Oct 25, 2024

Georgios Papagiannis, Edward Johns

Abstract:Data collection in imitation learning often requires significant, laborious human supervision, such as numerous demonstrations, and/or frequent environment resets for methods that incorporate reinforcement learning. In this work, we propose an alternative approach, MILES: a fully autonomous, self-supervised data collection paradigm, and we show that this enables efficient policy learning from just a single demonstration and a single environment reset. MILES autonomously learns a policy for returning to and then following the single demonstration, whilst being self-guided during data collection, eliminating the need for additional human interventions. We evaluated MILES across several real-world tasks, including tasks that require precise contact-rich manipulation such as locking a lock with a key. We found that, under the constraints of a single demonstration and no repeated environment resetting, MILES significantly outperforms state-of-the-art alternatives like imitation learning methods that leverage reinforcement learning. Videos of our experiments and code can be found on our webpage: www.robot-learning.uk/miles.

* Published at the Conference on Robot Learning (CoRL) 2024

Adapting Skills to Novel Grasps: A Self-Supervised Approach

Jul 31, 2024

Georgios Papagiannis, Kamil Dreczkowski, Vitalis Vosylius, Edward Johns

Abstract:In this paper, we study the problem of adapting manipulation trajectories involving grasped objects (e.g. tools) defined for a single grasp pose to novel grasp poses. A common approach to address this is to define a new trajectory for each possible grasp explicitly, but this is highly inefficient. Instead, we propose a method to adapt such trajectories directly while only requiring a period of self-supervised data collection, during which a camera observes the robot's end-effector moving with the object rigidly grasped. Importantly, our method requires no prior knowledge of the grasped object (such as a 3D CAD model), it can work with RGB images, depth images, or both, and it requires no camera calibration. Through a series of real-world experiments involving 1360 evaluations, we find that self-supervised RGB data consistently outperforms alternatives that rely on depth images including several state-of-the-art pose estimation methods. Compared to the best-performing baseline, our method results in an average of 28.5% higher success rate when adapting manipulation trajectories to novel grasps on several everyday tasks. Videos of the experiments are available on our webpage at https://www.robot-learning.uk/adapting-skills

* Accepted at IROS 2024
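
The geometry behind adapting a trajectory to a new grasp can be illustrated with homogeneous transforms. The sketch below is not the paper's method, which learns the adaptation from self-supervised images without camera calibration. It assumes, hypothetically, that the object's pose in the end-effector frame is already known for both the demonstrated grasp and the novel grasp, and it computes the end-effector poses that make the grasped object retrace the demonstrated object motion. All function and variable names are illustrative.

```python
import numpy as np

def make_transform(yaw, translation):
    """4x4 homogeneous transform: rotation about z by `yaw`, plus a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def adapt_trajectory(ee_poses_demo, T_ee_obj_demo, T_ee_obj_new):
    """Return end-effector poses that reproduce the demonstrated object
    motion when the object is held with a different grasp."""
    # Object pose in the world is T_world_ee @ T_ee_obj, so we require
    # T_world_ee_new @ T_ee_obj_new == T_world_ee_demo @ T_ee_obj_demo.
    correction = T_ee_obj_demo @ np.linalg.inv(T_ee_obj_new)
    return [T_world_ee @ correction for T_world_ee in ee_poses_demo]

# Hypothetical demonstration: two end-effector poses in the world frame.
ee_poses_demo = [
    make_transform(0.0, [0.4, 0.0, 0.3]),
    make_transform(np.pi / 2, [0.4, 0.1, 0.2]),
]
# Object pose in the end-effector frame for each grasp (assumed known here).
T_ee_obj_demo = make_transform(0.0, [0.0, 0.0, 0.10])
T_ee_obj_new = make_transform(np.pi / 6, [0.02, 0.0, 0.12])

ee_poses_new = adapt_trajectory(ee_poses_demo, T_ee_obj_demo, T_ee_obj_new)
```

With the adapted poses, the object follows the same world-frame path as in the demonstration even though the end-effector path differs.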

R+X: Retrieval and Execution from Everyday Human Videos

Jul 17, 2024

Georgios Papagiannis, Norman Di Palo, Pietro Vitiello, Edward Johns

Abstract:We present R+X, a framework which enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then executes the skill by conditioning an in-context imitation learning method on this behaviour. By leveraging a Vision Language Model (VLM) for retrieval, R+X does not require any manual annotation of the videos, and by leveraging in-context learning for execution, robots can perform commanded skills immediately, without requiring a period of training on the retrieved videos. Experiments studying a range of everyday household tasks show that R+X succeeds at translating unlabelled human videos into robust robot skills, and that R+X outperforms several recent alternative methods. Videos are available at https://www.robot-learning.uk/r-plus-x.

Keypoint Action Tokens Enable In-Context Imitation Learning in Robotics

Mar 28, 2024

Norman Di Palo, Edward Johns

Abstract:We show that off-the-shelf text-based Transformers, with no additional training, can perform few-shot in-context visual imitation learning, mapping visual observations to action sequences that emulate the demonstrator's behaviour. We achieve this by transforming visual observations (inputs) and trajectories of actions (outputs) into sequences of tokens that a text-pretrained Transformer (GPT-4 Turbo) can ingest and generate, via a framework we call Keypoint Action Tokens (KAT). Despite being trained only on language, we show that these Transformers excel at translating tokenised visual keypoint observations into action trajectories, performing on par or better than state-of-the-art imitation learning (diffusion policies) in the low-data regime on a suite of real-world, everyday tasks. Rather than operating in the language domain as is typical, KAT leverages text-based Transformers to operate in the vision and action domains to learn general patterns in demonstration data for highly efficient imitation learning, indicating promising new avenues for repurposing natural language models for embodied tasks. Videos are available at https://www.robot-learning.uk/keypoint-action-tokens.

DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models

Feb 20, 2024

Norman Di Palo, Edward Johns

Abstract:We propose DINOBot, a novel imitation learning framework for robot manipulation, which leverages the image-level and pixel-level capabilities of features extracted from Vision Transformers trained with DINO. When interacting with a novel object, DINOBot first uses these features to retrieve the most visually similar object experienced during human demonstrations, and then uses this object to align its end-effector with the novel object to enable effective interaction. Through a series of real-world experiments on everyday tasks, we show that exploiting both the image-level and pixel-level properties of vision foundation models enables unprecedented learning efficiency and generalisation. Videos and code are available at https://www.robot-learning.uk/dinobot.

* To appear at 2024 IEEE International Conference on Robotics and Automation (ICRA)

On the Effectiveness of Retrieval, Alignment, and Replay in Manipulation

Dec 19, 2023

Norman Di Palo, Edward Johns

Abstract:Imitation learning with visual observations is notoriously inefficient when addressed with end-to-end behavioural cloning methods. In this paper, we explore an alternative paradigm which decomposes reasoning into three phases. First, a retrieval phase, which informs the robot what it can do with an object. Second, an alignment phase, which informs the robot where to interact with the object. And third, a replay phase, which informs the robot how to interact with the object. Through a series of real-world experiments on everyday tasks, such as grasping, pouring, and inserting objects, we show that this decomposition brings unprecedented learning efficiency, and effective inter- and intra-class generalisation. Videos are available at https://www.robot-learning.uk/retrieval-alignment-replay.

* Published in IEEE Robotics and Automation Letters (RA-L). (Accepted December 2023)
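
The three-phase decomposition described in this abstract (retrieval, alignment, replay) can be illustrated with a deliberately simplified, position-only toy. This is a sketch under strong assumptions, not the authors' implementation: retrieval is reduced to cosine similarity over made-up feature vectors, alignment to reproducing the demonstrated offset from the object, and replay to re-applying demonstrated displacements. Every name and number below is hypothetical.

```python
import numpy as np

def retrieve(query_feature, memory_features):
    """Index of the stored demonstration most cosine-similar to the query."""
    q = query_feature / np.linalg.norm(query_feature)
    m = memory_features / np.linalg.norm(memory_features, axis=1, keepdims=True)
    return int(np.argmax(m @ q))

def align(object_position, demo_offset):
    """Start position with the same offset from the object as in the demo."""
    return object_position + demo_offset

def replay(start_position, demo_deltas):
    """Waypoints obtained by re-applying the demonstrated displacements."""
    return start_position + np.cumsum(demo_deltas, axis=0)

# Hypothetical memory of two demonstrations (features, offsets, displacements).
memory_features = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
demos = [
    {"offset": np.array([0.0, 0.0, 0.1]),
     "deltas": np.array([[0.0, 0.0, -0.05]])},
    {"offset": np.array([0.0, 0.0, 0.2]),
     "deltas": np.array([[0.0, 0.0, -0.1], [0.05, 0.0, 0.0]])},
]

query_feature = np.array([0.1, 0.9, 0.0])    # feature of the novel object
object_position = np.array([0.5, 0.2, 0.0])  # where the novel object sits

idx = retrieve(query_feature, memory_features)        # what to do
start = align(object_position, demos[idx]["offset"])  # where to interact
waypoints = replay(start, demos[idx]["deltas"])       # how to interact
```

Separating the phases this way means only the retrieval and alignment steps depend on the new observation, while the replayed motion is reused unchanged from the demonstration.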

Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models

Dec 07, 2023

Ivan Kapelyukh, Yifei Ren, Ignacio Alzugaray, Edward Johns

Abstract:We introduce Dream2Real, a robotics framework which integrates vision-language models (VLMs) trained on 2D data into a 3D object rearrangement pipeline. This is achieved by the robot autonomously constructing a 3D representation of the scene, where objects can be rearranged virtually and an image of the resulting arrangement rendered. These renders are evaluated by a VLM, so that the arrangement which best satisfies the user instruction is selected and recreated in the real world with pick-and-place. This enables language-conditioned rearrangement to be performed zero-shot, without needing to collect a training dataset of example arrangements. Results on a series of real-world tasks show that this framework is robust to distractors, controllable by language, capable of understanding complex multi-object relations, and readily applicable to both tabletop and 6-DoF rearrangement tasks.

* Project webpage with videos: https://www.robot-learning.uk/dream2real

SceneScore: Learning a Cost Function for Object Arrangement

Nov 14, 2023

Ivan Kapelyukh, Edward Johns

Abstract:Arranging objects correctly is a key capability for robots which unlocks a wide range of useful tasks. A prerequisite for creating successful arrangements is the ability to evaluate the desirability of a given arrangement. Our method "SceneScore" learns a cost function for arrangements, such that desirable, human-like arrangements have a low cost. We learn the distribution of training arrangements offline using an energy-based model, solely from example images without requiring environment interaction or human supervision. Our model is represented by a graph neural network which learns object-object relations, using graphs constructed from images. Experiments demonstrate that the learned cost function can be used to predict poses for missing objects, generalise to novel objects using semantic features, and can be composed with other cost functions to satisfy constraints at inference time.

* Presented at CoRL 2023 LEAP Workshop. Webpage: https://sites.google.com/view/scenescore
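
The idea of composing a learned arrangement cost with other cost functions at inference time can be sketched as a search over candidate poses. The example below is only an illustration: a hand-written quadratic stands in for the learned cost (the paper uses an energy-based graph neural network), the keep-out constraint is invented for the example, and all names and values are hypothetical.

```python
import numpy as np

def arrangement_cost(fork_xy, plate_xy):
    """Stand-in for a learned cost: low when the fork is 0.15 m left of the plate."""
    target = plate_xy + np.array([-0.15, 0.0])
    return float(np.sum((fork_xy - target) ** 2))

def keep_out_cost(fork_xy, obstacle_xy, radius=0.03):
    """Constraint cost: large penalty inside a keep-out circle, zero elsewhere."""
    return 1e3 if np.linalg.norm(fork_xy - obstacle_xy) < radius else 0.0

def best_pose(candidates, cost_fns):
    """Candidate with the lowest sum of all supplied cost functions."""
    totals = [sum(fn(c) for fn in cost_fns) for c in candidates]
    return candidates[int(np.argmin(totals))]

plate = np.zeros(2)
obstacle = np.array([-0.15, 0.0])  # something already occupies the ideal spot

# Candidate fork positions on a 5 cm grid.
candidates = [np.array([x, y])
              for x in np.linspace(-0.3, 0.3, 13)
              for y in np.linspace(-0.1, 0.1, 5)]

learned = lambda p: arrangement_cost(p, plate)
constraint = lambda p: keep_out_cost(p, obstacle)

unconstrained = best_pose(candidates, [learned])
constrained = best_pose(candidates, [learned, constraint])
```

Because the costs are simply summed, a constraint can be added or dropped at inference time without changing the stand-in learned cost.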
