Maria Attarian

GeoMatch++: Morphology Conditioned Geometry Matching for Multi-Embodiment Grasping

Dec 25, 2024

Yunze Wei, Maria Attarian, Igor Gilitschenski

Abstract:Despite recent progress on multi-finger dexterous grasping, current methods focus on single grippers and unseen objects, and even those that explore cross-embodiment often fail to generalize well to unseen end-effectors. This work addresses the problem of dexterous grasping generalization to unseen end-effectors via a unified policy that learns correlation between gripper morphology and object geometry. Robot morphology contains rich information representing how joints and links connect and move with respect to each other, and thus we leverage it through attention to learn better end-effector geometry features. Our experiments show an average of 9.64% increase in grasp success rate across 3 out-of-domain end-effectors compared to previous methods.

* 8 pages, 3 figures, CoRL Workshop on Learning Robot Fine and Dexterous Manipulation: Perception and Control

Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

Mar 19, 2024

Vidhi Jain, Maria Attarian, Nikhil J Joshi, Ayzaan Wahid, Danny Driess, Quan Vuong, Pannag R Sanketi, Pierre Sermanet, Stefan Welker, Christine Chan (+3 more)

Abstract:While large-scale robotic systems typically rely on textual instructions for tasks, this work explores a different approach: can robots infer the task directly from observing humans? This shift necessitates the robot's ability to decode human intent and translate it into executable actions within its physical constraints and environment. We introduce Vid2Robot, a novel end-to-end video-based learning framework for robots. Given a video demonstration of a manipulation task and current visual observations, Vid2Robot directly produces robot actions. This is achieved through a unified representation model trained on a large dataset of human video and robot trajectory. The model leverages cross-attention mechanisms to fuse prompt video features to the robot's current state and generate appropriate actions that mimic the observed task. To further improve policy performance, we propose auxiliary contrastive losses that enhance the alignment between human and robot video representations. We evaluate Vid2Robot on real-world robots, demonstrating a 20% improvement in performance compared to other video-conditioned policies when using human demonstration videos. Additionally, our model exhibits emergent capabilities, such as successfully transferring observed motions from one object to another, and long-horizon composition, thus showcasing its potential for real-world applications. Project website: vid2robot.github.io
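
The abstract above mentions auxiliary contrastive losses that align human and robot video representations but does not give their exact form. As a rough illustration only, the sketch below uses a generic InfoNCE-style loss over paired embeddings; the function names, the temperature value, and the toy vectors are all assumptions and are not taken from the paper.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(human, robot, temperature=0.1):
    """Generic InfoNCE-style loss (an assumed stand-in, not the paper's loss).

    human[i] and robot[i] are embeddings of the same task; every other
    robot[j] in the batch is treated as a negative for human[i].
    """
    n = len(human)
    total = 0.0
    for i in range(n):
        logits = [dot(human[i], robot[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching robot clip.
        total += log_denominator - logits[i]
    return total / n


# Toy embeddings (hypothetical): two tasks, two dimensions.
human_emb = [[1.0, 0.0], [0.0, 1.0]]
robot_aligned = [[1.0, 0.0], [0.0, 1.0]]
robot_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(human_emb, robot_aligned)
loss_swapped = info_nce(human_emb, robot_swapped)
print(loss_aligned < loss_swapped)
```

In this toy setup the loss is small when each human embedding is closest to its own robot embedding and large when the pairs are swapped, which is the behavior an alignment objective of this kind is meant to reward.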

* Robot learning: Imitation Learning, Robot Perception, Sensing & Vision, Grasping & Manipulation

Learning to Learn Faster from Human Feedback with Language Model Predictive Control

Feb 18, 2024

Jacky Liang, Fei Xia, Wenhao Yu, Andy Zeng, Montserrat Gonzalez Arenas, Maria Attarian, Maria Bauza, Matthew Bennice, Alex Bewley, Adil Dostmohamed (+40 more)

Abstract:Large language models (LLMs) have been shown to exhibit a wide range of capabilities, such as writing robot code from language commands -- enabling non-experts to direct robot behaviors, modify them based on feedback, or compose them to perform new tasks. However, these capabilities (driven by in-context learning) are limited to short-term interactions, where users' feedback remains relevant for only as long as it fits within the context size of the LLM, and can be forgotten over longer interactions. In this work, we investigate fine-tuning the robot code-writing LLMs, to remember their in-context interactions and improve their teachability i.e., how efficiently they adapt to human inputs (measured by average number of corrections before the user considers the task successful). Our key observation is that when human-robot interactions are formulated as a partially observable Markov decision process (in which human language inputs are observations, and robot code outputs are actions), then training an LLM to complete previous interactions can be viewed as training a transition dynamics model -- that can be combined with classic robotics techniques such as model predictive control (MPC) to discover shorter paths to success. This gives rise to Language Model Predictive Control (LMPC), a framework that fine-tunes PaLM 2 to improve its teachability on 78 tasks across 5 robot embodiments -- improving non-expert teaching success rates of unseen tasks by 26.9% while reducing the average number of human corrections from 2.4 to 1.9. Experiments show that LMPC also produces strong meta-learners, improving the success rate of in-context learning new tasks on unseen robot embodiments and APIs by 31.5%. See videos, code, and demos at: https://robot-teaching.github.io/.

Geometry Matching for Multi-Embodiment Grasping

Dec 06, 2023

Maria Attarian, Muhammad Adil Asif, Jingzhou Liu, Ruthrash Hari, Animesh Garg, Igor Gilitschenski, Jonathan Tompson

Abstract:Many existing learning-based grasping approaches concentrate on a single embodiment, provide limited generalization to higher DoF end-effectors and cannot capture a diverse set of grasp modes. We tackle the problem of grasping using multiple embodiments by learning rich geometric representations for both objects and end-effectors using Graph Neural Networks. Our novel method - GeoMatch - applies supervised learning on grasping data from multiple embodiments, learning end-to-end contact point likelihood maps as well as conditional autoregressive predictions of grasps keypoint-by-keypoint. We compare our method against baselines that support multiple embodiments. Our approach performs better across three end-effectors, while also producing diverse grasps. Examples, including real robot demos, can be found at geo-match.github.io.

* 7th Annual Conference on Robot Learning, 2023

See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction

Oct 07, 2022

Maria Attarian, Advaya Gupta, Ziyi Zhou, Wei Yu, Igor Gilitschenski, Animesh Garg

Abstract:Cognitive planning is the structural decomposition of complex tasks into a sequence of future behaviors. In the computational setting, performing cognitive planning entails grounding plans and concepts in one or more modalities in order to leverage them for low level control. Since real-world tasks are often described in natural language, we devise a cognitive planning algorithm via language-guided video prediction. Current video prediction models do not support conditioning on natural language instructions. Therefore, we propose a new video prediction architecture which leverages the power of pre-trained transformers. The network is endowed with the ability to ground concepts based on natural language input with generalization to unseen objects. We demonstrate the effectiveness of this approach on a new simulation dataset, where each task is defined by a high-level action described in natural language. Our experiments compare our method against one video generation baseline without planning or action grounding and showcase significant improvements. Our ablation studies highlight an improved generalization to unseen objects that natural language embeddings offer to concept grounding ability, as well as the importance of planning towards visual "imagination" of a task.

Transporter Networks: Rearranging the Visual World for Robotic Manipulation

Oct 27, 2020

Andy Zeng, Pete Florence, Jonathan Tompson, Stefan Welker, Jonathan Chien, Maria Attarian, Travis Armstrong, Ivan Krasin, Dan Duong, Vikas Sindhwani (+1 more)

Abstract:Robotic manipulation can be formulated as inducing a sequence of spatial displacements: where the space being moved can encompass an object, part of an object, or end effector. In this work, we propose the Transporter Network, a simple model architecture that rearranges deep features to infer spatial displacements from visual input - which can parameterize robot actions. It makes no assumptions of objectness (e.g. canonical poses, models, or keypoints), it exploits spatial symmetries, and is orders of magnitude more sample efficient than our benchmarked alternatives in learning vision-based manipulation tasks: from stacking a pyramid of blocks, to assembling kits with unseen objects; from manipulating deformable ropes, to pushing piles of small objects with closed-loop feedback. Our method can represent complex multi-modal policy distributions and generalizes to multi-step sequential tasks, as well as 6DoF pick-and-place. Experiments on 10 simulated tasks show that it learns faster and generalizes better than a variety of end-to-end baselines, including policies that use ground-truth object poses. We validate our methods with hardware in the real world. Experiment videos and code will be made available at https://transporternets.github.io
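
The abstract describes rearranging deep features to infer spatial displacements without spelling out the mechanism. One way to picture the idea is template matching by cross-correlation: a patch of features around the picked region is slid over the scene features, and the best-scoring offset is read as the placement. The sketch below is a plain-Python illustration under that assumption; the function names and the toy grids are hypothetical, and it is not the authors' network.

```python
def cross_correlate(feature_map, kernel):
    """Slide `kernel` over `feature_map` (both 2D lists) and return the
    grid of correlation scores for every valid top-left offset."""
    height, width = len(feature_map), len(feature_map[0])
    kh, kw = len(kernel), len(kernel[0])
    scores = []
    for r in range(height - kh + 1):
        row = []
        for c in range(width - kw + 1):
            row.append(sum(feature_map[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        scores.append(row)
    return scores


def best_placement(scores):
    """Return the (row, col) offset with the highest correlation score."""
    _, r, c = max((v, r, c)
                  for r, row in enumerate(scores)
                  for c, v in enumerate(row))
    return r, c


# Hypothetical "picked" feature patch and a scene that contains the same
# pattern with its top-left corner at row 2, column 1.
picked_patch = [[1.0, 2.0],
                [3.0, 4.0]]
scene = [[0.0, 0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 2.0, 0.0],
         [0.0, 3.0, 4.0, 0.0]]

place_at = best_placement(cross_correlate(scene, picked_patch))
print(place_at)
```

Because the score grid covers every spatial offset at once, a formulation like this can express several plausible placements in a single output, which is in the spirit of the multi-modal policies the abstract mentions.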

* Project webpage: https://transporternets.github.io

Transforming Neural Network Visual Representations to Predict Human Judgments of Similarity

Oct 13, 2020

Maria Attarian, Brett D. Roads, Michael C. Mozer

Abstract:Deep-learning vision models have shown intriguing similarities and differences with respect to human vision. We investigate how to bring machine visual representations into better alignment with human representations. Human representations are often inferred from behavioral evidence such as the selection of an image most similar to a query image. We find that with appropriate linear transformations of deep embeddings, we can improve prediction of human binary choice on a data set of bird images from 72% at baseline to 89%. We hypothesized that deep embeddings have redundant, high (4096) dimensional representations; however, reducing the rank of these representations results in a loss of explanatory power. We hypothesized that the dilation transformation of representations explored in past research is too restrictive, and indeed we found that model explanatory power can be significantly improved with a more expressive linear transform. Most surprising and exciting, we found that, consistent with classic psychological literature, human similarity judgments are asymmetric: the similarity of X to Y is not necessarily equal to the similarity of Y to X, and allowing models to express this asymmetry improves explanatory power.
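
To make the asymmetry point concrete, the sketch below assumes a bilinear similarity of the form s(x, y) = xᵀ W y over embeddings, which is one plausible reading of "linear transformations of deep embeddings" rather than the formulation confirmed by the abstract. Under this assumption a diagonal W only rescales each dimension, a symmetric W always gives s(x, y) = s(y, x), and a general W can score the two directions differently. All names and numbers are illustrative.

```python
def bilinear_similarity(x, y, W):
    """Return x^T W y for vectors x, y and a square matrix W (nested lists)."""
    return sum(x[i] * W[i][j] * y[j]
               for i in range(len(x))
               for j in range(len(y)))


# Toy 2-dimensional "embeddings" (hypothetical values).
x = [1.0, 0.0]
y = [0.0, 1.0]

# A symmetric transform cannot distinguish the direction of comparison.
W_symmetric = [[1.0, 0.5],
               [0.5, 1.0]]

# An unconstrained transform can: similarity of x to y differs from y to x.
W_general = [[1.0, 0.8],
             [0.2, 1.0]]

print(bilinear_similarity(x, y, W_symmetric), bilinear_similarity(y, x, W_symmetric))
print(bilinear_similarity(x, y, W_general), bilinear_similarity(y, x, W_general))
```

With the symmetric matrix both directions give the same score, while the general matrix yields two different values for the same pair, which is the kind of asymmetry the abstract reports in human judgments.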

Combining Learned Lyrical Structures and Vocabulary for Improved Lyric Generation

Nov 12, 2018

Pablo Samuel Castro, Maria Attarian

Abstract:The use of language models for generating lyrics and poetry has received increased interest in the last few years. They pose a unique challenge relative to standard natural language problems, as their ultimate purpose is creative; notions of accuracy and reproducibility are secondary to notions of lyricism, structure, and diversity. In this creative setting, traditional quantitative measures for natural language problems, such as BLEU scores, prove inadequate: a high-scoring model may either fail to produce output respecting the desired structure (e.g. song verses), be a terribly boring creative companion, or both. In this work we propose a mechanism for combining two separately trained language models into a framework that is able to produce output respecting the desired song structure, while providing a richness and diversity of vocabulary that renders it more creatively appealing.

* Extended abstract (2 pages) for the NIPS 2018 Second Workshop on Machine Learning for Creativity and Design
