Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Nick Heppert

cVLA: Towards Efficient Camera-Space VLAs

Jul 02, 2025

Max Argus, Jelena Bratulic, Houman Masnavi, Maxim Velikanov, Nick Heppert, Abhinav Valada, Thomas Brox

Abstract:Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive performance of Vision Language Models (VLMs) on 2D images to directly infer robot end-effector poses in image frame coordinates. Unlike prior VLA models that output low-level controls, our model predicts trajectory waypoints, making it both more efficient to train and robot embodiment agnostic. Despite its lightweight design, our next-token prediction architecture effectively learns meaningful and executable robot trajectories. We further explore the underutilized potential of incorporating depth images, inference-time techniques such as decoding strategies, and demonstration-conditioned action generation. Our model is trained on a simulated dataset and exhibits strong sim-to-real transfer capabilities. We evaluate our approach using a combination of simulated and real data, demonstrating its effectiveness on a real robotic system.

* 20 pages, 10 figures

Via

Access Paper or Ask Questions

Neural Fields in Robotics: A Survey

Oct 26, 2024

Muhammad Zubair Irshad, Mauro Comi, Yen-Chen Lin, Nick Heppert, Abhinav Valada, Rares Ambrus, Zsolt Kira, Jonathan Tremblay

Abstract:Neural Fields have emerged as a transformative approach for 3D scene representation in computer vision and robotics, enabling accurate inference of geometry, 3D semantics, and dynamics from posed 2D data. Leveraging differentiable rendering, Neural Fields encompass both continuous implicit and explicit neural representations enabling high-fidelity 3D reconstruction, integration of multi-modal sensor data, and generation of novel viewpoints. This survey explores their applications in robotics, emphasizing their potential to enhance perception, planning, and control. Their compactness, memory efficiency, and differentiability, along with seamless integration with foundation and generative models, make them ideal for real-time applications, improving robot adaptability and decision-making. This paper provides a thorough review of Neural Fields in robotics, categorizing applications across various domains and evaluating their strengths and limitations, based on over 200 papers. First, we present four key Neural Fields frameworks: Occupancy Networks, Signed Distance Fields, Neural Radiance Fields, and Gaussian Splatting. Second, we detail Neural Fields' applications in five major robotics domains: pose estimation, manipulation, navigation, physics, and autonomous driving, highlighting key works and discussing takeaways and open challenges. Finally, we outline the current limitations of Neural Fields in robotics and propose promising directions for future research. Project page: https://robonerf.github.io

* 20 pages, 20 figures. Project Page: https://robonerf.github.io

Via

Access Paper or Ask Questions

Learning Robotic Manipulation Policies from Point Clouds with Conditional Flow Matching

Sep 11, 2024

Eugenio Chisari, Nick Heppert, Max Argus, Tim Welschehold, Thomas Brox, Abhinav Valada

Figure 1 for Learning Robotic Manipulation Policies from Point Clouds with Conditional Flow Matching

Figure 2 for Learning Robotic Manipulation Policies from Point Clouds with Conditional Flow Matching

Figure 3 for Learning Robotic Manipulation Policies from Point Clouds with Conditional Flow Matching

Figure 4 for Learning Robotic Manipulation Policies from Point Clouds with Conditional Flow Matching

Abstract:Learning from expert demonstrations is a promising approach for training robotic manipulation policies from limited data. However, imitation learning algorithms require a number of design choices ranging from the input modality, training objective, and 6-DoF end-effector pose representation. Diffusion-based methods have gained popularity as they enable predicting long-horizon trajectories and handle multimodal action distributions. Recently, Conditional Flow Matching (CFM) (or Rectified Flow) has been proposed as a more flexible generalization of diffusion models. In this paper, we investigate the application of CFM in the context of robotic policy learning and specifically study the interplay with the other design choices required to build an imitation learning algorithm. We show that CFM gives the best performance when combined with point cloud input observations. Additionally, we study the feasibility of a CFM formulation on the SO(3) manifold and evaluate its suitability with a simplified example. We perform extensive experiments on RLBench which demonstrate that our proposed PointFlowMatch approach achieves a state-of-the-art average success rate of 67.8% over eight tasks, double the performance of the next best method.

* Presented at CoRL 2024. Project website at http://pointflowmatch.cs.uni-freiburg.de/

Via

Access Paper or Ask Questions

Imagine2touch: Predictive Tactile Sensing for Robotic Manipulation using Efficient Low-Dimensional Signals

May 02, 2024

Abdallah Ayad, Adrian Röfer, Nick Heppert, Abhinav Valada

Abstract:Humans seemingly incorporate potential touch signals in their perception. Our goal is to equip robots with a similar capability, which we term Imagine2touch. Imagine2touch aims to predict the expected touch signal based on a visual patch representing the area to be touched. We use ReSkin, an inexpensive and compact touch sensor to collect the required dataset through random touching of five basic geometric shapes, and one tool. We train Imagine2touch on two out of those shapes and validate it on the ood. tool. We demonstrate the efficacy of Imagine2touch through its application to the downstream task of object recognition. In this task, we evaluate Imagine2touch performance in two experiments, together comprising 5 out of training distribution objects. Imagine2touch achieves an object recognition accuracy of 58% after ten touches per object, surpassing a proprioception baseline.

* 3 pages, 3 figures, 2 tables, accepted at ViTac2024 ICRA2024 Workshop. arXiv admin note: substantial text overlap with arXiv:2403.15107

Via

Access Paper or Ask Questions

CenterArt: Joint Shape Reconstruction and 6-DoF Grasp Estimation of Articulated Objects

Apr 23, 2024

Sassan Mokhtar, Eugenio Chisari, Nick Heppert, Abhinav Valada

Figure 1 for CenterArt: Joint Shape Reconstruction and 6-DoF Grasp Estimation of Articulated Objects

Figure 2 for CenterArt: Joint Shape Reconstruction and 6-DoF Grasp Estimation of Articulated Objects

Figure 3 for CenterArt: Joint Shape Reconstruction and 6-DoF Grasp Estimation of Articulated Objects

Abstract:Precisely grasping and reconstructing articulated objects is key to enabling general robotic manipulation. In this paper, we propose CenterArt, a novel approach for simultaneous 3D shape reconstruction and 6-DoF grasp estimation of articulated objects. CenterArt takes RGB-D images of the scene as input and first predicts the shape and joint codes through an encoder. The decoder then leverages these codes to reconstruct 3D shapes and estimate 6-DoF grasp poses of the objects. We further develop a mechanism for generating a dataset of 6-DoF grasp ground truth poses for articulated objects. CenterArt is trained on realistic scenes containing multiple articulated objects with randomized designs, textures, lighting conditions, and realistic depths. We perform extensive experiments demonstrating that CenterArt outperforms existing methods in accuracy and robustness.

* 4 pages, 2 figures, accepted to the ICRA 2024 Workshop on 3D Visual Representations for Robot Manipulation

Via

Access Paper or Ask Questions

PseudoTouch: Efficiently Imaging the Surface Feel of Objects for Robotic Manipulation

Mar 22, 2024

Adrian Röfer, Nick Heppert, Abdallah Ayman, Eugenio Chisari, Abhinav Valada

Figure 1 for PseudoTouch: Efficiently Imaging the Surface Feel of Objects for Robotic Manipulation

Figure 2 for PseudoTouch: Efficiently Imaging the Surface Feel of Objects for Robotic Manipulation

Figure 3 for PseudoTouch: Efficiently Imaging the Surface Feel of Objects for Robotic Manipulation

Figure 4 for PseudoTouch: Efficiently Imaging the Surface Feel of Objects for Robotic Manipulation

Abstract:Humans seemingly incorporate potential touch signals in their perception. Our goal is to equip robots with a similar capability, which we term \ourmodel. \ourmodel aims to predict the expected touch signal based on a visual patch representing the touched area. We frame this problem as the task of learning a low-dimensional visual-tactile embedding, wherein we encode a depth patch from which we decode the tactile signal. To accomplish this task, we employ ReSkin, an inexpensive and replaceable magnetic-based tactile sensor. Using ReSkin, we collect and train PseudoTouch on a dataset comprising aligned tactile and visual data pairs obtained through random touching of eight basic geometric shapes. We demonstrate the efficacy of PseudoTouch through its application to two downstream tasks: object recognition and grasp stability prediction. In the object recognition task, we evaluate the learned embedding's performance on a set of five basic geometric shapes and five household objects. Using PseudoTouch, we achieve an object recognition accuracy 84% after just ten touches, surpassing a proprioception baseline. For the grasp stability task, we use ACRONYM labels to train and evaluate a grasp success predictor using PseudoTouch's predictions derived from virtual depth information. Our approach yields an impressive 32% absolute improvement in accuracy compared to the baseline relying on partial point cloud data. We make the data, code, and trained models publicly available at http://pseudotouch.cs.uni-freiburg.de.

* 8 pages, 7 figures, 2 tables, submitted to IROS2024

Via

Access Paper or Ask Questions

DITTO: Demonstration Imitation by Trajectory Transformation

Mar 22, 2024

Nick Heppert, Max Argus, Tim Welschehold, Thomas Brox, Abhinav Valada

Figure 1 for DITTO: Demonstration Imitation by Trajectory Transformation

Figure 2 for DITTO: Demonstration Imitation by Trajectory Transformation

Figure 3 for DITTO: Demonstration Imitation by Trajectory Transformation

Figure 4 for DITTO: Demonstration Imitation by Trajectory Transformation

Abstract:Teaching robots new skills quickly and conveniently is crucial for the broader adoption of robotic systems. In this work, we address the problem of one-shot imitation from a single human demonstration, given by an RGB-D video recording through a two-stage process. In the first stage which is offline, we extract the trajectory of the demonstration. This entails segmenting manipulated objects and determining their relative motion in relation to secondary objects such as containers. Subsequently, in the live online trajectory generation stage, we first \mbox{re-detect} all objects, then we warp the demonstration trajectory to the current scene, and finally, we trace the trajectory with the robot. To complete these steps, our method makes leverages several ancillary models, including those for segmentation, relative object pose estimation, and grasp prediction. We systematically evaluate different combinations of correspondence and re-detection methods to validate our design decision across a diverse range of tasks. Specifically, we collect demonstrations of ten different tasks including pick-and-place tasks as well as articulated object manipulation. Finally, we perform extensive evaluations on a real robot system to demonstrate the effectiveness and utility of our approach in real-world scenarios. We make the code publicly available at http://ditto.cs.uni-freiburg.de.

* 8 pages, 4 figures, 3 tables, submitted to IROS 2024

Via

Access Paper or Ask Questions

CenterGrasp: Object-Aware Implicit Representation Learning for Simultaneous Shape Reconstruction and 6-DoF Grasp Estimation

Dec 13, 2023

Eugenio Chisari, Nick Heppert, Tim Welschehold, Wolfram Burgard, Abhinav Valada

Abstract:Reliable object grasping is a crucial capability for autonomous robots. However, many existing grasping approaches focus on general clutter removal without explicitly modeling objects and thus only relying on the visible local geometry. We introduce CenterGrasp, a novel framework that combines object awareness and holistic grasping. CenterGrasp learns a general object prior by encoding shapes and valid grasps in a continuous latent space. It consists of an RGB-D image encoder that leverages recent advances to detect objects and infer their pose and latent code, and a decoder to predict shape and grasps for each object in the scene. We perform extensive experiments on simulated as well as real-world cluttered scenes and demonstrate strong scene reconstruction and 6-DoF grasp-pose estimation performance. Compared to the state of the art, CenterGrasp achieves an improvement of 38.5 mm in shape reconstruction and 33 percentage points on average in grasp success. We make the code and trained models publicly available at http://centergrasp.cs.uni-freiburg.de.

* Submitted to RA-L. Video, code and models available at http://centergrasp.cs.uni-freiburg.de

Via

Access Paper or Ask Questions

AO-Grasp: Articulated Object Grasp Generation

Oct 24, 2023

Carlota Parés Morlans, Claire Chen, Yijia Weng, Michelle Yi, Yuying Huang, Nick Heppert, Linqi Zhou, Leonidas Guibas, Jeannette Bohg

Abstract:We introduce AO-Grasp, a grasp proposal method that generates stable and actionable 6 degree-of-freedom grasps for articulated objects. Our generated grasps enable robots to interact with articulated objects, such as opening and closing cabinets and appliances. Given a segmented partial point cloud of a single articulated object, AO-Grasp predicts the best grasp points on the object with a novel Actionable Grasp Point Predictor model and then finds corresponding grasp orientations for each point by leveraging a state-of-the-art rigid object grasping method. We train AO-Grasp on our new AO-Grasp Dataset, which contains 48K actionable parallel-jaw grasps on synthetic articulated objects. In simulation, AO-Grasp achieves higher grasp success rates than existing rigid object grasping and articulated object interaction baselines on both train and test categories. Additionally, we evaluate AO-Grasp on 120 realworld scenes of objects with varied geometries, articulation axes, and joint states, where AO-Grasp produces successful grasps on 67.5% of scenes, while the baseline only produces successful grasps on 33.3% of scenes.

* Project website: https://stanford-iprl-lab.github.io/ao-grasp

Via

Access Paper or Ask Questions

CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects

Mar 28, 2023

Nick Heppert, Muhammad Zubair Irshad, Sergey Zakharov, Katherine Liu, Rares Andrei Ambrus, Jeannette Bohg, Abhinav Valada, Thomas Kollar

Figure 1 for CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects

Figure 2 for CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects

Figure 3 for CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects

Figure 4 for CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects

Abstract:We present CARTO, a novel approach for reconstructing multiple articulated objects from a single stereo RGB observation. We use implicit object-centric representations and learn a single geometry and articulation decoder for multiple object categories. Despite training on multiple categories, our decoder achieves a comparable reconstruction accuracy to methods that train bespoke decoders separately for each category. Combined with our stereo image encoder we infer the 3D shape, 6D pose, size, joint type, and the joint state of multiple unknown objects in a single forward pass. Our method achieves a 20.4% absolute improvement in mAP 3D IOU50 for novel instances when compared to a two-stage pipeline. Inference time is fast and can run on a NVIDIA TITAN XP GPU at 1 HZ for eight or less objects present. While only trained on simulated data, CARTO transfers to real-world object instances. Code and evaluation data is available at: http://carto.cs.uni-freiburg.de

* 20 pages, 11 figures, accepted at CVPR 2023

Via

Access Paper or Ask Questions