Abstract:Transparent object depth perception poses a challenge in everyday life and logistics, primarily due to the inability of standard 3D sensors to accurately capture depth on transparent or reflective surfaces. This limitation significantly affects depth map and point cloud-reliant applications, especially in robotic manipulation. We developed a vision transformer-based algorithm for stereo depth recovery of transparent objects. This approach is complemented by an innovative feature post-fusion module, which enhances the accuracy of depth recovery by exploiting structural features in images. To address the high costs associated with dataset collection for stereo camera-based perception of transparent objects, our method incorporates a parameter-aligned, domain-adaptive, and physically realistic Sim2Real simulation for efficient data generation, accelerated by an AI algorithm. Our experimental results demonstrate the model's exceptional Sim2Real generalizability in real-world scenarios, enabling precise depth mapping of transparent objects to assist in robotic manipulation. Project details are available at https://sites.google.com/view/cleardepth/ .
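A minimal sketch of what a "feature post-fusion" step could look like: depth features from a stereo branch are refined with structural image features via a residual fusion block. Channel sizes, the residual design, and the module name are assumptions for illustration, not the paper's architecture.

```python
# Hypothetical post-fusion block: fuse stereo depth features with structural
# (e.g. edge) features and predict a residual correction.
import torch
import torch.nn as nn

class FeaturePostFusion(nn.Module):
    def __init__(self, depth_ch=128, struct_ch=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(depth_ch + struct_ch, depth_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(depth_ch, depth_ch, kernel_size=3, padding=1),
        )

    def forward(self, depth_feat, struct_feat):
        # Concatenate along channels, then refine the depth features residually.
        fused = self.fuse(torch.cat([depth_feat, struct_feat], dim=1))
        return depth_feat + fused

f_depth = torch.randn(2, 128, 60, 80)   # features from the stereo transformer branch
f_struct = torch.randn(2, 64, 60, 80)   # structural image features
out = FeaturePostFusion()(f_depth, f_struct)
print(out.shape)                        # torch.Size([2, 128, 60, 80])
```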
Abstract:We introduce DexGanGrasp, a dexterous grasping synthesis method that generates and evaluates grasps from a single view in real time. DexGanGrasp comprises a Conditional Generative Adversarial Networks (cGANs)-based DexGenerator to generate dexterous grasps and a discriminator-like DexEvaluator to assess the stability of these grasps. Extensive simulation and real-world experiments showcase the effectiveness of our proposed method, outperforming the baseline FFHNet with an 18.57% higher success rate in real-world evaluation. We further extend DexGanGrasp to DexAfford-Prompt, an open-vocabulary affordance grounding pipeline for dexterous grasping that leverages Multimodal Large Language Models (MLLMs) and Vision Language Models (VLMs) to achieve task-oriented grasping with successful real-world deployments.
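To make the generator/evaluator split concrete, here is an illustrative cGAN-style pair: a generator maps noise plus a single-view scene embedding to a grasp vector, and an evaluator scores its stability. Layer sizes, the 15-D grasp parameterization, and the class names are assumptions, not the DexGenerator/DexEvaluator implementation.

```python
import torch
import torch.nn as nn

class GeneratorSketch(nn.Module):
    def __init__(self, z_dim=64, cond_dim=256, grasp_dim=15):  # assumed wrist pose + joint config
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + cond_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, grasp_dim),
        )

    def forward(self, z, cond):
        return self.net(torch.cat([z, cond], dim=-1))

class EvaluatorSketch(nn.Module):
    def __init__(self, cond_dim=256, grasp_dim=15):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim + grasp_dim, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, grasp, cond):
        return torch.sigmoid(self.net(torch.cat([grasp, cond], dim=-1)))  # stability score

cond = torch.randn(8, 256)                        # single-view scene embedding
grasps = GeneratorSketch()(torch.randn(8, 64), cond)
scores = EvaluatorSketch()(grasps, cond)          # rank candidates by predicted stability
```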
Abstract:Synthesizing diverse and accurate grasps with multi-fingered hands is an important yet challenging task in robotics. Previous efforts focusing on generative modeling have fallen short of precisely capturing the multi-modal, high-dimensional grasp distribution. To address this, we propose exploiting a special kind of Deep Generative Model (DGM) based on Normalizing Flows (NFs), an expressive model for learning complex probability distributions. Specifically, we first observed an encouraging improvement in diversity by directly applying a single conditional NF (cNF), dubbed FFHFlow-cnf, to learn a grasp distribution conditioned on the incomplete point cloud. However, we also recognized limited performance gains due to restricted expressivity in the latent space. This motivated us to develop a novel flow-based Deep Latent Variable Model (DLVM), namely FFHFlow-lvm, which facilitates more reasonable latent features, leading to both diverse and accurate grasp synthesis for unseen objects. Unlike Variational Autoencoders (VAEs), the proposed DLVM counteracts typical pitfalls such as mode collapse and mis-specified priors by leveraging two cNFs for the prior and likelihood distributions, which in VAEs are usually restricted to isotropic Gaussians. Comprehensive experiments in simulation and real-robot scenarios demonstrate that our method generates more accurate and diverse grasps than the VAE baselines. Additionally, a run-time comparison is conducted to reveal its high potential for real-time applications.
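The sketch below shows the basic mechanics of a conditional normalizing flow (a single RealNVP-style affine coupling conditioned on an object embedding) and how grasps would be sampled by inverting it. It is meant only to illustrate the conditioning idea; the grasp dimensionality and network sizes are assumptions, and the FFHFlow architectures are not reproduced here.

```python
import torch
import torch.nn as nn

class CondCoupling(nn.Module):
    """One conditional affine coupling layer; a real flow stacks several."""
    def __init__(self, dim=16, cond_dim=128, hidden=256):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x, cond):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(torch.cat([x1, cond], dim=-1)).chunk(2, dim=-1)
        s = torch.tanh(s)                         # keep scales bounded
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)                   # used in the flow log-likelihood
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y, cond):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self.net(torch.cat([y1, cond], dim=-1)).chunk(2, dim=-1)
        s = torch.tanh(s)
        x2 = (y2 - t) * torch.exp(-s)
        return torch.cat([y1, x2], dim=-1)

# Sampling: draw latent noise and invert the flow, conditioned on an assumed
# point-cloud embedding of the (partially observed) object.
flow = CondCoupling()
cond = torch.randn(32, 128)
grasp_samples = flow.inverse(torch.randn(32, 16), cond)
```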
Abstract:Despite the substantial progress in deep learning, its adoption in industrial robotics projects remains limited, primarily due to challenges in data acquisition and labeling. Previous sim2real approaches using domain randomization require extensive scene and model optimization. To address these issues, we introduce an innovative physically-based structured light simulation system that generates both RGB and physically realistic depth images, surpassing previous dataset generation tools. We create an RGBD dataset tailored for robotic industrial grasping scenarios and evaluate it across various tasks, including object detection, instance segmentation, and sim2real visual perception embedded in industrial robotic grasping. By reducing the sim2real gap and enhancing deep learning training, we facilitate the application of deep learning models in industrial settings. Project details are available at https://baikaixinpublic.github.io/structured light 3D synthesizer/.
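As background for the sensing principle a structured-light simulator has to model, here is a toy phase-shift example: three shifted fringe patterns are generated and the wrapped phase is recovered per pixel with the standard 3-step formula. The resolution and fringe count are arbitrary, and this is not the authors' rendering pipeline.

```python
import numpy as np

H, W, periods = 480, 640, 16
x = np.arange(W) / W
shifts = [-2 * np.pi / 3, 0.0, 2 * np.pi / 3]

# Projected fringe patterns I_k(x) = 0.5 + 0.5*cos(2*pi*f*x + delta_k)
patterns = [np.tile(0.5 + 0.5 * np.cos(2 * np.pi * periods * x + d), (H, 1)) for d in shifts]

# Wrapped phase from the standard 3-step phase-shift formula.
I1, I2, I3 = patterns
phase = np.arctan2(np.sqrt(3) * (I1 - I3), 2 * I2 - I1 - I3)
print(phase.shape, phase.min(), phase.max())   # (480, 640), values in [-pi, pi]
```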
Abstract:The integration of optimization methods and generative models has significantly advanced dexterous manipulation techniques for five-fingered hand grasping. Yet, the application of these techniques in cluttered environments is a relatively unexplored area. To address this research gap, we have developed a novel method for generating five-fingered hand grasp samples in cluttered settings. This method emphasizes simulated grasp quality and the nuanced interaction between the hand and surrounding objects. A key aspect of our approach is our data generation method, which estimates spatial and semantic contact representations and affordance grasps based on object affordance information. Furthermore, our Contact Semantic Conditional Variational Autoencoder (CoSe-CVAE) network is adept at creating comprehensive contact maps from point clouds, incorporating both spatial and semantic data. We introduce a unique grasp detection technique that efficiently formulates mechanical hand grasp poses from these maps. Additionally, our evaluation model is designed to assess grasp quality and collision probability, significantly improving the practicality of five-fingered hand grasping in complex scenarios. Our data generation method outperforms previous datasets in grasp diversity, scene diversity, and modality diversity. Our grasp generation method has demonstrated remarkable success, outperforming established baselines with an 81.0% average success rate in real-world single-object grasping and a 75.3% success rate in multi-object grasping. The dataset and supplementary materials can be found at https://sites.google.com/view/ffh-clutteredgrasping, and we will release the code upon publication.
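For readers unfamiliar with conditional VAEs over contact maps, the following is a generic sketch: a per-point contact map is encoded together with a point-cloud embedding, and the decoder reconstructs it from a latent sample plus the condition. The point count, latent size, and loss weighting are illustrative defaults, not the CoSe-CVAE design.

```python
import torch
import torch.nn as nn

class ContactCVAESketch(nn.Module):
    def __init__(self, n_points=2048, cond_dim=256, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_points + cond_dim, 512), nn.ReLU(),
                                 nn.Linear(512, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim + cond_dim, 512), nn.ReLU(),
                                 nn.Linear(512, n_points), nn.Sigmoid())

    def forward(self, contact_map, cond):
        mu, logvar = self.enc(torch.cat([contact_map, cond], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        recon = self.dec(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar

model = ContactCVAESketch()
target = torch.rand(4, 2048)                                      # dummy per-point contact labels
recon, mu, logvar = model(target, torch.randn(4, 256))
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())     # ELBO regularizer
loss = nn.functional.binary_cross_entropy(recon, target) + 1e-3 * kl
```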
Abstract:The exploration of robotic dexterous hands utilizing tools has recently attracted considerable attention. A significant challenge in this field is the precise awareness of a tool's pose when grasped, as occlusion by the hand often degrades the quality of the estimation. Additionally, the tool's overall pose often fails to accurately represent the contact interaction, thereby limiting the effectiveness of vision-guided, contact-dependent activities. To overcome this limitation, we present the innovative TOOLEE dataset, which, to the best of our knowledge, is the first to feature affordance segmentation of a tool's end-effector (EE) along with its 6D pose defined by its usage. Furthermore, we propose the ToolEENet framework for accurate 6D pose estimation of the tool's EE. This framework begins by segmenting the tool's EE from raw RGBD data, and then uses a diffusion model-based pose estimator for 6D pose estimation at the category level. To address the issue of symmetry in pose estimation, we introduce a symmetry-aware pose representation that enhances the consistency of pose estimation. Our approach excels in this field, demonstrating high levels of precision and generalization. Furthermore, it shows great promise for application in contact-based manipulation scenarios. All data and code are available on the project website: https://yuyangtu.github.io/projectToolEENet.html
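To illustrate why symmetry awareness matters in pose estimation, the sketch below computes a rotation error that is invariant to an assumed n-fold symmetry of the end-effector about its z-axis, by taking the minimum geodesic distance over all symmetry-equivalent ground-truth rotations. This conveys the general idea, not the paper's exact pose representation.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def symmetry_aware_rot_error(R_pred, R_gt, n_fold=8):
    """Minimum geodesic angle (rad) between R_pred and R_gt under z-axis n-fold symmetry."""
    errors = []
    for k in range(n_fold):
        R_sym = R.from_euler("z", 2 * np.pi * k / n_fold).as_matrix()
        dR = R_pred @ (R_gt @ R_sym).T
        cos = np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0)
        errors.append(np.arccos(cos))
    return min(errors)

R_gt = R.from_euler("zyx", [0.3, 0.1, -0.2]).as_matrix()
R_pred = R_gt @ R.from_euler("z", 2 * np.pi / 8).as_matrix()   # symmetry-equivalent prediction
print(symmetry_aware_rot_error(R_pred, R_gt))                  # ~0.0 despite the rotation offset
```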
Abstract:We introduce a Cable Grasping-Convolutional Neural Network (CG-CNN) designed to facilitate robust cable grasping in cluttered environments. Utilizing physics simulations, we generate an extensive dataset that mimics the intricacies of cable grasping, factoring in potential collisions between cables and robotic grippers. We employ the Approximate Convex Decomposition technique to dissect the non-convex cable model, with grasp quality autonomously labeled based on simulated grasping attempts. The CG-CNN is refined using this simulated dataset and enhanced through domain randomization techniques. Subsequently, the trained model predicts grasp quality, guiding the optimal grasp pose to the robot controller for execution. Grasping efficacy is assessed across both synthetic and real-world settings. Given our model's implicit collision sensitivity, we achieved commendable success rates of 92.3% for known cables and 88.4% for unknown cables, surpassing contemporary state-of-the-art approaches. Supplementary materials can be found at https://leizhang-public.github.io/cg-cnn/ .
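The generic "score candidate grasps with a quality CNN and execute the best one" loop looks roughly like the sketch below. The small network, the 64x64 depth crops, and the candidate count are placeholders, not the CG-CNN architecture or its preprocessing.

```python
import torch
import torch.nn as nn

class GraspQualityCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, depth_patch):                  # (B, 1, 64, 64) depth crops
        f = self.features(depth_patch).flatten(1)
        return torch.sigmoid(self.head(f))           # grasp quality in [0, 1]

model = GraspQualityCNN().eval()
patches = torch.randn(100, 1, 64, 64)                # crops around sampled grasp candidates
with torch.no_grad():
    quality = model(patches).squeeze(1)
best = int(torch.argmax(quality))                    # index of the grasp sent to the controller
```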
Abstract:Traditional visual servoing methods struggle to servo between scenes observed from multiple perspectives, a task humans can complete with visual signals alone. In this paper, we investigated how multi-perspective visual servoing can be solved under robot-specific constraints, including self-collision and singularity problems. We presented a novel learning-based multi-perspective visual servoing framework, which iteratively estimates robot actions from latent space representations of visual states using reinforcement learning. Furthermore, our approach was trained and validated in a Gazebo simulation environment connected to OpenAI Gym. Through simulation experiments, we showed that our method can successfully learn an optimal control policy given initial images from different perspectives, and that it outperformed the Direct Visual Servoing algorithm with a mean success rate of 97.0%.
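A single closed-loop iteration of latent-state visual servoing could be sketched as follows: the current and goal images are encoded into latent vectors, and a policy maps the pair to an end-effector velocity command. The encoder, policy, and dimensions here are assumptions for illustration, not the trained RL agent from the paper.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 64),
)
policy = nn.Sequential(nn.Linear(2 * 64, 128), nn.ReLU(), nn.Linear(128, 6), nn.Tanh())

def servo_step(current_img, goal_img):
    """One closed-loop iteration: images in, normalized 6-DoF velocity command out."""
    with torch.no_grad():
        z = torch.cat([encoder(current_img), encoder(goal_img)], dim=-1)
        return policy(z)    # (B, 6) twist, to be scaled and checked by the controller

action = servo_step(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128))
```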
Abstract:An important prerequisite for autonomous robots is their ability to reliably grasp a wide variety of objects. Most state-of-the-art systems employ specialized or simple end-effectors, such as two-jaw grippers, which severely limit the range of objects that can be manipulated. Additionally, they conventionally require a structured and fully predictable environment, while the vast majority of our world is complex, unstructured, and dynamic. This paper presents an implementation that overcomes both issues. Firstly, the integration of a five-finger hand enhances the variety of possible grasps and manipulable objects. This kinematically complex end-effector is controlled by a deep learning-based generative grasping network. The required virtual model of the unknown target object is iteratively completed by processing visual sensor data. Secondly, this visual feedback is employed to realize closed-loop servo control which compensates for external disturbances. Our experiments on real hardware confirm the system's capability to reliably grasp unknown dynamic target objects without a priori knowledge of their trajectories. To the best of our knowledge, this is the first method to achieve dynamic multi-fingered grasping for unknown objects. A video of the experiments is available at https://youtu.be/Ut28yM1gnvI.
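The overall "perceive, complete the object model, re-plan the grasp, servo" cycle described above can be summarized as the skeleton below. The perception, shape-completion, grasp-generation, and control calls are stubs standing in for the system's actual components, and the loop rate is an assumption.

```python
import time
import numpy as np

def get_point_cloud():                     # placeholder for the visual sensor
    return np.random.rand(1024, 3)

def complete_object_model(partial_cloud):  # placeholder for iterative model completion
    return partial_cloud

def generate_grasp(object_model):          # placeholder for the generative grasping network
    centroid = object_model.mean(axis=0)
    return {"position": centroid, "joint_config": np.zeros(20)}

def send_hand_target(grasp):               # placeholder for the servo controller
    pass

for _ in range(10):                        # closed-loop cycle (rate is an assumption)
    cloud = get_point_cloud()
    model = complete_object_model(cloud)   # virtual model refined from visual feedback
    grasp = generate_grasp(model)          # re-planned as the dynamic target moves
    send_hand_target(grasp)                # servo control compensates for disturbances
    time.sleep(0.03)
```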
Abstract:Precise robotic grasping of several novel objects is a huge challenge in manufacturing, automation, and logistics. Most of the current methods for model-free grasping are disadvantaged by the sparse data in grasping datasets and by errors in sensor data and contact models. This study combines data generation and sim-to-real transfer learning in a grasping framework that reduces the sim-to-real gap and enables precise and reliable model-free grasping. A large-scale robotic grasping dataset with dense grasp labels is generated using domain randomization methods and a novel data augmentation method for deep learning-based robotic grasping to address the data sparsity problem. We present an end-to-end robotic grasping network with a grasp optimizer. The grasp policies are trained with sim-to-real transfer learning. The presented results suggest that our grasping framework reduces the uncertainties in grasping datasets, sensor data, and contact models. In physical robotic experiments, our grasping framework grasped single known objects and novel complex-shaped household objects with a success rate of 90.91%. In a complex multi-object grasping scenario, the success rate was 85.71%. The proposed grasping framework outperformed two state-of-the-art methods in both known and unknown object robotic grasping.
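As a small example of the kind of depth-image domain randomization used to narrow the sim-to-real gap, the sketch below corrupts a simulated depth map with sensor-like noise, dropout, and scale jitter. The specific noise parameters are illustrative defaults, not the paper's augmentation settings.

```python
import numpy as np

def randomize_depth(depth, rng=np.random.default_rng()):
    """Apply sensor-like corruptions to a simulated depth image (meters)."""
    out = depth.copy()
    out += rng.normal(0.0, 0.003, size=out.shape)            # Gaussian sensor noise
    dropout = rng.random(out.shape) < 0.02                    # missing-depth pixels
    out[dropout] = 0.0
    scale = rng.uniform(0.98, 1.02)                           # slight depth-scale jitter
    return out * scale

sim_depth = np.full((480, 640), 0.6, dtype=np.float32)        # synthetic flat scene at 0.6 m
aug_depth = randomize_depth(sim_depth)
```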