Junhao Cai

Generative Artificial Intelligence in Robotic Manipulation: A Survey

Mar 05, 2025

Kun Zhang, Peng Yun, Jun Cen, Junhao Cai, Didi Zhu, Hangjie Yuan, Chao Zhao, Tao Feng, Michael Yu Wang, Qifeng Chen(+3 more)

Abstract: This survey provides a comprehensive review of recent advancements in generative learning models for robotic manipulation, addressing key challenges in the field. Robotic manipulation faces critical bottlenecks, including insufficient data and inefficient data acquisition, long-horizon and complex task planning, and the multi-modality reasoning ability needed for robust policy learning performance across diverse environments. To tackle these challenges, this survey introduces several generative model paradigms, including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), diffusion models, probabilistic flow models, and autoregressive models, highlighting their strengths and limitations. The applications of these models are categorized into three hierarchical layers: the Foundation Layer, focusing on data generation and reward generation; the Intermediate Layer, covering language, code, visual, and state generation; and the Policy Layer, emphasizing grasp generation and trajectory generation. Each layer is explored in detail, along with notable works that have advanced the state of the art. Finally, the survey outlines future research directions and challenges, emphasizing the need for improved efficiency in data utilization, better handling of long-horizon tasks, and enhanced generalization across diverse robotic scenarios. All the related resources, including research papers, open-source data, and projects, are collected for the community at https://github.com/GAI4Manipulation/AwesomeGAIManipulation

Toward Scalable and Efficient Visual Data Transmission in 6G Networks

Sep 24, 2024

Junhao Cai, Taegun An, Changhee Joo

Abstract: 6G network technology will emerge in a landscape where visual data transmissions dominate global mobile traffic and are expected to grow continuously, driven by the increasing demand for AI-based computer vision applications. This will make the already challenging task of visual data transmission even more difficult. In this work, we review effective techniques for visual data transmission, such as content compression and adaptive video streaming, highlighting their advantages and limitations. Further, considering the scalability and cost issues of cloud-based and on-device AI services, we explore distributed in-network computing architectures such as fog computing as a direction for 6G networks, and investigate the technical properties necessary for the timely delivery of visual data.

Gaussian-Informed Continuum for Physical Property Identification and Simulation

Jun 21, 2024

Junhao Cai, Yuji Yang, Weihao Yuan, Yisheng He, Zilong Dong, Liefeng Bo, Hui Cheng, Qifeng Chen

Abstract: This paper studies the problem of estimating physical properties (system identification) through visual observations. To facilitate geometry-aware guidance in physical property estimation, we introduce a novel hybrid framework that leverages 3D Gaussian representation to not only capture explicit shapes but also enable the simulated continuum to deduce implicit shapes during training. We propose a new dynamic 3D Gaussian framework based on motion factorization to recover the object as 3D Gaussian point sets across different time states. Furthermore, we develop a coarse-to-fine filling strategy to generate the density fields of the object from the Gaussian reconstruction, allowing for the extraction of object continuums along with their surfaces and the integration of Gaussian attributes into these continuums. In addition to the extracted object surfaces, the Gaussian-informed continuum also enables the rendering of object masks during simulations, serving as implicit shape guidance for physical property estimation. Extensive experimental evaluations demonstrate that our pipeline achieves state-of-the-art performance across multiple benchmarks and metrics. Additionally, we illustrate the effectiveness of the proposed method through real-world demonstrations, showcasing its practical utility. Our project page is at https://jukgei.github.io/project/gic.

* 19 pages, 8 figures

IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images

Mar 30, 2024

Yushuang Wu, Luyue Shi, Junhao Cai, Weihao Yuan, Lingteng Qiu, Zilong Dong, Liefeng Bo, Shuguang Cui, Xiaoguang Han

Abstract: Generalizable 3D object reconstruction from single-view RGB-D images remains a challenging task, particularly with real-world data. Current state-of-the-art methods develop Transformer-based implicit field learning, necessitating an intensive learning paradigm that requires dense query-supervision uniformly sampled throughout the entire space. We propose a novel approach, IPoD, which harmonizes implicit field learning with point diffusion. This approach treats the query points for implicit field learning as a noisy point cloud for iterative denoising, allowing for their dynamic adaptation to the target object shape. Such adaptive query points harness diffusion learning's capability for coarse shape recovery and also enhance the implicit representation's ability to delineate finer details. In addition, a self-conditioning mechanism is designed to use implicit predictions as the guidance of diffusion learning, leading to a cooperative system. Experiments conducted on the CO3D-v2 dataset affirm the superiority of IPoD, achieving a 7.8% improvement in F-score and 28.6% in Chamfer distance over existing methods. The generalizability of IPoD is also demonstrated on the MVImgNet dataset. Our project page is at https://yushuang-wu.github.io/IPoD.

* CVPR 2024

OV9D: Open-Vocabulary Category-Level 9D Object Pose and Size Estimation

Mar 19, 2024

Junhao Cai, Yisheng He, Weihao Yuan, Siyu Zhu, Zilong Dong, Liefeng Bo, Qifeng Chen

Abstract: This paper studies a new open-set problem: open-vocabulary category-level object pose and size estimation. Given human text descriptions of arbitrary novel object categories, the robot agent seeks to predict the position, orientation, and size of the target object in the observed scene image. To enable such generalizability, we first introduce OO3D-9D, a large-scale photorealistic dataset for this task. Derived from OmniObject3D, OO3D-9D is the largest and most diverse dataset in the field of category-level object pose and size estimation. It includes additional annotations for the symmetry axis of each category, which help resolve symmetric ambiguity. Apart from the large-scale dataset, we find that another key to enabling such generalizability is leveraging the strong prior knowledge in pre-trained visual-language foundation models. We then propose a framework built on pre-trained DinoV2 and text-to-image stable diffusion models to infer the normalized object coordinate space (NOCS) maps of the target instances. This framework fully leverages the visual semantic prior from DinoV2 and the aligned visual and language knowledge within the text-to-image diffusion model, which enables generalization to various text descriptions of novel categories. Comprehensive quantitative and qualitative experiments demonstrate that the proposed open-vocabulary method, trained on our large-scale synthesized data, significantly outperforms the baseline and can effectively generalize to real-world images of unseen categories. The project page is at https://ov9d.github.io.

ERRA: An Embodied Representation and Reasoning Architecture for Long-horizon Language-conditioned Manipulation Tasks

Apr 05, 2023

Chao Zhao, Shuai Yuan, Chunli Jiang, Junhao Cai, Hongyu Yu, Michael Yu Wang, Qifeng Chen

Abstract: This letter introduces ERRA, an embodied learning architecture that enables robots to jointly obtain three fundamental capabilities (reasoning, planning, and interaction) for solving long-horizon language-conditioned manipulation tasks. ERRA is based on tightly-coupled probabilistic inferences at two granularity levels. Coarse-resolution inference is formulated as sequence generation through a large language model, which infers action language from natural language instruction and environment state. The robot then zooms to the fine-resolution inference part to perform the concrete action corresponding to the action language. Fine-resolution inference is constructed as a Markov decision process, which takes action language and environmental sensing as observations and outputs the action. The results of action execution in environments provide feedback for subsequent coarse-resolution reasoning. Such coarse-to-fine inference allows the robot to decompose and achieve long-horizon tasks interactively. In extensive experiments, we show that ERRA can complete various long-horizon manipulation tasks specified by abstract language instructions. We also demonstrate successful generalization to novel but similar natural language instructions.

* Accepted to IEEE Robotics and Automation Letters (RA-L)

Flipbot: Learning Continuous Paper Flipping via Coarse-to-Fine Exteroceptive-Proprioceptive Exploration

Apr 05, 2023

Chao Zhao, Chunli Jiang, Junhao Cai, Michael Yu Wang, Hongyu Yu, Qifeng Chen

Abstract: This paper tackles the task of singulating and grasping paper-like deformable objects. We refer to such tasks as paper-flipping. In contrast to manipulating deformable objects that lack compression strength (such as shirts and ropes), minor variations in the physical properties of the paper-like deformable objects significantly impact the results, making manipulation highly challenging. Here, we present Flipbot, a novel solution for flipping paper-like deformable objects. Flipbot allows the robot to capture object physical properties by integrating exteroceptive and proprioceptive perceptions that are indispensable for manipulating deformable objects. Furthermore, by incorporating a proposed coarse-to-fine exploration process, the system is capable of learning the optimal control parameters for effective paper-flipping through proprioceptive and exteroceptive inputs. We deploy our method on a real-world robot with a soft gripper and learn in a self-supervised manner. The resulting policy demonstrates the effectiveness of Flipbot on paper-flipping tasks with various settings beyond the reach of prior studies, including but not limited to flipping pages throughout a book and emptying paper sheets in a box.

* Accepted to International Conference on Robotics and Automation (ICRA) 2023

Learn to Grasp via Intention Discovery and its Application to Challenging Clutter

Apr 05, 2023

Chao Zhao, Chunli Jiang, Junhao Cai, Hongyu Yu, Michael Yu Wang, Qifeng Chen

Abstract: Humans excel in grasping objects through diverse and robust policies, many of which are so probabilistically rare that exploration-based learning methods hardly observe and learn. Inspired by the human learning process, we propose a method to extract and exploit latent intents from demonstrations, and then learn diverse and robust grasping policies through self-exploration. The resulting policy can grasp challenging objects in various environments with an off-the-shelf parallel gripper. The key component is a learned intention estimator, which maps gripper pose and visual sensory to a set of sub-intents covering important phases of the grasping movement. Sub-intents can be used to build an intrinsic reward to guide policy learning. The learned policy demonstrates remarkable zero-shot generalization from simulation to the real world while retaining its robustness against states that have never been encountered during training, novel objects such as protractors and user manuals, and environments such as the cluttered conveyor.

* IEEE Robotics and Automation Letters, vol. 8, no. 2, pp. 488-495, Feb. 2023
* Accepted to IEEE Robotics and Automation Letters (RA-L)

Volumetric-based Contact Point Detection for 7-DoF Grasping

Sep 14, 2022

Junhao Cai, Jingcheng Su, Zida Zhou, Hui Cheng, Qifeng Chen, Michael Y Wang

Abstract: In this paper, we propose a novel grasp pipeline based on contact point detection on the truncated signed distance function (TSDF) volume to achieve closed-loop 7-degree-of-freedom (7-DoF) grasping in cluttered environments. The key aspects of our method are that 1) the proposed pipeline exploits the TSDF volume in terms of multi-view fusion, contact-point sampling and evaluation, and collision checking, which provides reliable and collision-free 7-DoF gripper poses with real-time performance; 2) the contact-based pose representation effectively eliminates the ambiguity introduced by the normal-based methods, which provides a more precise and flexible solution. Extensive simulated and real-robot experiments demonstrate that the proposed pipeline can select more antipodal and stable grasp poses and outperforms normal-based baselines in terms of the grasp success rate in both simulated and physical scenarios.

* Accepted to Conference on Robot Learning (CoRL) 2022. Supplementary materials: https://openreview.net/forum?id=SrSCqW4dq9

Open-world Semantic Segmentation for LIDAR Point Clouds

Jul 04, 2022

Jun Cen, Peng Yun, Shiwei Zhang, Junhao Cai, Di Luan, Michael Yu Wang, Ming Liu, Mingqian Tang

Abstract: Current methods for LIDAR semantic segmentation are not robust enough for real-world applications, e.g., autonomous driving, since they are closed-set and static. The closed-set assumption makes the network only able to output labels of trained classes, even for objects never seen before, while a static network cannot update its knowledge base according to what it has seen. Therefore, in this work, we propose the open-world semantic segmentation task for LIDAR point clouds, which aims to 1) identify both old and novel classes using open-set semantic segmentation, and 2) gradually incorporate novel objects into the existing knowledge base using incremental learning without forgetting old classes. For this purpose, we propose a REdundAncy cLassifier (REAL) framework to provide a general architecture for both the open-set semantic segmentation and incremental learning problems. The experimental results show that REAL can simultaneously achieve state-of-the-art performance in the open-set semantic segmentation task on the SemanticKITTI and nuScenes datasets, and alleviate the catastrophic forgetting problem by a large margin during incremental learning.

* Accepted by ECCV 2022. arXiv admin note: text overlap with arXiv:2011.10033, arXiv:2109.05441 by other authors
