Minzhao Zhu

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Oct 08, 2024

Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang(+2 more)

Abstract: We present GR-2, a state-of-the-art generalist robot agent for versatile and generalizable robot manipulation. GR-2 is first pre-trained on a vast number of Internet videos to capture the dynamics of the world. This large-scale pre-training, involving 38 million video clips and over 50 billion tokens, equips GR-2 with the ability to generalize across a wide range of robotic tasks and environments during subsequent policy learning. Following this, GR-2 is fine-tuned for both video generation and action prediction using robot trajectories. It exhibits impressive multi-task learning capabilities, achieving an average success rate of 97.7% across more than 100 tasks. Moreover, GR-2 demonstrates exceptional generalization to new, previously unseen scenarios, including novel backgrounds, environments, objects, and tasks. Notably, GR-2 scales effectively with model size, underscoring its potential for continued growth and application. Project page: https://gr2-manipulation.github.io.

* Tech Report. Authors are listed in alphabetical order. Project page: https://gr2-manipulation.github.io

ClickSeg: 3D Instance Segmentation with Click-Level Weak Annotations

Jul 19, 2023

Leyao Liu, Tao Kong, Minzhao Zhu, Jiashuo Fan, Lu Fang

Abstract: 3D instance segmentation methods often require fully annotated dense labels for training, which are costly to obtain. In this paper, we present ClickSeg, a novel click-level weakly supervised 3D instance segmentation method that requires only one annotated point per instance. This problem is very challenging because the labels are extremely limited, and it has rarely been addressed before. We first develop a baseline weakly supervised training method in which the model itself generates pseudo labels for unlabeled data. To exploit the properties of the click-level annotation setting, we further propose a new training framework. Instead of generating the pseudo labels with the clustering used at inference time, i.e., mean-shift clustering, we propose to use k-means with fixed initial seeds: the annotated points. New similarity metrics are further designed for clustering. Experiments on the ScanNetV2 and S3DIS datasets show that ClickSeg surpasses the previous best weakly supervised instance segmentation result by a large margin (e.g., +9.4% mAP on ScanNetV2). Using only 0.02% of the supervision signals, ClickSeg achieves ~90% of the accuracy of its fully supervised counterpart. It also achieves state-of-the-art semantic segmentation results among weakly supervised methods that use the same annotation settings.
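The idea of k-means with fixed initial seeds can be illustrated with a minimal sketch. This is not the ClickSeg implementation: the abstract says the method clusters with newly designed similarity metrics, whereas the sketch below assumes plain squared Euclidean distance on raw coordinates. The names `seeded_kmeans`, `squared_distance`, `points`, and `clicks` are hypothetical.

```python
# Illustrative sketch only: k-means whose initial centers are the clicked points.
# All names are hypothetical; the distance metric is an assumption, not ClickSeg's.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seeded_kmeans(points, seeds, iterations=10):
    """Cluster `points` with k-means initialized at `seeds` (one seed per instance).

    Returns a list of cluster indices, one per point.
    """
    centers = [tuple(s) for s in seeds]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members)
                          for d in range(len(old)))
                )
            else:
                new_centers.append(old)  # keep an empty cluster's center unchanged
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels


# Toy data: two well-separated groups, with one "click" in each.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
clicks = [(0.0, 0.0), (5.0, 5.0)]  # one annotated point per instance
pseudo_labels = seeded_kmeans(points, clicks)
```

Because the number of clusters and their starting positions come from the annotations, each resulting cluster corresponds to one annotated instance, which is what makes the output usable as pseudo labels.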

Navigating to Objects in Unseen Environments by Distance Prediction

Feb 08, 2022

Minzhao Zhu, Binglei Zhao, Tao Kong

Abstract: The Object Goal Navigation (ObjectNav) task is to navigate an agent to an object instance in unseen environments. The traditional navigation paradigm plans the shortest path on a pre-built map. Inspired by this, we propose an object goal navigation framework that can directly perform path planning based on an estimated distance map. Specifically, our model takes a bird's-eye-view semantic map as input and estimates the distance from the map cells to the target object based on learned prior knowledge. With the estimated distance map, the agent can explore the environment and navigate to the target object using either a human-designed or a learned navigation policy. Empirical results in visually realistic simulation environments show that the proposed method outperforms a wide range of baselines on success rate and efficiency.
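One simple way to plan on a distance map is greedy descent: from the current cell, step to the neighbor with the smallest estimated distance. The sketch below assumes a small hand-written grid in place of a learned distance estimate, and a 4-connected greedy rule in place of the paper's policies, which the abstract does not specify. The names `greedy_path` and `estimated` are hypothetical.

```python
# Illustrative sketch only: greedy descent on a hypothetical distance map.
# The grid values are made up; in the paper they would come from a learned model.

def greedy_path(distance_map, start, max_steps=100):
    """Follow the steepest decrease in estimated distance from `start`.

    `distance_map[r][c]` is the estimated distance from cell (r, c) to the
    target; None marks an obstacle. Returns the list of visited cells.
    """
    rows, cols = len(distance_map), len(distance_map[0])
    path = [start]
    r, c = start
    for _ in range(max_steps):
        best = (r, c)
        # Check the four axis-aligned neighbors.
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and distance_map[nr][nc] is not None:
                if distance_map[nr][nc] < distance_map[best[0]][best[1]]:
                    best = (nr, nc)
        if best == (r, c):
            break  # local minimum: either the target or a dead end
        r, c = best
        path.append(best)
    return path


estimated = [
    [4, 3, 2],
    [5, None, 1],
    [6, 7, 0],
]
path = greedy_path(estimated, (0, 0))
```

A purely greedy rule can stall in a local minimum when the estimated distances are inaccurate, so a practical policy would need some form of exploration or replanning as the map is updated.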

* Tech report
