Mi Yan

GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data

May 06, 2025

Shengliang Deng, Mi Yan, Songlin Wei, Haixin Ma, Yuxin Yang, Jiayi Chen, Zhiqi Zhang, Taoyu Yang, Xuheng Zhang, Heming Cui(+2 more)

Abstract: Embodied foundation models are gaining increasing attention for their zero-shot generalization, scalability, and adaptability to new tasks through few-shot post-training. However, existing models rely heavily on real-world data, which is costly and labor-intensive to collect. Synthetic data offers a cost-effective alternative, yet its potential remains largely underexplored. To bridge this gap, we explore the feasibility of training Vision-Language-Action (VLA) models entirely with large-scale synthetic action data. We curate SynGrasp-1B, a billion-frame robotic grasping dataset generated in simulation with photorealistic rendering and extensive domain randomization. Building on this, we present GraspVLA, a VLA model pretrained on large-scale synthetic action data as a foundation model for grasping tasks. GraspVLA integrates autoregressive perception tasks and flow-matching-based action generation into a unified Chain-of-Thought process, enabling joint training on synthetic action data and Internet semantics data. This design helps mitigate sim-to-real gaps and facilitates the transfer of learned actions to a broader range of Internet-covered objects, achieving open-vocabulary generalization in grasping. Extensive evaluations across real-world and simulation benchmarks demonstrate GraspVLA's advanced zero-shot generalizability and few-shot adaptability to specific human preferences. We will release the SynGrasp-1B dataset and pre-trained weights to benefit the community.

MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation

Jan 15, 2024

Mi Yan, Jiazhao Zhang, Yan Zhu, He Wang

Abstract: Open-vocabulary 3D instance segmentation has emerged as a frontier topic due to its capability to segment 3D instances beyond a predefined set of categories. However, compared to the significant progress in the 2D domain, methods for 3D open-vocabulary instance segmentation are hindered by the limited scale of high-quality annotated 3D data. To harness the capabilities of 2D models, recent efforts have focused on merging 2D masks based on metrics such as geometric and semantic similarity to form 3D instances. In contrast to these local metrics, we propose a novel metric called view consensus to better exploit multi-view observation. The key insight is that two 2D masks should be considered as belonging to the same instance if a considerable number of 2D masks from other views contain both of them. Based on this metric, we build a global mask graph and iteratively cluster masks, prioritizing mask pairs with solid view consensus. The 3D point clusters corresponding to these 2D mask clusters can be regarded as 3D instances, along with the fused open-vocabulary features from the clustered 2D masks. Through this multi-view verification and fusion mechanism, our method effectively leverages the prior instance knowledge from massive 2D masks predicted by visual foundation models, eliminating the need for training on 3D data. Experiments on publicly available datasets, including ScanNet200 and MatterPort3D, demonstrate that our method achieves state-of-the-art performance in both open-vocabulary instance segmentation and class-agnostic mask generation. Our project page is at https://pku-epic.github.io/MaskClustering.
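
The view consensus idea in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes each 2D mask has already been back-projected to a set of 3D point ids in a shared point cloud, and the helper names `contains` and `view_consensus`, as well as the `min_coverage` threshold, are hypothetical choices made for illustration only.

```python
def contains(container, mask, min_coverage=0.8):
    """Return True if `container` covers at least `min_coverage` of `mask`.

    Masks are sets of 3D point ids (an assumption of this sketch).
    """
    if not mask:
        return False
    return len(container & mask) / len(mask) >= min_coverage


def view_consensus(mask_a, mask_b, other_masks, min_coverage=0.8):
    """Count masks from other views that contain both mask_a and mask_b."""
    return sum(
        1
        for m in other_masks
        if contains(m, mask_a, min_coverage) and contains(m, mask_b, min_coverage)
    )


# Toy scene: two partial masks of a chair and one mask of a table.
chair_seat = {1, 2, 3, 4}
chair_back = {5, 6, 7}
table_top = {20, 21, 22}

# Masks observed from other views (hypothetical data).
other_views = [
    {1, 2, 3, 4, 5, 6, 7},       # sees the whole chair
    {1, 2, 3, 4, 5, 6, 7, 8},    # sees the whole chair plus one extra point
    {20, 21, 22, 23},            # sees only the table
]

seat_back = view_consensus(chair_seat, chair_back, other_views)
seat_table = view_consensus(chair_seat, table_top, other_views)
print(seat_back, seat_table)
```

In this toy data, the seat and back masks are supported by two other views while the seat and table masks are supported by none, so a clustering step that prioritizes high-consensus pairs would merge the chair parts first.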

Tracking and Reconstructing Hand Object Interactions from Point Cloud Sequences in the Wild

Sep 24, 2022

Jiayi Chen, Mi Yan, Jiazhao Zhang, Yinzhen Xu, Xiaolong Li, Yijia Weng, Li Yi, Shuran Song, He Wang

Abstract: In this work, we tackle the challenging task of jointly tracking hand and object poses and reconstructing their shapes from depth point cloud sequences in the wild, given the initial poses at frame 0. For the first time, we propose a point cloud based hand joint tracking network, HandTrackNet, to estimate the inter-frame hand joint motion. HandTrackNet introduces a novel hand pose canonicalization module to ease the tracking task, yielding accurate and robust hand joint tracking. Our pipeline then reconstructs the full hand by converting the predicted hand joints into the template-based parametric hand model MANO. For object tracking, we devise a simple yet effective module that estimates the object SDF from the first frame and performs optimization-based tracking. Finally, a joint optimization step is adopted to perform joint hand and object reasoning, which alleviates the occlusion-induced ambiguity and further refines the hand pose. During training, the whole pipeline sees only synthetic data, which are synthesized with sufficient variations and by depth simulation for ease of generalization. The whole pipeline is designed with generalization gaps in mind and is thus directly transferable to real in-the-wild data. We evaluate our method on two real hand-object interaction datasets, HO3D and DexYCB, without any finetuning. Our experiments demonstrate that the proposed method significantly outperforms the previous state-of-the-art depth-based hand and object pose estimation and tracking methods, running at a frame rate of 9 FPS.

Domain Adaptation on Point Clouds via Geometry-Aware Implicits

Dec 17, 2021

Yuefan Shen, Yanchao Yang, Mi Yan, He Wang, Youyi Zheng, Leonidas Guibas

Abstract: As a popular geometric representation, point clouds have attracted much attention in 3D vision, leading to many applications in autonomous driving and robotics. One important yet unsolved issue for learning on point clouds is that point clouds of the same object can have significant geometric variations if generated using different procedures or captured using different sensors. These inconsistencies induce domain gaps such that neural networks trained on one domain may fail to generalize to others. A typical technique to reduce the domain gap is to perform adversarial training so that point clouds in the feature space can align. However, adversarial training easily falls into degenerate local minima, resulting in negative adaptation gains. Here we propose a simple yet effective method for unsupervised domain adaptation on point clouds by employing a self-supervised task of learning geometry-aware implicits, which plays two critical roles in one shot. First, the geometric information in the point clouds is preserved through the implicit representations for downstream tasks. More importantly, the domain-specific variations can be effectively learned away in the implicit space. We also propose an adaptive strategy to compute unsigned distance fields for arbitrary point clouds due to the lack of shape models in practice. When combined with a task loss, the proposed method outperforms state-of-the-art unsupervised domain adaptation methods that rely on adversarial domain alignment and more complicated self-supervised tasks. Our method is evaluated on both PointDA-10 and GraspNet datasets. The code and trained models will be publicly available.
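
For readers unfamiliar with unsigned distance fields, the following minimal sketch shows the basic quantity involved: the distance from a query location to the nearest point of a cloud. It is a plain brute-force nearest-neighbor computation, not the adaptive strategy the abstract refers to, whose details are not given here; the function name `unsigned_distance` and the sample data are hypothetical.

```python
import math


def unsigned_distance(query, cloud):
    """Distance from `query` to its nearest neighbor in `cloud`.

    `query` is an (x, y, z) tuple and `cloud` is a non-empty list of such
    tuples. Brute force, O(len(cloud)) per query; fine for a small example.
    """
    return min(math.dist(query, point) for point in cloud)


# A tiny hypothetical point cloud.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

d_above = unsigned_distance((0.0, 0.0, 2.0), cloud)    # query above the origin
d_on = unsigned_distance((1.0, 0.0, 0.0), cloud)       # query on a cloud point
print(d_above, d_on)
```

Because the field is unsigned, it needs no notion of inside or outside, which is what makes it computable directly from raw points when no closed shape model is available.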

Machine Learning and the Internet of Things Enable Steam Flood Optimization for Improved Oil Production

Aug 30, 2019

Mi Yan, Jonathan C. MacDonald, Chris T. Reaume, Wesley Cobb, Tamas Toth, Sarah S. Karthigan

Abstract: Recently developed machine learning techniques, in association with the Internet of Things (IoT), allow for the implementation of a method of increasing oil production from heavy-oil wells. Steam flood injection, a widely used enhanced oil recovery technique, uses thermal and gravitational potential to mobilize and dilute heavy oil in situ to increase oil production. In contrast to traditional steam flood simulations based on principles of classical physics, we introduce here an approach using cutting-edge machine learning techniques that have the potential to provide a better way to describe the performance of steam flood. We propose a workflow to address a category of time-series data that can be analyzed with supervised machine learning algorithms and IoT. We demonstrate the effectiveness of the technique for forecasting oil production in steam flood scenarios. Moreover, we build an optimization system that recommends an optimal steam allocation plan, and show that it leads to a 3% improvement in oil production. We develop a minimum viable product on a cloud platform that can implement real-time data collection, transfer, and storage, as well as the training and implementation of a cloud-based machine learning model. This workflow also offers an applicable solution to other problems with similar time-series data structures, such as predictive maintenance.

* Accepted by the 1st International Workshop on Artificial Intelligence of Things at KDD 2019
