Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jianglong Ye

Co-Design of Soft Gripper with Neural Physics

May 26, 2025

Sha Yi, Xueqian Bai, Adabhav Singh, Jianglong Ye, Michael T Tolley, Xiaolong Wang

Abstract:For robot manipulation, both the controller and end-effector design are crucial. Soft grippers are generalizable by deforming to different geometries, but designing such a gripper and finding its grasp pose remains challenging. In this paper, we propose a co-design framework that generates an optimized soft gripper's block-wise stiffness distribution and its grasping pose, using a neural physics model trained in simulation. We derived a uniform-pressure tendon model for a flexure-based soft finger, then generated a diverse dataset by randomizing both gripper pose and design parameters. A neural network is trained to approximate this forward simulation, yielding a fast, differentiable surrogate. We embed that surrogate in an end-to-end optimization loop to optimize the ideal stiffness configuration and best grasp pose. Finally, we 3D-print the optimized grippers of various stiffness by changing the structural parameters. We demonstrate that our co-designed grippers significantly outperform baseline designs in both simulation and hardware experiments.

Via

Access Paper or Ask Questions

M3: 3D-Spatial MultiModal Memory

Mar 20, 2025

Xueyan Zou, Yuchen Song, Ri-Zhao Qiu, Xuanbin Peng, Jianglong Ye, Sifei Liu, Xiaolong Wang

Abstract:We present 3D Spatial MultiModal Memory (M3), a multimodal memory system designed to retain information about medium-sized static scenes through video sources for visual perception. By integrating 3D Gaussian Splatting techniques with foundation models, M3 builds a multimodal memory capable of rendering feature representations across granularities, encompassing a wide range of knowledge. In our exploration, we identify two key challenges in previous works on feature splatting: (1) computational constraints in storing high-dimensional features for each Gaussian primitive, and (2) misalignment or information loss between distilled features and foundation model features. To address these challenges, we propose M3 with key components of principal scene components and Gaussian memory attention, enabling efficient training and inference. To validate M3, we conduct comprehensive quantitative evaluations of feature similarity and downstream tasks, as well as qualitative visualizations to highlight the pixel trace of Gaussian memory attention. Our approach encompasses a diverse range of foundation models, including vision-language models (VLMs), perception models, and large multimodal and language models (LMMs/LLMs). Furthermore, to demonstrate real-world applicability, we deploy M3's feature field in indoor scenes on a quadruped robot. Notably, we claim that M3 is the first work to address the core compression challenges in 3D feature distillation.

* ICLR2025 homepage: https://m3-spatial-memory.github.io code: https://github.com/MaureenZOU/m3-spatial

Via

Access Paper or Ask Questions

Learning Generalizable Feature Fields for Mobile Manipulation

Mar 12, 2024

Ri-Zhao Qiu, Yafei Hu, Ge Yang, Yuchen Song, Yang Fu, Jianglong Ye, Jiteng Mu, Ruihan Yang, Nikolay Atanasov, Sebastian Scherer(+1 more)

Abstract:An open problem in mobile manipulation is how to represent objects and scenes in a unified manner, so that robots can use it both for navigating in the environment and manipulating objects. The latter requires capturing intricate geometry while understanding fine-grained semantics, whereas the former involves capturing the complexity inherit to an expansive physical scale. In this work, we present GeFF (Generalizable Feature Fields), a scene-level generalizable neural feature field that acts as a unified representation for both navigation and manipulation that performs in real-time. To do so, we treat generative novel view synthesis as a pre-training task, and then align the resulting rich scene priors with natural language via CLIP feature distillation. We demonstrate the effectiveness of this approach by deploying GeFF on a quadrupedal robot equipped with a manipulator. We evaluate GeFF's ability to generalize to open-set objects as well as running time, when performing open-vocabulary mobile manipulation in dynamic scenes.

* Preprint. Project website is at: https://geff-b1.github.io/

Via

Access Paper or Ask Questions

Consistent-1-to-3: Consistent Image to 3D View Synthesis via Geometry-aware Diffusion Models

Oct 04, 2023

Jianglong Ye, Peng Wang, Kejie Li, Yichun Shi, Heng Wang

Figure 1 for Consistent-1-to-3: Consistent Image to 3D View Synthesis via Geometry-aware Diffusion Models

Figure 2 for Consistent-1-to-3: Consistent Image to 3D View Synthesis via Geometry-aware Diffusion Models

Figure 3 for Consistent-1-to-3: Consistent Image to 3D View Synthesis via Geometry-aware Diffusion Models

Figure 4 for Consistent-1-to-3: Consistent Image to 3D View Synthesis via Geometry-aware Diffusion Models

Abstract:Zero-shot novel view synthesis (NVS) from a single image is an essential problem in 3D object understanding. While recent approaches that leverage pre-trained generative models can synthesize high-quality novel views from in-the-wild inputs, they still struggle to maintain 3D consistency across different views. In this paper, we present Consistent-1-to-3, which is a generative framework that significantly mitigate this issue. Specifically, we decompose the NVS task into two stages: (i) transforming observed regions to a novel view, and (ii) hallucinating unseen regions. We design a scene representation transformer and view-conditioned diffusion model for performing these two stages respectively. Inside the models, to enforce 3D consistency, we propose to employ epipolor-guided attention to incorporate geometry constraints, and multi-view attention to better aggregate multi-view information. Finally, we design a hierarchy generation paradigm to generate long sequences of consistent views, allowing a full 360 observation of the provided object image. Qualitative and quantitative evaluation over multiple datasets demonstrate the effectiveness of the proposed mechanisms against state-of-the-art approaches. Our project page is at https://jianglongye.com/consistent123/

* Project page: https://jianglongye.com/consistent123/

Via

Access Paper or Ask Questions

GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields

Sep 01, 2023

Yanjie Ze, Ge Yan, Yueh-Hua Wu, Annabella Macaluso, Yuying Ge, Jianglong Ye, Nicklas Hansen, Li Erran Li, Xiaolong Wang

Abstract:It is a long-standing problem in robotics to develop agents capable of executing diverse manipulation tasks from visual observations in unstructured real-world environments. To achieve this goal, the robot needs to have a comprehensive understanding of the 3D structure and semantics of the scene. In this work, we present $\textbf{GNFactor}$, a visual behavior cloning agent for multi-task robotic manipulation with $\textbf{G}$eneralizable $\textbf{N}$eural feature $\textbf{F}$ields. GNFactor jointly optimizes a generalizable neural field (GNF) as a reconstruction module and a Perceiver Transformer as a decision-making module, leveraging a shared deep 3D voxel representation. To incorporate semantics in 3D, the reconstruction module utilizes a vision-language foundation model ($\textit{e.g.}$, Stable Diffusion) to distill rich semantic information into the deep 3D voxel. We evaluate GNFactor on 3 real robot tasks and perform detailed ablations on 10 RLBench tasks with a limited number of demonstrations. We observe a substantial improvement of GNFactor over current state-of-the-art methods in seen and unseen tasks, demonstrating the strong generalization ability of GNFactor. Our project website is https://yanjieze.com/GNFactor/ .

* CoRL 2023 Oral. Website: https://yanjieze.com/GNFactor/

Via

Access Paper or Ask Questions

MVDream: Multi-view Diffusion for 3D Generation

Aug 31, 2023

Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, Xiao Yang

Abstract:We propose MVDream, a multi-view diffusion model that is able to generate geometrically consistent multi-view images from a given text prompt. By leveraging image diffusion models pre-trained on large-scale web datasets and a multi-view dataset rendered from 3D assets, the resulting multi-view diffusion model can achieve both the generalizability of 2D diffusion and the consistency of 3D data. Such a model can thus be applied as a multi-view prior for 3D generation via Score Distillation Sampling, where it greatly improves the stability of existing 2D-lifting methods by solving the 3D consistency problem. Finally, we show that the multi-view diffusion model can also be fine-tuned under a few shot setting for personalized 3D generation, i.e. DreamBooth3D application, where the consistency can be maintained after learning the subject identity.

* Our project page is https://MV-Dream.github.io

Via

Access Paper or Ask Questions

FeatureNeRF: Learning Generalizable NeRFs by Distilling Foundation Models

Mar 22, 2023

Jianglong Ye, Naiyan Wang, Xiaolong Wang

Abstract:Recent works on generalizable NeRFs have shown promising results on novel view synthesis from single or few images. However, such models have rarely been applied on other downstream tasks beyond synthesis such as semantic understanding and parsing. In this paper, we propose a novel framework named FeatureNeRF to learn generalizable NeRFs by distilling pre-trained vision foundation models (e.g., DINO, Latent Diffusion). FeatureNeRF leverages 2D pre-trained foundation models to 3D space via neural rendering, and then extract deep features for 3D query points from NeRF MLPs. Consequently, it allows to map 2D images to continuous 3D semantic feature volumes, which can be used for various downstream tasks. We evaluate FeatureNeRF on tasks of 2D/3D semantic keypoint transfer and 2D/3D object part segmentation. Our extensive experiments demonstrate the effectiveness of FeatureNeRF as a generalizable 3D semantic feature extractor. Our project page is available at https://jianglongye.com/featurenerf/ .

* Project page: https://jianglongye.com/featurenerf/

Via

Access Paper or Ask Questions

Learning Continuous Grasping Function with a Dexterous Hand from Human Demonstrations

Jul 12, 2022

Jianglong Ye, Jiashun Wang, Binghao Huang, Yuzhe Qin, Xiaolong Wang

Figure 1 for Learning Continuous Grasping Function with a Dexterous Hand from Human Demonstrations

Figure 2 for Learning Continuous Grasping Function with a Dexterous Hand from Human Demonstrations

Figure 3 for Learning Continuous Grasping Function with a Dexterous Hand from Human Demonstrations

Figure 4 for Learning Continuous Grasping Function with a Dexterous Hand from Human Demonstrations

Abstract:We propose to learn to generate grasping motion for manipulation with a dexterous hand using implicit functions. With continuous time inputs, the model can generate a continuous and smooth grasping plan. We name the proposed model Continuous Grasping Function (CGF). CGF is learned via generative modeling with a Conditional Variational Autoencoder using 3D human demonstrations. We will first convert the large-scale human-object interaction trajectories to robot demonstrations via motion retargeting, and then use these demonstrations to train CGF. During inference, we perform sampling with CGF to generate different grasping plans in the simulator and select the successful ones to transfer to the real robot. By training on diverse human data, our CGF allows generalization to manipulate multiple objects. Compared to previous planning algorithms, CGF is more efficient and achieves significant improvement on success rate when transferred to grasping with the real Allegro Hand. Our project page is at https://jianglongye.com/cgf .

* Project page: https://jianglongye.com/cgf

Via

Access Paper or Ask Questions

GIFS: Neural Implicit Function for General Shape Representation

Apr 14, 2022

Jianglong Ye, Yuntao Chen, Naiyan Wang, Xiaolong Wang

Figure 1 for GIFS: Neural Implicit Function for General Shape Representation

Figure 2 for GIFS: Neural Implicit Function for General Shape Representation

Figure 3 for GIFS: Neural Implicit Function for General Shape Representation

Figure 4 for GIFS: Neural Implicit Function for General Shape Representation

Abstract:Recent development of neural implicit function has shown tremendous success on high-quality 3D shape reconstruction. However, most works divide the space into inside and outside of the shape, which limits their representing power to single-layer and watertight shapes. This limitation leads to tedious data processing (converting non-watertight raw data to watertight) as well as the incapability of representing general object shapes in the real world. In this work, we propose a novel method to represent general shapes including non-watertight shapes and shapes with multi-layer surfaces. We introduce General Implicit Function for 3D Shape (GIFS), which models the relationships between every two points instead of the relationships between points and surfaces. Instead of dividing 3D space into predefined inside-outside regions, GIFS encodes whether two points are separated by any surface. Experiments on ShapeNet show that GIFS outperforms previous state-of-the-art methods in terms of reconstruction quality, rendering efficiency, and visual fidelity. Project page is available at https://jianglongye.com/gifs .

* Accepted to CVPR 2022. Project page: https://jianglongye.com/gifs

Via

Access Paper or Ask Questions

Online Adaptation for Implicit Object Tracking and Shape Reconstruction in the Wild

Nov 24, 2021

Jianglong Ye, Yuntao Chen, Naiyan Wang, Xiaolong Wang

Figure 1 for Online Adaptation for Implicit Object Tracking and Shape Reconstruction in the Wild

Figure 2 for Online Adaptation for Implicit Object Tracking and Shape Reconstruction in the Wild

Figure 3 for Online Adaptation for Implicit Object Tracking and Shape Reconstruction in the Wild

Figure 4 for Online Adaptation for Implicit Object Tracking and Shape Reconstruction in the Wild

Abstract:Tracking and reconstructing 3D objects from cluttered scenes are the key components for computer vision, robotics and autonomous driving systems. While recent progress in implicit function (e.g., DeepSDF) has shown encouraging results on high-quality 3D shape reconstruction, it is still very challenging to generalize to cluttered and partially observable LiDAR data. In this paper, we propose to leverage the continuity in video data. We introduce a novel and unified framework which utilizes a DeepSDF model to simultaneously track and reconstruct 3D objects in the wild. We online adapt the DeepSDF model in the video, iteratively improving the shape reconstruction while in return improving the tracking, and vice versa. We experiment with both Waymo and KITTI datasets, and show significant improvements over state-of-the-art methods for both tracking and shape reconstruction.

Via

Access Paper or Ask Questions