Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yifei Shi

PartDexTOG: Generating Dexterous Task-Oriented Grasping via Language-driven Part Analysis

May 18, 2025

Weishang Wu, Yifei Shi, Zhizhong Chen, Zhipong Cai

Abstract:Task-oriented grasping is a crucial yet challenging task in robotic manipulation. Despite the recent progress, few existing methods address task-oriented grasping with dexterous hands. Dexterous hands provide better precision and versatility, enabling robots to perform task-oriented grasping more effectively. In this paper, we argue that part analysis can enhance dexterous grasping by providing detailed information about the object's functionality. We propose PartDexTOG, a method that generates dexterous task-oriented grasps via language-driven part analysis. Taking a 3D object and a manipulation task represented by language as input, the method first generates the category-level and part-level grasp descriptions w.r.t the manipulation task by LLMs. Then, a category-part conditional diffusion model is developed to generate a dexterous grasp for each part, respectively, based on the generated descriptions. To select the most plausible combination of grasp and corresponding part from the generated ones, we propose a measure of geometric consistency between grasp and part. We show that our method greatly benefits from the open-world knowledge reasoning on object parts by LLMs, which naturally facilitates the learning of grasp generation on objects with different geometry and for different manipulation tasks. Our method ranks top on the OakInk-shape dataset over all previous methods, improving the Penetration Volume, the Grasp Displace, and the P-FID over the state-of-the-art by $3.58\%$, $2.87\%$, and $41.43\%$, respectively. Notably, it demonstrates good generality in handling novel categories and tasks.

Via

Access Paper or Ask Questions

NVSPolicy: Adaptive Novel-View Synthesis for Generalizable Language-Conditioned Policy Learning

May 15, 2025

Le Shi, Yifei Shi, Xin Xu, Tenglong Liu, Junhua Xi, Chengyuan Chen

Abstract:Recent advances in deep generative models demonstrate unprecedented zero-shot generalization capabilities, offering great potential for robot manipulation in unstructured environments. Given a partial observation of a scene, deep generative models could generate the unseen regions and therefore provide more context, which enhances the capability of robots to generalize across unseen environments. However, due to the visual artifacts in generated images and inefficient integration of multi-modal features in policy learning, this direction remains an open challenge. We introduce NVSPolicy, a generalizable language-conditioned policy learning method that couples an adaptive novel-view synthesis module with a hierarchical policy network. Given an input image, NVSPolicy dynamically selects an informative viewpoint and synthesizes an adaptive novel-view image to enrich the visual context. To mitigate the impact of the imperfect synthesized images, we adopt a cycle-consistent VAE mechanism that disentangles the visual features into the semantic feature and the remaining feature. The two features are then fed into the hierarchical policy network respectively: the semantic feature informs the high-level meta-skill selection, and the remaining feature guides low-level action estimation. Moreover, we propose several practical mechanisms to make the proposed method efficient. Extensive experiments on CALVIN demonstrate the state-of-the-art performance of our method. Specifically, it achieves an average success rate of 90.4\% across all tasks, greatly outperforming the recent methods. Ablation studies confirm the significance of our adaptive novel-view synthesis paradigm. In addition, we evaluate NVSPolicy on a real-world robotic platform to demonstrate its practical applicability.

Via

Access Paper or Ask Questions

DK-SLAM: Monocular Visual SLAM with Deep Keypoints Adaptive Learning, Tracking and Loop-Closing

Jan 17, 2024

Hao Qu, Lilian Zhang, Jun Mao, Junbo Tie, Xiaofeng He, Xiaoping Hu, Yifei Shi, Changhao Chen

Figure 1 for DK-SLAM: Monocular Visual SLAM with Deep Keypoints Adaptive Learning, Tracking and Loop-Closing

Figure 2 for DK-SLAM: Monocular Visual SLAM with Deep Keypoints Adaptive Learning, Tracking and Loop-Closing

Figure 3 for DK-SLAM: Monocular Visual SLAM with Deep Keypoints Adaptive Learning, Tracking and Loop-Closing

Figure 4 for DK-SLAM: Monocular Visual SLAM with Deep Keypoints Adaptive Learning, Tracking and Loop-Closing

Abstract:Unreliable feature extraction and matching in handcrafted features undermine the performance of visual SLAM in complex real-world scenarios. While learned local features, leveraging CNNs, demonstrate proficiency in capturing high-level information and excel in matching benchmarks, they encounter challenges in continuous motion scenes, resulting in poor generalization and impacting loop detection accuracy. To address these issues, we present DK-SLAM, a monocular visual SLAM system with adaptive deep local features. MAML optimizes the training of these features, and we introduce a coarse-to-fine feature tracking approach. Initially, a direct method approximates the relative pose between consecutive frames, followed by a feature matching method for refined pose estimation. To counter cumulative positioning errors, a novel online learning binary feature-based online loop closure module identifies loop nodes within a sequence. Experimental results underscore DK-SLAM's efficacy, outperforms representative SLAM solutions, such as ORB-SLAM3 on publicly available datasets.

* In submission

Via

Access Paper or Ask Questions

Continual Learning through Networks Splitting and Merging with Dreaming-Meta-Weighted Model Fusion

Dec 12, 2023

Yi Sun, Xin Xu, Jian Li, Guanglei Xie, Yifei Shi, Qiang Fang

Abstract:It's challenging to balance the networks stability and plasticity in continual learning scenarios, considering stability suffers from the update of model and plasticity benefits from it. Existing works usually focus more on the stability and restrict the learning plasticity of later tasks to avoid catastrophic forgetting of learned knowledge. Differently, we propose a continual learning method named Split2MetaFusion which can achieve better trade-off by employing a two-stage strategy: splitting and meta-weighted fusion. In this strategy, a slow model with better stability, and a fast model with better plasticity are learned sequentially at the splitting stage. Then stability and plasticity are both kept by fusing the two models in an adaptive manner. Towards this end, we design an optimizer named Task-Preferred Null Space Projector(TPNSP) to the slow learning process for narrowing the fusion gap. To achieve better model fusion, we further design a Dreaming-Meta-Weighted fusion policy for better maintaining the old and new knowledge simultaneously, which doesn't require to use the previous datasets. Experimental results and analysis reported in this work demonstrate the superiority of the proposed method for maintaining networks stability and keeping its plasticity. Our code will be released.

Via

Access Paper or Ask Questions

SuperUDF: Self-supervised UDF Estimation for Surface Reconstruction

Aug 28, 2023

Hui Tian, Chenyang Zhu, Yifei Shi, Kai Xu

Abstract:Learning-based surface reconstruction based on unsigned distance functions (UDF) has many advantages such as handling open surfaces. We propose SuperUDF, a self-supervised UDF learning which exploits a learned geometry prior for efficient training and a novel regularization for robustness to sparse sampling. The core idea of SuperUDF draws inspiration from the classical surface approximation operator of locally optimal projection (LOP). The key insight is that if the UDF is estimated correctly, the 3D points should be locally projected onto the underlying surface following the gradient of the UDF. Based on that, a number of inductive biases on UDF geometry and a pre-learned geometry prior are devised to learn UDF estimation efficiently. A novel regularization loss is proposed to make SuperUDF robust to sparse sampling. Furthermore, we also contribute a learning-based mesh extraction from the estimated UDFs. Extensive evaluations demonstrate that SuperUDF outperforms the state of the arts on several public datasets in terms of both quality and efficiency. Code will be released after accteptance.

Via

Access Paper or Ask Questions

RayMVSNet++: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo

Jul 16, 2023

Yifei Shi, Junhua Xi, Dewen Hu, Zhiping Cai, Kai Xu

Figure 1 for RayMVSNet++: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo

Figure 2 for RayMVSNet++: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo

Figure 3 for RayMVSNet++: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo

Figure 4 for RayMVSNet++: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo

Abstract:Learning-based multi-view stereo (MVS) has by far centered around 3D convolution on cost volumes. Due to the high computation and memory consumption of 3D CNN, the resolution of output depth is often considerably limited. Different from most existing works dedicated to adaptive refinement of cost volumes, we opt to directly optimize the depth value along each camera ray, mimicking the range finding of a laser scanner. This reduces the MVS problem to ray-based depth optimization which is much more light-weight than full cost volume optimization. In particular, we propose RayMVSNet which learns sequential prediction of a 1D implicit field along each camera ray with the zero-crossing point indicating scene depth. This sequential modeling, conducted based on transformer features, essentially learns the epipolar line search in traditional multi-view stereo. We devise a multi-task learning for better optimization convergence and depth accuracy. We found the monotonicity property of the SDFs along each ray greatly benefits the depth estimation. Our method ranks top on both the DTU and the Tanks & Temples datasets over all previous learning-based methods, achieving an overall reconstruction score of 0.33mm on DTU and an F-score of 59.48% on Tanks & Temples. It is able to produce high-quality depth estimation and point cloud reconstruction in challenging scenarios such as objects/scenes with non-textured surface, severe occlusion, and highly varying depth range. Further, we propose RayMVSNet++ to enhance contextual feature aggregation for each ray through designing an attentional gating unit to select semantically relevant neighboring rays within the local frustum around that ray. RayMVSNet++ achieves state-of-the-art performance on the ScanNet dataset. In particular, it attains an AbsRel of 0.058m and produces accurate results on the two subsets of textureless regions and large depth variation.

* IEEE Transactions on Pattern Analysis and Machine Intelligence. arXiv admin note: substantial text overlap with arXiv:2204.01320

Via

Access Paper or Ask Questions

Learning Task-preferred Inference Routes for Gradient De-conflict in Multi-output DNNs

May 31, 2023

Yi Sun, Xin Xu, Jian Li, Xiaochang Hu, Yifei Shi, Ling-Li Zeng

Figure 1 for Learning Task-preferred Inference Routes for Gradient De-conflict in Multi-output DNNs

Figure 2 for Learning Task-preferred Inference Routes for Gradient De-conflict in Multi-output DNNs

Figure 3 for Learning Task-preferred Inference Routes for Gradient De-conflict in Multi-output DNNs

Figure 4 for Learning Task-preferred Inference Routes for Gradient De-conflict in Multi-output DNNs

Abstract:Multi-output deep neural networks(MONs) contain multiple task branches, and these tasks usually share partial network filters that lead to the entanglement of different task inference routes. Due to the inconsistent optimization objectives, the task gradients used for training MONs will interfere with each other on the shared routes, which will decrease the overall model performance. To address this issue, we propose a novel gradient de-conflict algorithm named DR-MGF(Dynamic Routes and Meta-weighted Gradient Fusion) in this work. Different from existing de-conflict methods, DR-MGF achieves gradient de-conflict in MONs by learning task-preferred inference routes. The proposed method is motivated by our experimental findings: the shared filters are not equally important to different tasks. By designing the learnable task-specific importance variables, DR-MGF evaluates the importance of filters for different tasks. Through making the dominances of tasks over filters be proportional to the task-specific importance of filters, DR-MGF can effectively reduce the inter-task interference. The task-specific importance variables ultimately determine task-preferred inference routes at the end of training iterations. Extensive experimental results on CIFAR, ImageNet, and NYUv2 illustrate that DR-MGF outperforms the existing de-conflict methods both in prediction accuracy and convergence speed of MONs. Furthermore, DR-MGF can be extended to general MONs without modifying the overall network structures.

* 15 pages

Via

Access Paper or Ask Questions

SOCS: Semantically-aware Object Coordinate Space for Category-Level 6D Object Pose Estimation under Large Shape Variations

Mar 18, 2023

Boyan Wan, Yifei Shi, Kai Xu

Abstract:Most learning-based approaches to category-level 6D pose estimation are design around normalized object coordinate space (NOCS). While being successful, NOCS-based methods become inaccurate and less robust when handling objects of a category containing significant intra-category shape variations. This is because the object coordinates induced by global and rigid alignment of objects are semantically incoherent, making the coordinate regression hard to learn and generalize. We propose Semantically-aware Object Coordinate Space (SOCS) built by warping-and-aligning the objects guided by a sparse set of keypoints with semantically meaningful correspondence. SOCS is semantically coherent: Any point on the surface of a object can be mapped to a semantically meaningful location in SOCS, allowing for accurate pose and size estimation under large shape variations. To learn effective coordinate regression to SOCS, we propose a novel multi-scale coordinate-based attention network. Evaluations demonstrate that our method is easy to train, well-generalizing for large intra-category shape variations and robust to inter-object occlusions.

* 10 pages

Via

Access Paper or Ask Questions

RayMVSNet: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo

Apr 04, 2022

Junhua Xi, Yifei Shi, Yijie Wang, Yulan Guo, Kai Xu

* cvpr 2022, 11 pages

Via

Access Paper or Ask Questions

3DRM:Pair-wise relation module for 3D object detection

Feb 20, 2022

Yuqing Lan, Yao Duan, Yifei Shi, Hui Huang, Kai Xu

Figure 1 for 3DRM:Pair-wise relation module for 3D object detection

Figure 2 for 3DRM:Pair-wise relation module for 3D object detection

Figure 3 for 3DRM:Pair-wise relation module for 3D object detection

Figure 4 for 3DRM:Pair-wise relation module for 3D object detection

Abstract:Context has proven to be one of the most important factors in object layout reasoning for 3D scene understanding. Existing deep contextual models either learn holistic features for context encoding or rely on pre-defined scene templates for context modeling. We argue that scene understanding benefits from object relation reasoning, which is capable of mitigating the ambiguity of 3D object detections and thus helps locate and classify the 3D objects more accurately and robustly. To achieve this, we propose a novel 3D relation module (3DRM) which reasons about object relations at pair-wise levels. The 3DRM predicts the semantic and spatial relationships between objects and extracts the object-wise relation features. We demonstrate the effects of 3DRM by plugging it into proposal-based and voting-based 3D object detection pipelines, respectively. Extensive evaluations show the effectiveness and generalization of 3DRM on 3D object detection. Our source code is available at https://github.com/lanlan96/3DRM.

* Computers & Graphics, 2021, Vol. 98, Pages: 58 - 70
* 13 pages, 8 figures

Via

Access Paper or Ask Questions