Jiehong Lin

PicoPose: Progressive Pixel-to-Pixel Correspondence Learning for Novel Object Pose Estimation

Apr 03, 2025

Lihua Liu, Jiehong Lin, Zhenxin Liu, Kui Jia

Abstract:Novel object pose estimation from RGB images presents a significant challenge for zero-shot generalization, as it involves estimating the relative 6D transformation between an RGB observation and a CAD model of an object that was not seen during training. In this paper, we introduce PicoPose, a novel framework designed to tackle this task using a three-stage pixel-to-pixel correspondence learning process. Firstly, PicoPose matches features from the RGB observation with those from rendered object templates, identifying the best-matched template and establishing coarse correspondences. Secondly, PicoPose smooths the correspondences by globally regressing a 2D affine transformation, including in-plane rotation, scale, and 2D translation, from the coarse correspondence map. Thirdly, PicoPose applies the affine transformation to the feature map of the best-matched template and learns correspondence offsets within local regions to achieve fine-grained correspondences. By progressively refining the correspondences, PicoPose significantly improves the accuracy of object poses computed via PnP/RANSAC. PicoPose achieves state-of-the-art performance on the seven core datasets of the BOP benchmark, demonstrating exceptional generalization to novel objects represented by CAD models or object reference images. Code and models are available at https://github.com/foollh/PicoPose.
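
The second stage described above regresses a 2D affine transformation made up of an in-plane rotation, a scale, and a 2D translation. The following is a minimal sketch of how such a transformation acts on pixel coordinates, not the authors' implementation; the function names `make_affine` and `apply_affine`, the isotropic scale, and the rotate-scale-then-translate convention are assumptions made for illustration.

```python
import math

def make_affine(theta, scale, tx, ty):
    """Build a 2x3 affine matrix from an in-plane rotation (radians),
    an isotropic scale, and a 2D translation (hypothetical convention)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[scale * c, -scale * s, tx],
            [scale * s,  scale * c, ty]]

def apply_affine(A, points):
    """Map each (x, y) pixel coordinate through the 2x3 affine matrix A."""
    return [(A[0][0] * x + A[0][1] * y + A[0][2],
             A[1][0] * x + A[1][1] * y + A[1][2]) for x, y in points]

# Two template pixel coordinates, rotated by 90 degrees, scaled by 2,
# then shifted by (10, 5).
template_pixels = [(1.0, 0.0), (0.0, 1.0)]
A = make_affine(math.pi / 2, 2.0, 10.0, 5.0)
warped = apply_affine(A, template_pixels)
print(warped)  # approximately [(10.0, 7.0), (8.0, 5.0)]
```

In the pipeline the abstract describes, a transformation of this kind is applied to the feature map of the best-matched template rather than to a handful of points; the sketch only shows the coordinate mapping itself.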

SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation

Nov 27, 2023

Jiehong Lin, Lihua Liu, Dekun Lu, Kui Jia

Abstract:Zero-shot 6D object pose estimation involves the detection of novel objects with their 6D poses in cluttered scenes, presenting significant challenges for model generalizability. Fortunately, the recent Segment Anything Model (SAM) has showcased remarkable zero-shot transfer performance, which provides a promising solution to tackle this task. Motivated by this, we introduce SAM-6D, a novel framework designed to realize the task through two steps, including instance segmentation and pose estimation. Given the target objects, SAM-6D employs two dedicated sub-networks, namely Instance Segmentation Model (ISM) and Pose Estimation Model (PEM), to perform these steps on cluttered RGB-D images. ISM takes SAM as an advanced starting point to generate all possible object proposals and selectively preserves valid ones through meticulously crafted object matching scores in terms of semantics, appearance and geometry. By treating pose estimation as a partial-to-partial point matching problem, PEM performs a two-stage point matching process featuring a novel design of background tokens to construct dense 3D-3D correspondence, ultimately yielding the pose estimates. Without bells and whistles, SAM-6D outperforms the existing methods on the seven core datasets of the BOP Benchmark for both instance segmentation and pose estimation of novel objects.

* GitHub Page: https://github.com/JiehongLin/SAM-6D

VI-Net: Boosting Category-level 6D Object Pose Estimation via Learning Decoupled Rotations on the Spherical Representations

Aug 19, 2023

Jiehong Lin, Zewei Wei, Yabin Zhang, Kui Jia

Abstract:Rotation estimation of high precision from an RGB-D object observation is a huge challenge in 6D object pose estimation, due to the difficulty of learning in the non-linear space of SO(3). In this paper, we propose a novel rotation estimation network, termed VI-Net, to make the task easier by decoupling the rotation as the combination of a viewpoint rotation and an in-plane rotation. More specifically, VI-Net bases the feature learning on the sphere with two individual branches for the estimates of two factorized rotations, where a V-Branch is employed to learn the viewpoint rotation via binary classification on the spherical signals, while another I-Branch is used to estimate the in-plane rotation by transforming the signals to view from the zenith direction. To process the spherical signals, a Spherical Feature Pyramid Network is constructed based on a novel design of SPAtial Spherical Convolution (SPA-SConv), which settles the boundary problem of spherical signals via feature padding and realizes viewpoint-equivariant feature extraction by symmetric convolutional operations. We apply the proposed VI-Net to the challenging task of category-level 6D object pose estimation for predicting the poses of unknown objects without available CAD models; experiments on the benchmarking datasets confirm the efficacy of our method, which outperforms the existing ones by a large margin in the regime of high precision.

* Accepted by ICCV 2023. Project Page: https://github.com/JiehongLin/VI-Net
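
The abstract above describes decoupling a rotation into a viewpoint rotation and an in-plane rotation. The sketch below shows one way such a factorization can be composed, under stated assumptions: the viewpoint is parameterized by an azimuth and a polar angle on the sphere, the in-plane rotation is a rotation about the zenith (z) axis, and the helper names are hypothetical. It is an illustration of the idea, not the paper's parameterization or network.

```python
import math

def rot_z(angle):
    """Rotation about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_y(angle):
    """Rotation about the y-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def compose_rotation(azimuth, polar, inplane):
    """Full rotation = viewpoint rotation (2 DoF) times in-plane rotation (1 DoF)."""
    viewpoint = matmul(rot_z(azimuth), rot_y(polar))  # moves the zenith to the viewpoint
    return matmul(viewpoint, rot_z(inplane))          # spins about the zenith first

R = compose_rotation(0.3, 1.1, -0.7)
# The image of the zenith direction depends only on the viewpoint angles.
zenith_image = [R[i][2] for i in range(3)]
print(zenith_image)
```

Under this convention the in-plane angle leaves the zenith direction fixed, so the viewpoint (two degrees of freedom) and the in-plane angle (one degree of freedom) can be estimated separately and multiplied together afterwards.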

Manifold-Aware Self-Training for Unsupervised Domain Adaptation on Regressing 6D Object Pose

May 18, 2023

Yichen Zhang, Jiehong Lin, Ke Chen, Zelin Xu, Yaowei Wang, Kui Jia

Abstract:Domain gap between synthetic and real data in visual regression (e.g., 6D pose estimation) is bridged in this paper via global feature alignment and local refinement on the coarse classification of discretized anchor classes in target space, which imposes a piece-wise target manifold regularization into domain-invariant representation learning. Specifically, our method incorporates an explicit self-supervised manifold regularization, revealing consistent cumulative target dependency across domains, to a self-training scheme (e.g., the popular Self-Paced Self-Training) to encourage more discriminative transferable representations of regression tasks. Moreover, learning unified implicit neural functions to estimate relative direction and distance of targets to their nearest class bins aims to refine target classification predictions, which can gain robust performance against inconsistent feature scaling sensitive to UDA regressors. Experiment results on three public benchmarks of the challenging 6D pose estimation task can verify the effectiveness of our method, consistently achieving superior performance to the state-of-the-art for UDA on 6D pose estimation.

* Accepted by IJCAI 2023

Point-DAE: Denoising Autoencoders for Self-supervised Point Cloud Learning

Nov 13, 2022

Yabin Zhang, Jiehong Lin, Ruihuang Li, Kui Jia, Lei Zhang

Abstract:Masked autoencoder has demonstrated its effectiveness in self-supervised point cloud learning. Considering that masking is a kind of corruption, in this work we explore a more general denoising autoencoder for point cloud learning (Point-DAE) by investigating more types of corruptions beyond masking. Specifically, we degrade the point cloud with certain corruptions as input, and learn an encoder-decoder model to reconstruct the original point cloud from its corrupted version. Three corruption families (i.e., density/masking, noise, and affine transformation) and a total of fourteen corruption types are investigated. Interestingly, the affine transformation-based Point-DAE generally outperforms others (e.g., the popular masking corruptions), suggesting a promising direction for self-supervised point cloud learning. More importantly, we find a statistically significant linear relationship between task relatedness and model performance on downstream tasks. This finding partly demystifies the advantage of affine transformation-based Point-DAE, given that such Point-DAE variants are closely related to the downstream classification task. Additionally, we reveal that most Point-DAE variants unintentionally benefit from the manually-annotated canonical poses in the pre-training dataset. To tackle such an issue, we promote a new dataset setting by estimating object poses automatically. The codes will be available at https://github.com/YBZh/Point-DAE.

* Codes will be available at https://github.com/YBZh/Point-DAE

DCL-Net: Deep Correspondence Learning Network for 6D Pose Estimation

Oct 11, 2022

Hongyang Li, Jiehong Lin, Kui Jia

Abstract:Establishment of point correspondence between camera and object coordinate systems is a promising way to solve 6D object poses. However, surrogate objectives of correspondence learning in 3D space are a step away from the true ones of object pose estimation, making the learning suboptimal for the end task. In this paper, we address this shortcoming by introducing a new method of Deep Correspondence Learning Network for direct 6D object pose estimation, shortened as DCL-Net. Specifically, DCL-Net employs dual newly proposed Feature Disengagement and Alignment (FDA) modules to establish, in the feature space, partial-to-partial correspondence and complete-to-complete one for partial object observation and its complete CAD model, respectively, which result in aggregated pose and match feature pairs from two coordinate systems; these two FDA modules thus bring complementary advantages. The match feature pairs are used to learn confidence scores for measuring the qualities of deep correspondence, while the pose feature pairs are weighted by confidence scores for direct object pose regression. A confidence-based pose refinement network is also proposed to further improve pose precision in an iterative manner. Extensive experiments show that DCL-Net outperforms existing methods on three benchmarking datasets, including YCB-Video, LineMOD, and Occlusion-LineMOD; ablation studies also confirm the efficacy of our novel designs.

* ECCV 2022

Category-Level 6D Object Pose and Size Estimation using Self-Supervised Deep Prior Deformation Networks

Jul 12, 2022

Jiehong Lin, Zewei Wei, Changxing Ding, Kui Jia

Abstract:It is difficult to precisely annotate object instances and their semantics in 3D space, and as such, synthetic data are extensively used for these tasks, e.g., category-level 6D object pose and size estimation. However, the easy annotations in synthetic domains bring the downside effect of synthetic-to-real (Sim2Real) domain gap. In this work, we aim to address this issue in the task setting of Sim2Real, unsupervised domain adaptation for category-level 6D object pose and size estimation. We propose a method that is built upon a novel Deep Prior Deformation Network, shortened as DPDN. DPDN learns to deform features of categorical shape priors to match those of object observations, and is thus able to establish deep correspondence in the feature space for direct regression of object poses and sizes. To reduce the Sim2Real domain gap, we formulate a novel self-supervised objective upon DPDN via consistency learning; more specifically, we apply two rigid transformations to each object observation in parallel, and feed them into DPDN respectively to yield dual sets of predictions; on top of the parallel learning, an inter-consistency term is employed to keep cross consistency between dual predictions for improving the sensitivity of DPDN to pose changes, while individual intra-consistency ones are used to enforce self-adaptation within each learning itself. We train DPDN on both training sets of the synthetic CAMERA25 and real-world REAL275 datasets; our results outperform the existing methods on REAL275 test set under both the unsupervised and supervised settings. Ablation studies also verify the efficacy of our designs. Our code is released publicly at https://github.com/JiehongLin/Self-DPDN.

* Accepted by ECCV 2022

Masked Surfel Prediction for Self-Supervised Point Cloud Learning

Jul 07, 2022

Yabin Zhang, Jiehong Lin, Chenhang He, Yongwei Chen, Kui Jia, Lei Zhang

Abstract:Masked auto-encoding is a popular and effective self-supervised learning approach to point cloud learning. However, most of the existing methods reconstruct only the masked points and overlook the local geometry information, which is also important to understand the point cloud data. In this work, we make the first attempt, to the best of our knowledge, to consider the local geometry information explicitly into the masked auto-encoding, and propose a novel Masked Surfel Prediction (MaskSurf) method. Specifically, given the input point cloud masked at a high ratio, we learn a transformer-based encoder-decoder network to estimate the underlying masked surfels by simultaneously predicting the surfel positions (i.e., points) and per-surfel orientations (i.e., normals). The predictions of points and normals are supervised by the Chamfer Distance and a newly introduced Position-Indexed Normal Distance in a set-to-set manner. Our MaskSurf is validated on six downstream tasks under three fine-tuning strategies. In particular, MaskSurf outperforms its closest competitor, Point-MAE, by 1.2% on the real-world dataset of ScanObjectNN under the OBJ-BG setting, justifying the advantages of masked surfel prediction over masked point cloud reconstruction. Codes will be available at https://github.com/YBZh/MaskSurf.

* Codes will be available at https://github.com/YBZh/MaskSurf
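
The abstract above names the Chamfer Distance as the set-to-set supervision for predicted points. For readers unfamiliar with it, here is a minimal sketch of a symmetric Chamfer Distance between two small point sets. The choice of squared Euclidean distances averaged per set is an assumption (several variants are in use), and the function names are hypothetical; this is not the MaskSurf training code, and the Position-Indexed Normal Distance is not shown.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def chamfer_distance(P, Q):
    """Symmetric Chamfer Distance: for every point, take the squared distance
    to its nearest neighbour in the other set, average per set, and add."""
    p_to_q = sum(min(squared_distance(p, q) for q in Q) for p in P) / len(P)
    q_to_p = sum(min(squared_distance(q, p) for p in P) for q in Q) / len(Q)
    return p_to_q + q_to_p

predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
loss = chamfer_distance(predicted, target)
print(loss)  # 1.0
```

Because each point is matched to its nearest neighbour rather than to a fixed index, the measure does not depend on the order of points in either set, which is what makes it suitable for comparing unordered point clouds.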

Sparse Steerable Convolutions: An Efficient Learning of SE(3)-Equivariant Features for Estimation and Tracking of Object Poses in 3D Space

Nov 14, 2021

Jiehong Lin, Hongyang Li, Ke Chen, Jiangbo Lu, Kui Jia

Abstract:As a basic component of SE(3)-equivariant deep feature learning, steerable convolution has recently demonstrated its advantages for 3D semantic analysis. The advantages are, however, brought by expensive computations on dense, volumetric data, which prevent its practical use for efficient processing of 3D data that are inherently sparse. In this paper, we propose a novel design of Sparse Steerable Convolution (SS-Conv) to address the shortcoming; SS-Conv greatly accelerates steerable convolution with sparse tensors, while strictly preserving the property of SE(3)-equivariance. Based on SS-Conv, we propose a general pipeline for precise estimation of object poses, wherein a key design is a Feature-Steering module that takes the full advantage of SE(3)-equivariance and is able to conduct an efficient pose refinement. To verify our designs, we conduct thorough experiments on three tasks of 3D object semantic analysis, including instance-level 6D pose estimation, category-level 6D pose and size estimation, and category-level 6D pose tracking. Our proposed pipeline based on SS-Conv outperforms existing methods on almost all the metrics evaluated by the three tasks. Ablation studies also show the superiority of our SS-Conv over alternative convolutions in terms of both accuracy and efficiency. Our code is released publicly at https://github.com/Gorilla-Lab-SCUT/SS-Conv.

* Accepted by NeurIPS 2021

DualPoseNet: Category-level 6D Object Pose and Size Estimation using Dual Pose Network with Refined Learning of Pose Consistency

Apr 06, 2021

Jiehong Lin, Zewei Wei, Zhihao Li, Songcen Xu, Kui Jia, Yuanqing Li

Abstract:Category-level 6D object pose and size estimation is to predict 9 degrees-of-freedom (9DoF) pose configurations of rotation, translation, and size for object instances observed in single, arbitrary views of cluttered scenes. It extends previous related tasks with learning of the two additional rotation angles. This seemingly small difference poses technical challenges due to the learning and prediction in the full rotation space of SO(3). In this paper, we propose a new method of Dual Pose Network with refined learning of pose consistency for this task, shortened as DualPoseNet. DualPoseNet stacks two parallel pose decoders on top of a shared pose encoder, where the implicit decoder predicts object poses with a working mechanism different from that of the explicit one; they thus impose complementary supervision on the training of the pose encoder. We construct the encoder based on spherical convolutions, and design a Spherical Fusion module therein for a better embedding of pose-sensitive features from the appearance and shape observations. Given that no CAD models are available at test time, it is the novel introduction of the implicit decoder that enables the refined pose prediction during testing, by enforcing the predicted pose consistency between the two decoders using a self-adaptive loss term. Thorough experiments on the benchmark 9DoF object pose datasets of CAMERA25 and REAL275 confirm the efficacy of our designs. DualPoseNet outperforms existing methods by a large margin in the regime of high precision.
