Abstract:This paper introduces GS-Pose, an end-to-end framework for locating and estimating the 6D pose of objects. GS-Pose begins with a set of posed RGB images of a previously unseen object and builds three distinct representations stored in a database. At inference, GS-Pose operates sequentially by locating the object in the input image, estimating its initial 6D pose using a retrieval approach, and refining the pose with a render-and-compare method. The key insight is the application of the appropriate object representation at each stage of the process. In particular, for the refinement step, we utilize 3D Gaussian splatting, a novel differentiable rendering technique that offers high rendering speed and relatively low optimization time. Off-the-shelf toolchains and commodity hardware, such as mobile phones, can be used to capture new objects to be added to the database. Extensive evaluations on the LINEMOD and OnePose-LowTexture datasets demonstrate excellent performance, establishing the new state-of-the-art. Project page: https://dingdingcai.github.io/gs-pose.
Abstract:Point clouds have been a recent interest for ray tracing-based radio channel characterization, as sensors such as RGB-D cameras and laser scanners can be utilized to generate an accurate virtual copy of a physical environment. In this paper, a novel ray launching algorithm is presented, which operates directly on noisy point clouds acquired from sensor data. It produces coarse paths that are further refined to exact paths consisting of reflections and diffractions. A commercial ray tracing tool is utilized as the baseline for validating the simulated paths. A significant majority of the baseline paths is found. The robustness to noise is examined by artificially applying noise along the normal vector of each point. It is observed that the proposed method is capable of adapting to noise and finds similar paths compared to the baseline path trajectories with noisy point clouds. This is prevalent especially if the normal vectors of the points are estimated accurately. Lastly, a simulation is performed with a reconstructed point cloud and compared against channel measurements and the baseline paths. The resulting paths demonstrate similarity with the baseline path trajectories and exhibit an analogous pattern to the aggregated impulse response extracted from the measurements. Code available at https://github.com/nvaara/NimbusRT
Abstract:Ray tracing is a deterministic method that produces propagation paths between a transmitter and a receiver. The simulation accuracy is significantly influenced by the environment details. One way to capture the environment with great precision is the utilization of depth sensors and cameras. Such reconstructed environment is in the form of a point cloud. However, utilizing such data directly in ray tracing, a key aspect on vision-aided wireless communications, often involves a significant trade-off between accuracy and execution time. In this paper, we propose an open source novel and fast point cloud-based ray launching algorithm that produces exact paths, which provide a good basis for accurate modeling of radio channel characteristics. In experiments, preliminary validation of the ray tracer output is obtained with the aid of a commercial ray tracer.
Abstract:Despite the impressive performance of recent unbiased Scene Graph Generation (SGG) methods, the current debiasing literature mainly focuses on the long-tailed distribution problem, whereas it overlooks another source of bias, i.e., semantic confusion, which makes the SGG model prone to yield false predictions for similar relationships. In this paper, we explore a debiasing procedure for the SGG task leveraging causal inference. Our central insight is that the Sparse Mechanism Shift (SMS) in causality allows independent intervention on multiple biases, thereby potentially preserving head category performance while pursuing the prediction of high-informative tail relationships. However, the noisy datasets lead to unobserved confounders for the SGG task, and thus the constructed causal models are always causal-insufficient to benefit from SMS. To remedy this, we propose Two-stage Causal Modeling (TsCM) for the SGG task, which takes the long-tailed distribution and semantic confusion as confounders to the Structural Causal Model (SCM) and then decouples the causal intervention into two stages. The first stage is causal representation learning, where we use a novel Population Loss (P-Loss) to intervene in the semantic confusion confounder. The second stage introduces the Adaptive Logit Adjustment (AL-Adjustment) to eliminate the long-tailed distribution confounder to complete causal calibration learning. These two stages are model agnostic and thus can be used in any SGG model that seeks unbiased predictions. Comprehensive experiments conducted on the popular SGG backbones and benchmarks show that our TsCM can achieve state-of-the-art performance in terms of mean recall rate. Furthermore, TsCM can maintain a higher recall rate than other debiasing methods, which indicates that our method can achieve a better tradeoff between head and tail relationships.
Abstract:Human evaluation is critical for validating the performance of text-to-image generative models, as this highly cognitive process requires deep comprehension of text and images. However, our survey of 37 recent papers reveals that many works rely solely on automatic measures (e.g., FID) or perform poorly described human evaluations that are not reliable or repeatable. This paper proposes a standardized and well-defined human evaluation protocol to facilitate verifiable and reproducible human evaluation in future works. In our pilot data collection, we experimentally show that the current automatic measures are incompatible with human perception in evaluating the performance of the text-to-image generation results. Furthermore, we provide insights for designing human evaluation experiments reliably and conclusively. Finally, we make several resources publicly available to the community to facilitate easy and fast implementations.
Abstract:Acquiring labeled 6D poses from real images is an expensive and time-consuming task. Though massive amounts of synthetic RGB images are easy to obtain, the models trained on them suffer from noticeable performance degradation due to the synthetic-to-real domain gap. To mitigate this degradation, we propose a practical self-supervised domain adaptation approach that takes advantage of real RGB(-D) data without needing real pose labels. We first pre-train the model with synthetic RGB images and then utilize real RGB(-D) images to fine-tune the pre-trained model. The fine-tuning process is self-supervised by the RGB-based pose-aware consistency and the depth-guided object distance pseudo-label, which does not require the time-consuming online differentiable rendering. We build our domain adaptation method based on the recent pose estimator SC6D and evaluate it on the YCB-Video dataset. We experimentally demonstrate that our method achieves comparable performance against its fully-supervised counterpart while outperforming existing state-of-the-art approaches.
Abstract:Many computer vision applications require robust and efficient estimation of camera geometry from a minimal number of input data measurements, ie, solving minimal problems in a RANSAC framework. Minimal problems are usually formulated as complex systems of polynomial equations. Many state-of-the-art efficient polynomial solvers are based on the action matrix method that has been automated and highly optimised in recent years. In this paper we explore the theory of sparse resultants for generating minimal solvers and propose a novel approach based on a using an extra polynomial with a special form. We show that for some camera geometry problems our extra polynomial-based method leads to smaller and more stable solvers than the state-of-the-art Gr\"obner basis-based solvers. The proposed method can be fully automated and incorporated into existing tools for automatic generation of efficient polynomial solvers. It provides a competitive alternative to popular Gr\"obner basis-based methods for minimal problems in computer vision. Additionally, we study the conditions under which the minimal solvers generated by the state-of-the-art action matrix-based methods and the proposed extra polynomial resultant-based method, are equivalent. Specifically we consider a step-by-step comparison between the approaches based on the action matrix and the sparse resultant, followed by a set of substitutions, which would lead to equivalent minimal solvers.
Abstract:Various datasets have been proposed for simultaneous localization and mapping (SLAM) and related problems. Existing datasets often include small environments, have incomplete ground truth, or lack important sensor data, such as depth and infrared images. We propose an easy-to-use framework for acquiring building-scale 3D reconstruction using a consumer depth camera. Unlike complex and expensive acquisition setups, our system enables crowd-sourcing, which can greatly benefit data-hungry algorithms. Compared to similar systems, we utilize raw depth maps for odometry computation and loop closure refinement which results in better reconstructions. We acquire a building-scale 3D dataset (BS3D) and demonstrate its value by training an improved monocular depth estimation model. As a unique experiment, we benchmark visual-inertial odometry methods using both color and active infrared images.
Abstract:This paper presents an efficient symmetry-agnostic and correspondence-free framework, referred to as SC6D, for 6D object pose estimation from a single monocular RGB image. SC6D requires neither the 3D CAD model of the object nor any prior knowledge of the symmetries. The pose estimation is decomposed into three sub-tasks: a) object 3D rotation representation learning and matching; b) estimation of the 2D location of the object center; and c) scale-invariant distance estimation (the translation along the z-axis) via classification. SC6D is evaluated on three benchmark datasets, T-LESS, YCB-V, and ITODD, and results in state-of-the-art performance on the T-LESS dataset. Moreover, SC6D is computationally much more efficient than the previous state-of-the-art method SurfEmb. The implementation and pre-trained models are publicly available at https://github.com/dingdingcai/SC6D-pose.
Abstract:This paper proposes a universal framework, called OVE6D, for model-based 6D object pose estimation from a single depth image and a target object mask. Our model is trained using purely synthetic data rendered from ShapeNet, and, unlike most of the existing methods, it generalizes well on new real-world objects without any fine-tuning. We achieve this by decomposing the 6D pose into viewpoint, in-plane rotation around the camera optical axis and translation, and introducing novel lightweight modules for estimating each component in a cascaded manner. The resulting network contains less than 4M parameters while demonstrating excellent performance on the challenging T-LESS and Occluded LINEMOD datasets without any dataset-specific training. We show that OVE6D outperforms some contemporary deep learning-based pose estimation methods specifically trained for individual objects or datasets with real-world training data. The implementation and the pre-trained model will be made publicly available.