Abstract: In warehouse environments, robots require robust picking capabilities to manage a wide variety of objects. Effective deployment demands minimal hardware, strong generalization to new products, and resilience in diverse settings. Current methods often rely on depth sensors for structural information, which suffer from high costs, complex setups, and technical limitations. Inspired by recent advancements in computer vision, we propose an innovative approach that leverages foundation models to enhance suction grasping using only RGB images. Trained solely on a synthetic dataset, our method generalizes its grasp prediction capabilities to real-world robots and a diverse range of novel objects not included in the training set. Our network achieves an 82.3% success rate in real-world applications. The project website with code and data will be available at http://optigrasp.github.io.
Abstract: Depth perception is essential for a robot's spatial and geometric understanding of its environment, and many tasks traditionally rely on hardware depth sensors such as RGB-D or stereo cameras. However, these sensors face practical limitations, including failures on transparent and reflective objects, high costs, calibration complexity, spatial and energy constraints, and increased failure rates in compound systems. While monocular depth estimation offers a cost-effective and simpler alternative, its adoption in robotics is limited because it produces relative rather than metric depth, which most robotics applications require. In this paper, we propose a method that utilizes a single calibrated camera, enabling the robot to act as a "measuring stick" that converts relative depth estimates into metric depth in real time as tasks are performed. Our approach employs an LSTM-based metric depth regressor, trained online and refined through probabilistic filtering, to accurately restore metric depth across the monocular depth map, particularly in regions near the robot's motion. Experiments with real robots demonstrate that our method significantly outperforms current state-of-the-art monocular metric depth estimation techniques, achieving a 22.1% reduction in depth error and a 52% increase in success rate for a downstream task.
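To make the "measuring stick" idea concrete, the sketch below shows the simplest form of relative-to-metric alignment: the robot's end effector, whose metric 3D position is known from forward kinematics, is observed at a few image pixels, and a scale and shift are fit so the relative depth map matches those anchor points. This is a simplified, hypothetical illustration of the underlying principle only, not the paper's LSTM-based regressor or probabilistic filtering; all variable names and values are made up.

```python
import numpy as np

def fit_scale_shift(rel_depth, metric_depth):
    """Least-squares fit of scale s and shift t so that
    s * rel_depth + t approximates metric_depth at the sampled pixels."""
    A = np.stack([rel_depth, np.ones_like(rel_depth)], axis=1)  # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, metric_depth, rcond=None)
    return s, t

# Hypothetical anchor points: relative depth predicted by a monocular network
# at pixels where the end effector is visible, and the corresponding metric
# depth of the end effector in the camera frame (from kinematics + calibration).
rel_d = np.array([0.42, 0.47, 0.55, 0.61])
met_d = np.array([0.80, 0.88, 1.01, 1.10])

s, t = fit_scale_shift(rel_d, met_d)
# The recovered mapping can then be applied to the full relative-depth map:
# metric_map = s * relative_map + t
```

A one-shot fit like this is brittle; the abstract's online LSTM regressor and probabilistic filtering can be read as a way to refine and stabilize this alignment over time as the robot moves.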