Abstract:Foundation models pre-trained on web-scale data are shown to encapsulate extensive world knowledge beneficial for robotic manipulation in the form of task planning. However, the actual physical implementation of these plans often relies on task-specific learning methods, which require significant data collection and struggle with generalizability. In this work, we introduce Robotic Manipulation through Spatial Constraints of Parts (CoPa), a novel framework that leverages the common sense knowledge embedded within foundation models to generate a sequence of 6-DoF end-effector poses for open-world robotic manipulation. Specifically, we decompose the manipulation process into two phases: task-oriented grasping and task-aware motion planning. In the task-oriented grasping phase, we employ foundation vision-language models (VLMs) to select the object's grasping part through a novel coarse-to-fine grounding mechanism. During the task-aware motion planning phase, VLMs are utilized again to identify the spatial geometry constraints of task-relevant object parts, which are then used to derive post-grasp poses. We also demonstrate how CoPa can be seamlessly integrated with existing robotic planning algorithms to accomplish complex, long-horizon tasks. Our comprehensive real-world experiments show that CoPa possesses a fine-grained physical understanding of scenes, capable of handling open-set instructions and objects with minimal prompt engineering and without additional training. Project page: https://copa-2024.github.io/
Abstract:Image harmonization targets at adjusting the foreground in a composite image to make it compatible with the background, producing a more realistic and harmonious image. Training deep image harmonization network requires abundant training data, but it is extremely difficult to acquire training pairs of composite images and ground-truth harmonious images. Therefore, existing works turn to adjust the foreground appearance in a real image to create a synthetic composite image. However, such adjustment may not faithfully reflect the natural illumination change of foreground. In this work, we explore a novel transitive way to construct image harmonization dataset. Specifically, based on the existing datasets with recorded illumination information, we first convert the foreground in a real image to the standard illumination condition, and then convert it to another illumination condition, which is combined with the original background to form a synthetic composite image. In this manner, we construct an image harmonization dataset called ccHarmony, which is named after color checker (cc). The dataset is available at https://github.com/bcmi/Image-Harmonization-Dataset-ccHarmony.