https://pku-epic.github.io/MaskClustering.
Open-vocabulary 3D instance segmentation has emerged as a frontier topic due to its capability to segment 3D instances beyond a predefined set of categories. However, compared to significant progress in the 2D domain, methods for 3D open-vocabulary instance segmentation are hindered by the limited scale of high-quality annotated 3D data. To harness the capabilities of 2D models, recent efforts have focused on merging 2D masks based on metrics such as geometric and semantic similarity to form 3D instances. In contrast to these local metrics, we propose a novel metric called view consensus to better exploit multi-view observation. The key insight is that two 2D masks should be considered as belonging to the same instance if a considerable number of other 2D masks from other views contain both these two masks. Based on this metric, we build a global mask graph and iteratively cluster masks, prioritizing mask pairs with solid view consensus. The corresponding 3D points cluster of these 2D mask clusters can be regarded as 3D instances, along with the fused open-vocabulary features from clustered 2D masks. Through this multi-view verification and fusion mechanism, our method effectively leverages the prior instance knowledge from massive 2D masks predicted by visual foundation models, eliminating the need for training on 3D data. Experiments on publicly available datasets, including ScanNet200 and MatterPort3D, demonstrate that our method achieves state-of-the-art performance in both open-vocabulary instance segmentation and class-agnostic mask generation. Our project page is at