Abstract:Referring 3D segmentation is a vision-language task that segments all points of the object specified by a natural-language query from a 3D point cloud. Previous works follow a two-stage paradigm: first conducting language-agnostic instance segmentation, then matching the resulting instances against the given text query. However, in this paradigm the semantic concepts from the text query and the visual cues interact only separately during training, and both instance and semantic labels are required for each object, which is time-consuming and labor-intensive. To mitigate these issues, we propose a novel label-efficient, single-stage referring 3D segmentation pipeline, dubbed LESS, which is supervised only by efficient binary masks. Specifically, we design a Point-Word Cross-Modal Alignment module to align fine-grained point features with textual embeddings. A Query Mask Predictor module and a Query-Sentence Alignment module are introduced for coarse-grained alignment between masks and the query. Furthermore, we propose an area regularization loss, which coarsely suppresses irrelevant background predictions on a large scale. In addition, we propose a point-to-point contrastive loss that concentrates on distinguishing points with subtly similar features. Through extensive experiments, we achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU while using only binary labels.
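To make the point-to-point contrastive idea concrete, here is a minimal sketch under our own assumptions (the feature shapes, subsampling, and the InfoNCE form are illustrative, not the paper's released code): points inside the binary mask of the referred object are pulled together, while background points are pushed away.

```python
import torch
import torch.nn.functional as F

def point_contrastive_loss(feats, mask, temperature=0.1, num_samples=256):
    """Hypothetical point-to-point contrastive loss under binary-mask supervision.
    feats: (N, C) per-point features; mask: (N,) binary labels,
    1 = point of the referred object, 0 = background."""
    feats = F.normalize(feats, dim=-1)
    fg = torch.nonzero(mask == 1, as_tuple=False).squeeze(1)
    bg = torch.nonzero(mask == 0, as_tuple=False).squeeze(1)
    # Subsample points to keep the anchor and negative sets small.
    fg = fg[torch.randperm(fg.numel(), device=feats.device)[:num_samples]]
    bg = bg[torch.randperm(bg.numel(), device=feats.device)[:num_samples]]
    anchors = feats[fg]                                   # (P, C)
    negatives = feats[bg]                                 # (B, C)
    pos_sim = anchors @ anchors.t() / temperature         # (P, P)
    neg_sim = anchors @ negatives.t() / temperature       # (P, B)
    logits = torch.cat([pos_sim, neg_sim], dim=1)         # (P, P + B)
    # InfoNCE over the positives: same-mask points attract, background repels.
    log_prob = pos_sim - torch.logsumexp(logits, dim=1, keepdim=True)
    # Drop each anchor's similarity to itself from the positive set.
    off_diag = ~torch.eye(anchors.size(0), dtype=torch.bool, device=feats.device)
    return -log_prob[off_diag].mean()
```

In such a pipeline this term would act alongside the binary-mask segmentation loss and the area regularization term, sharpening the boundary between points with subtly similar features.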
Abstract:In this paper, we propose 3DSS-VLG, a weakly supervised approach for 3D Semantic Segmentation with 2D Vision-Language Guidance, in which a 3D model predicts a dense embedding for each point that is co-embedded with the aligned image and text spaces of a 2D vision-language model. Specifically, our method exploits the superior generalization ability of 2D vision-language models and introduces the Embeddings Soft-Guidance Stage, which uses them to implicitly align 3D embeddings with text embeddings. Moreover, we introduce the Embeddings Specialization Stage, which purifies the feature representation with the help of scene-level labels, supervising each feature with the corresponding text embedding. The 3D model thus receives informative supervision from both the image and text embeddings, leading to competitive segmentation performance. To the best of our knowledge, this is the first work to investigate 3D weakly supervised semantic segmentation using the textual semantic information of text category labels. Moreover, through extensive quantitative and qualitative experiments, we show that 3DSS-VLG not only achieves state-of-the-art performance on both the S3DIS and ScanNet datasets, but also maintains strong generalization capability.
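As an illustration of how scene-level (category-presence) labels can supervise point embeddings that live in a 2D vision-language space, the following sketch scores each point against CLIP text embeddings and aggregates with a max-pooling multiple-instance objective; the class names, prompts, and aggregation scheme are our assumptions, not necessarily the paper's exact design.

```python
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

class_names = ["chair", "table", "sofa", "bookshelf"]  # illustrative categories
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
with torch.no_grad():
    text_emb = F.normalize(model.encode_text(prompts).float(), dim=-1)  # (K, C)

def scene_level_loss(point_emb, scene_labels):
    """point_emb: (N, C) 3D features co-embedded with the CLIP space;
    scene_labels: (K,) binary vector of categories present in the scene."""
    point_emb = F.normalize(point_emb, dim=-1)
    logits = point_emb @ text_emb.t()            # (N, K) point-text similarity
    # A category present in the scene should have at least one highly
    # similar point, so aggregate per category with max-pooling (MIL).
    scene_logits = logits.max(dim=0).values      # (K,)
    return F.binary_cross_entropy_with_logits(scene_logits, scene_labels.float())
```

The max-pooling choice here is one common weak-supervision aggregator; softer pooling (e.g. log-sum-exp) would be a drop-in alternative.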
Abstract:Learning to ground natural language queries to target objects or regions in 3D point clouds is essential for 3D scene understanding. Nevertheless, existing 3D visual grounding approaches require a substantial number of bounding-box annotations for text queries, which are time-consuming and labor-intensive to obtain. In this paper, we propose \textbf{3D-VLA}, a weakly supervised approach for \textbf{3D} visual grounding based on \textbf{V}isual \textbf{L}inguistic \textbf{A}lignment. Our 3D-VLA exploits the superior ability of current large-scale vision-language models (VLMs) to align the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds without requiring fine-grained box annotations during training. During inference, the learned text-3D correspondence helps ground text queries to the 3D target objects even without 2D images. To the best of our knowledge, this is the first work to investigate 3D visual grounding in a weakly supervised manner by leveraging large-scale vision-language models, and extensive experiments on the ReferIt3D and ScanRefer datasets demonstrate that our 3D-VLA achieves results comparable to, and even surpassing, those of fully supervised methods.
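A hedged sketch of the core mechanism, with names of our own invention: during training, 3D proposal embeddings are aligned with VLM embeddings of their corresponding 2D image crops (obtainable through camera poses), so that at inference a query can be grounded by cosine similarity in the shared space, with no 2D images required.

```python
import torch
import torch.nn.functional as F

def alignment_loss(obj_emb_3d, img_emb_2d, temperature=0.07):
    """Training: obj_emb_3d (M, C) are 3D proposal embeddings; img_emb_2d (M, C)
    are VLM embeddings of the matching 2D crops (row i pairs with proposal i)."""
    z3d = F.normalize(obj_emb_3d, dim=-1)
    z2d = F.normalize(img_emb_2d, dim=-1)
    logits = z3d @ z2d.t() / temperature            # (M, M) similarity matrix
    targets = torch.arange(z3d.size(0), device=z3d.device)
    # Symmetric InfoNCE: matched 3D-2D pairs are positives, the rest negatives.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def ground_query(obj_emb_3d, text_emb):
    """Inference: pick the candidate 3D object most similar to the query
    sentence embedding text_emb (C,) from the VLM text encoder."""
    scores = F.normalize(obj_emb_3d, dim=-1) @ F.normalize(text_emb, dim=-1)
    return scores.argmax().item()                   # index of the grounded object
```

Because the 3D encoder is trained into the VLM's joint image-text space, the text encoder's output can score 3D proposals directly, which is what removes the need for 2D images at test time.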