Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jaesung Choe

Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration

Feb 23, 2025

Kim Jun-Seong, GeonU Kim, Kim Yu-Ji, Yu-Chiang Frank Wang, Jaesung Choe, Tae-Hyun Oh

Figure 1 for Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration

Figure 2 for Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration

Figure 3 for Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration

Figure 4 for Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration

Abstract:We introduce Dr. Splat, a novel approach for open-vocabulary 3D scene understanding leveraging 3D Gaussian Splatting. Unlike existing language-embedded 3DGS methods, which rely on a rendering process, our method directly associates language-aligned CLIP embeddings with 3D Gaussians for holistic 3D scene understanding. The key of our method is a language feature registration technique where CLIP embeddings are assigned to the dominant Gaussians intersected by each pixel-ray. Moreover, we integrate Product Quantization (PQ) trained on general large-scale image data to compactly represent embeddings without per-scene optimization. Experiments demonstrate that our approach significantly outperforms existing approaches in 3D perception benchmarks, such as open-vocabulary 3D semantic segmentation, 3D object localization, and 3D object selection tasks. For video results, please visit : https://drsplat.github.io/

* 20 pages

Via

Access Paper or Ask Questions

Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation

Feb 04, 2025

Junha Lee, Chunghyun Park, Jaesung Choe, Yu-Chiang Frank Wang, Jan Kautz, Minsu Cho, Chris Choy

Figure 1 for Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation

Figure 2 for Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation

Figure 3 for Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation

Figure 4 for Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation

Abstract:We tackle open-vocabulary 3D scene understanding by introducing a novel data generation pipeline and training framework. Our method addresses three critical requirements for effective training: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. By leveraging state-of-the-art open-vocabulary image segmentation models and region-aware Vision-Language Models, we develop an automatic pipeline that generates high-quality 3D mask-text pairs. Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of over 30K annotated scenes with 5.6M mask-text pairs, significantly larger than existing datasets. Building upon this data, we propose Mosaic3D, a foundation model combining a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. Our approach achieves state-of-the-art results on open-vocabulary 3D semantic and instance segmentation tasks including ScanNet200, Matterport3D, and ScanNet++, with ablation studies validating the effectiveness of our large-scale training data.

* project page: https://nvlabs.github.io/Mosaic3D/

Via

Access Paper or Ask Questions

Sparse Voxels Rasterization: Real-time High-fidelity Radiance Field Rendering

Dec 05, 2024

Cheng Sun, Jaesung Choe, Charles Loop, Wei-Chiu Ma, Yu-Chiang Frank Wang

Abstract:We propose an efficient radiance field rendering algorithm that incorporates a rasterization process on sparse voxels without neural networks or 3D Gaussians. There are two key contributions coupled with the proposed system. The first is to render sparse voxels in the correct depth order along pixel rays by using dynamic Morton ordering. This avoids the well-known popping artifact found in Gaussian splatting. Second, we adaptively fit sparse voxels to different levels of detail within scenes, faithfully reproducing scene details while achieving high rendering frame rates. Our method improves the previous neural-free voxel grid representation by over 4db PSNR and more than 10x rendering FPS speedup, achieving state-of-the-art comparable novel-view synthesis results. Additionally, our neural-free sparse voxels are seamlessly compatible with grid-based 3D processing algorithms. We achieve promising mesh reconstruction accuracy by integrating TSDF-Fusion and Marching Cubes into our sparse grid system.

* Code release in progress

Via

Access Paper or Ask Questions

Stable Surface Regularization for Fast Few-Shot NeRF

Mar 29, 2024

Byeongin Joung, Byeong-Uk Lee, Jaesung Choe, Ukcheol Shin, Minjun Kang, Taeyeop Lee, In So Kweon, Kuk-Jin Yoon

Abstract:This paper proposes an algorithm for synthesizing novel views under few-shot setup. The main concept is to develop a stable surface regularization technique called Annealing Signed Distance Function (ASDF), which anneals the surface in a coarse-to-fine manner to accelerate convergence speed. We observe that the Eikonal loss - which is a widely known geometric regularization - requires dense training signal to shape different level-sets of SDF, leading to low-fidelity results under few-shot training. In contrast, the proposed surface regularization successfully reconstructs scenes and produce high-fidelity geometry with stable training. Our method is further accelerated by utilizing grid representation and monocular geometric priors. Finally, the proposed approach is up to 45 times faster than existing few-shot novel view synthesis methods, and it produces comparable results in the ScanNet dataset and NeRF-Real dataset.

* 3DV 2024

Via

Access Paper or Ask Questions

MATE: Masked Autoencoders are Online 3D Test-Time Learners

Nov 24, 2022

M. Jehanzeb Mirza, Inkyu Shin, Wei Lin, Andreas Schriebl, Kunyang Sun, Jaesung Choe, Horst Possegger, Mateusz Kozinski, In So Kweon, Kun-Jin Yoon(+1 more)

Figure 1 for MATE: Masked Autoencoders are Online 3D Test-Time Learners

Figure 2 for MATE: Masked Autoencoders are Online 3D Test-Time Learners

Figure 3 for MATE: Masked Autoencoders are Online 3D Test-Time Learners

Figure 4 for MATE: Masked Autoencoders are Online 3D Test-Time Learners

Abstract:We propose MATE, the first Test-Time-Training (TTT) method designed for 3D data. It makes deep networks trained in point cloud classification robust to distribution shifts occurring in test data, which could not be anticipated during training. Like existing TTT methods, which focused on classifying 2D images in the presence of distribution shifts at test-time, MATE also leverages test data for adaptation. Its test-time objective is that of a Masked Autoencoder: Each test point cloud has a large portion of its points removed before it is fed to the network, tasked with reconstructing the full point cloud. Once the network is updated, it is used to classify the point cloud. We test MATE on several 3D object classification datasets and show that it significantly improves robustness of deep networks to several types of corruptions commonly occurring in 3D point clouds. Further, we show that MATE is very efficient in terms of the fraction of points it needs for the adaptation. It can effectively adapt given as few as 5% of tokens of each test sample, which reduces its memory footprint and makes it lightweight. We also highlight that MATE achieves competitive performance by adapting sparingly on the test data, which further reduces its computational overhead, making it ideal for real-time applications.

* Minor fix in citations

Via

Access Paper or Ask Questions

PointMixer: MLP-Mixer for Point Cloud Understanding

Nov 27, 2021

Jaesung Choe, Chunghyun Park, Francois Rameau, Jaesik Park, In So Kweon

Figure 1 for PointMixer: MLP-Mixer for Point Cloud Understanding

Figure 2 for PointMixer: MLP-Mixer for Point Cloud Understanding

Figure 3 for PointMixer: MLP-Mixer for Point Cloud Understanding

Figure 4 for PointMixer: MLP-Mixer for Point Cloud Understanding

Abstract:MLP-Mixer has newly appeared as a new challenger against the realm of CNNs and transformer. Despite its simplicity compared to transformer, the concept of channel-mixing MLPs and token-mixing MLPs achieves noticeable performance in visual recognition tasks. Unlike images, point clouds are inherently sparse, unordered and irregular, which limits the direct use of MLP-Mixer for point cloud understanding. In this paper, we propose PointMixer, a universal point set operator that facilitates information sharing among unstructured 3D points. By simply replacing token-mixing MLPs with a softmax function, PointMixer can "mix" features within/between point sets. By doing so, PointMixer can be broadly used in the network as inter-set mixing, intra-set mixing, and pyramid mixing. Extensive experiments show the competitive or superior performance of PointMixer in semantic segmentation, classification, and point reconstruction against transformer-based methods.

Via

Access Paper or Ask Questions

Facial Depth and Normal Estimation using Single Dual-Pixel Camera

Nov 25, 2021

Minjun Kang, Jaesung Choe, Hyowon Ha, Hae-Gon Jeon, Sunghoon Im, In So Kweon

Figure 1 for Facial Depth and Normal Estimation using Single Dual-Pixel Camera

Figure 2 for Facial Depth and Normal Estimation using Single Dual-Pixel Camera

Figure 3 for Facial Depth and Normal Estimation using Single Dual-Pixel Camera

Figure 4 for Facial Depth and Normal Estimation using Single Dual-Pixel Camera

Abstract:Many mobile manufacturers recently have adopted Dual-Pixel (DP) sensors in their flagship models for faster auto-focus and aesthetic image captures. Despite their advantages, research on their usage for 3D facial understanding has been limited due to the lack of datasets and algorithmic designs that exploit parallax in DP images. This is because the baseline of sub-aperture images is extremely narrow and parallax exists in the defocus blur region. In this paper, we introduce a DP-oriented Depth/Normal network that reconstructs the 3D facial geometry. For this purpose, we collect a DP facial data with more than 135K images for 101 persons captured with our multi-camera structured light systems. It contains the corresponding ground-truth 3D models including depth map and surface normal in metric scale. Our dataset allows the proposed matching network to be generalized for 3D facial depth/normal estimation. The proposed network consists of two novel modules: Adaptive Sampling Module and Adaptive Normal Module, which are specialized in handling the defocus blur in DP images. Finally, the proposed method achieves state-of-the-art performances over recent DP-based depth/normal estimation methods. We also demonstrate the applicability of the estimated depth/normal to face spoofing and relighting.

Via

Access Paper or Ask Questions

UDA-COPE: Unsupervised Domain Adaptation for Category-level Object Pose Estimation

Nov 24, 2021

Taeyeop Lee, Byeong-Uk Lee, Inkyu Shin, Jaesung Choe, Ukcheol Shin, In So Kweon, Kuk-Jin Yoon

Figure 1 for UDA-COPE: Unsupervised Domain Adaptation for Category-level Object Pose Estimation

Figure 2 for UDA-COPE: Unsupervised Domain Adaptation for Category-level Object Pose Estimation

Figure 3 for UDA-COPE: Unsupervised Domain Adaptation for Category-level Object Pose Estimation

Figure 4 for UDA-COPE: Unsupervised Domain Adaptation for Category-level Object Pose Estimation

Abstract:Learning to estimate object pose often requires ground-truth (GT) labels, such as CAD model and absolute-scale object pose, which is expensive and laborious to obtain in the real world. To tackle this problem, we propose an unsupervised domain adaptation (UDA) for category-level object pose estimation, called \textbf{UDA-COPE}. Inspired by the recent multi-modal UDA techniques, the proposed method exploits a teacher-student self-supervised learning scheme to train a pose estimation network without using target domain labels. We also introduce a bidirectional filtering method between predicted normalized object coordinate space (NOCS) map and observed point cloud, to not only make our teacher network more robust to the target domain but also to provide more reliable pseudo labels for the student network training. Extensive experimental results demonstrate the effectiveness of our proposed method both quantitatively and qualitatively. Notably, without leveraging target-domain GT labels, our proposed method achieves comparable or sometimes superior performance to existing methods that depend on the GT labels.

Via

Access Paper or Ask Questions

Deep Point Cloud Reconstruction

Nov 23, 2021

Jaesung Choe, Byeongin Joung, Francois Rameau, Jaesik Park, In So Kweon

Figure 1 for Deep Point Cloud Reconstruction

Figure 2 for Deep Point Cloud Reconstruction

Figure 3 for Deep Point Cloud Reconstruction

Figure 4 for Deep Point Cloud Reconstruction

Abstract:Point cloud obtained from 3D scanning is often sparse, noisy, and irregular. To cope with these issues, recent studies have been separately conducted to densify, denoise, and complete inaccurate point cloud. In this paper, we advocate that jointly solving these tasks leads to significant improvement for point cloud reconstruction. To this end, we propose a deep point cloud reconstruction network consisting of two stages: 1) a 3D sparse stacked-hourglass network as for the initial densification and denoising, 2) a refinement via transformers converting the discrete voxels into 3D points. In particular, we further improve the performance of transformer by a newly proposed module called amplified positional encoding. This module has been designed to differently amplify the magnitude of positional encoding vectors based on the points' distances for adaptive refinements. Extensive experiments demonstrate that our network achieves state-of-the-art performance among the recent studies in the ScanNet, ICL-NUIM, and ShapeNetPart datasets. Moreover, we underline the ability of our network to generalize toward real-world and unmet scenes.

Via

Access Paper or Ask Questions

VolumeFusion: Deep Depth Fusion for 3D Scene Reconstruction

Aug 19, 2021

Jaesung Choe, Sunghoon Im, Francois Rameau, Minjun Kang, In So Kweon

Figure 1 for VolumeFusion: Deep Depth Fusion for 3D Scene Reconstruction

Figure 2 for VolumeFusion: Deep Depth Fusion for 3D Scene Reconstruction

Figure 3 for VolumeFusion: Deep Depth Fusion for 3D Scene Reconstruction

Figure 4 for VolumeFusion: Deep Depth Fusion for 3D Scene Reconstruction

Abstract:To reconstruct a 3D scene from a set of calibrated views, traditional multi-view stereo techniques rely on two distinct stages: local depth maps computation and global depth maps fusion. Recent studies concentrate on deep neural architectures for depth estimation by using conventional depth fusion method or direct 3D reconstruction network by regressing Truncated Signed Distance Function (TSDF). In this paper, we advocate that replicating the traditional two stages framework with deep neural networks improves both the interpretability and the accuracy of the results. As mentioned, our network operates in two steps: 1) the local computation of the local depth maps with a deep MVS technique, and, 2) the depth maps and images' features fusion to build a single TSDF volume. In order to improve the matching performance between images acquired from very different viewpoints (e.g., large-baseline and rotations), we introduce a rotation-invariant 3D convolution kernel called PosedConv. The effectiveness of the proposed architecture is underlined via a large series of experiments conducted on the ScanNet dataset where our approach compares favorably against both traditional and deep learning techniques.

* ICCV 2021 Accepted

Via

Access Paper or Ask Questions