Yongwei Chen

SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE

Nov 25, 2024

Yongwei Chen, Yushi Lan, Shangchen Zhou, Tengfei Wang, Xingang Pan

Abstract:Autoregressive models have demonstrated remarkable success across various fields, from large language models (LLMs) to large multimodal models (LMMs) and 2D content generation, moving closer to artificial general intelligence (AGI). Despite these advances, applying autoregressive approaches to 3D object generation and understanding remains largely unexplored. This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects for efficient autoregressive generation and detailed understanding. By predicting the next scale in a multi-scale latent representation instead of the next single token, SAR3D reduces generation time significantly, achieving fast 3D object generation in just 0.82 seconds on an A6000 GPU. Additionally, given the tokens enriched with hierarchical 3D-aware information, we finetune a pretrained LLM on them, enabling multimodal comprehension of 3D content. Our experiments show that SAR3D surpasses current 3D generation methods in both speed and quality and allows LLMs to interpret and caption 3D models comprehensively.

* Project page: https://cyw-3d.github.io/projects/SAR3D/
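As a rough illustration of the tokenization idea mentioned in the SAR3D abstract, the sketch below shows plain nearest-neighbor vector quantization: each latent vector is replaced by the index of its closest codebook entry. The codebook values, the function names `quantize` and `decode`, and the 2D toy latents are assumptions made for illustration only; the multi-scale structure and next-scale prediction described in the abstract are not reproduced here.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    if not codebook:
        raise ValueError("codebook must be non-empty")
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def decode(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[Vector]:
    """Look the discrete tokens back up in the codebook."""
    return [codebook[i] for i in tokens]


# Hypothetical toy codebook and latents (not taken from the paper).
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
latents = [(0.1, -0.2), (0.9, 1.2), (3.0, 3.5)]
tokens = quantize(latents, codebook)
reconstruction = decode(tokens, codebook)
```

The discrete `tokens` are what an autoregressive model would be trained to predict; a decoder would then operate on the looked-up codebook vectors.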

MvDrag3D: Drag-based Creative 3D Editing via Multi-view Generation-Reconstruction Priors

Oct 21, 2024

Honghua Chen, Yushi Lan, Yongwei Chen, Yifan Zhou, Xingang Pan

Abstract:Drag-based editing has become popular in 2D content creation, driven by the capabilities of image generative models. However, extending this technique to 3D remains a challenge. Existing 3D drag-based editing methods, whether employing explicit spatial transformations or relying on implicit latent optimization within limited-capacity 3D generative models, fall short in handling significant topology changes or generating new textures across diverse object categories. To overcome these limitations, we introduce MVDrag3D, a novel framework for more flexible and creative drag-based 3D editing that leverages multi-view generation and reconstruction priors. At the core of our approach is the usage of a multi-view diffusion model as a strong generative prior to perform consistent drag editing over multiple rendered views, which is followed by a reconstruction model that reconstructs 3D Gaussians of the edited object. While the initial 3D Gaussians may suffer from misalignment between different views, we address this via view-specific deformation networks that adjust the position of Gaussians to be well aligned. In addition, we propose a multi-view score function that distills generative priors from multiple views to further enhance the view consistency and visual quality. Extensive experiments demonstrate that MVDrag3D provides a precise, generative, and flexible solution for 3D drag-based editing, supporting more versatile editing effects across various object categories and 3D representations.

* 16 pages, 10 figures, conference

ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance

Mar 19, 2024

Yongwei Chen, Tengfei Wang, Tong Wu, Xingang Pan, Kui Jia, Ziwei Liu

Abstract:Generating high-quality 3D assets from a given image is highly desirable in various applications such as AR/VR. Recent advances in single-image 3D generation explore feed-forward models that learn to infer the 3D model of an object without optimization. Though promising results have been achieved in single object generation, these methods often struggle to model complex 3D assets that inherently contain multiple objects. In this work, we present ComboVerse, a 3D generation framework that produces high-quality 3D assets with complex compositions by learning to combine multiple models. 1) We first perform an in-depth analysis of this "multi-object gap" from both model and data perspectives. 2) Next, with reconstructed 3D models of different objects, we seek to adjust their sizes, rotation angles, and locations to create a 3D asset that matches the given image. 3) To automate this process, we apply spatially-aware score distillation sampling (SSDS) from pretrained diffusion models to guide the positioning of objects. Our proposed framework emphasizes spatial alignment of objects, compared with standard score distillation sampling, and thus achieves more accurate results. Extensive experiments validate that ComboVerse achieves clear improvements over existing methods in generating compositional 3D assets.

* https://cyw-3d.github.io/ComboVerse/

Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation

Mar 28, 2023

Rui Chen, Yongwei Chen, Ningxin Jiao, Kui Jia

Abstract:Automatic 3D content creation has achieved rapid progress recently due to the availability of pre-trained, large language models and image diffusion models, forming the emerging topic of text-to-3D content creation. Existing text-to-3D methods commonly use implicit scene representations, which couple the geometry and appearance via volume rendering and are suboptimal in terms of recovering finer geometries and achieving photorealistic rendering; consequently, they are less effective for generating high-quality 3D assets. In this work, we propose a new method of Fantasia3D for high-quality text-to-3D content creation. Key to Fantasia3D is the disentangled modeling and learning of geometry and appearance. For geometry learning, we rely on a hybrid scene representation, and propose to encode surface normal extracted from the representation as the input of the image diffusion model. For appearance modeling, we introduce the spatially varying bidirectional reflectance distribution function (BRDF) into the text-to-3D task, and learn the surface material for photorealistic rendering of the generated surface. Our disentangled framework is more compatible with popular graphics engines, supporting relighting, editing, and physical simulation of the generated 3D assets. We conduct thorough experiments that show the advantages of our method over existing ones under different text-to-3D task settings. Project page and source codes: https://fantasia3d.github.io/.

* Project page: https://fantasia3d.github.io/

TANGO: Text-driven Photorealistic and Robust 3D Stylization via Lighting Decomposition

Oct 20, 2022

Yongwei Chen, Rui Chen, Jiabao Lei, Yabin Zhang, Kui Jia

Abstract:Creation of 3D content by stylization is a promising yet challenging problem in computer vision and graphics research. In this work, we focus on stylizing photorealistic appearance renderings of a given surface mesh of arbitrary topology. Motivated by the recent surge of cross-modal supervision of the Contrastive Language-Image Pre-training (CLIP) model, we propose TANGO, which transfers the appearance style of a given 3D shape according to a text prompt in a photorealistic manner. Technically, we propose to disentangle the appearance style as the spatially varying bidirectional reflectance distribution function, the local geometric variation, and the lighting condition, which are jointly optimized, via supervision of the CLIP loss, by a spherical Gaussians based differentiable renderer. As such, TANGO enables photorealistic 3D style transfer by automatically predicting reflectance effects even for bare, low-quality meshes, without training on a task-specific dataset. Extensive experiments show that TANGO outperforms existing methods of text-driven 3D style transfer in terms of photorealistic quality, consistency of 3D geometry, and robustness when stylizing low-quality meshes. Our codes and results are available at our project webpage https://cyw-3d.github.io/tango/.

* Accepted by NeurIPS 2022

Coarse Retinal Lesion Annotations Refinement via Prototypical Learning

Aug 30, 2022

Qinji Yu, Kang Dang, Ziyu Zhou, Yongwei Chen, Xiaowei Ding

Abstract:Deep-learning-based approaches for retinal lesion segmentation often require an abundant amount of precise pixel-wise annotated data. However, coarse annotations such as circles or ellipses for outlining the lesion area can be six times more efficient than pixel-level annotation. Therefore, this paper proposes an annotation refinement network to convert a coarse annotation into a pixel-level segmentation mask. Our main novelty is the application of the prototype learning paradigm to enhance the generalization ability across different datasets or types of lesions. We also introduce a prototype weighing module to handle challenging cases where the lesion is overly small. The proposed method was trained on the publicly available IDRiD dataset and then generalized to the public DDR and our real-world private datasets. Experiments show that our approach substantially improved the initial coarse mask and outperformed the non-prototypical baseline by a large margin. Moreover, we demonstrate the usefulness of the prototype weighing module in both cross-dataset and cross-class settings.

* MICCAI22 workshop MLMI 2022

Masked Surfel Prediction for Self-Supervised Point Cloud Learning

Jul 07, 2022

Yabin Zhang, Jiehong Lin, Chenhang He, Yongwei Chen, Kui Jia, Lei Zhang

Abstract:Masked auto-encoding is a popular and effective self-supervised learning approach to point cloud learning. However, most of the existing methods reconstruct only the masked points and overlook the local geometry information, which is also important to understand the point cloud data. In this work, we make the first attempt, to the best of our knowledge, to consider the local geometry information explicitly into the masked auto-encoding, and propose a novel Masked Surfel Prediction (MaskSurf) method. Specifically, given the input point cloud masked at a high ratio, we learn a transformer-based encoder-decoder network to estimate the underlying masked surfels by simultaneously predicting the surfel positions (i.e., points) and per-surfel orientations (i.e., normals). The predictions of points and normals are supervised by the Chamfer Distance and a newly introduced Position-Indexed Normal Distance in a set-to-set manner. Our MaskSurf is validated on six downstream tasks under three fine-tuning strategies. In particular, MaskSurf outperforms its closest competitor, Point-MAE, by 1.2% on the real-world dataset of ScanObjectNN under the OBJ-BG setting, justifying the advantages of masked surfel prediction over masked point cloud reconstruction. Codes will be available at https://github.com/YBZh/MaskSurf.

* Codes will be available at https://github.com/YBZh/MaskSurf
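The Chamfer Distance named in the MaskSurf abstract can be sketched as follows. This is a minimal pure-Python version under stated assumptions: it uses squared Euclidean distances and averages each direction over its own set, which is one common convention; the abstract does not say which variant the paper uses, and the Position-Indexed Normal Distance is not shown because the abstract does not define it. The function names are illustrative.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def squared_distance(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(xs: Sequence[Point], ys: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two non-empty point sets.

    Assumed convention: mean squared nearest-neighbor distance from xs to ys
    plus the same quantity from ys to xs.
    """
    if not xs or not ys:
        raise ValueError("both point sets must be non-empty")
    x_to_y = sum(min(squared_distance(x, y) for y in ys) for x in xs) / len(xs)
    y_to_x = sum(min(squared_distance(y, x) for x in xs) for y in ys) / len(ys)
    return x_to_y + y_to_x


# Toy point sets (hypothetical values for illustration).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0)]
loss = chamfer_distance(predicted, target)
```

The brute-force nearest-neighbor search is quadratic in the number of points, which is acceptable for a sketch but not for large point clouds.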

Quasi-Balanced Self-Training on Noise-Aware Synthesis of Object Point Clouds for Closing Domain Gap

Mar 08, 2022

Yongwei Chen, Zihao Wang, Longkun Zou, Ke Chen, Kui Jia

Abstract:Semantic analyses of object point clouds are largely driven by the release of benchmarking datasets, including synthetic ones whose instances are sampled from object CAD models. However, learning from synthetic data may not generalize to practical scenarios, where point clouds are typically incomplete, non-uniformly distributed, and noisy. Such a challenge of Simulation-to-Real (Sim2Real) domain gap could be mitigated via learning algorithms of domain adaptation; however, we argue that generation of synthetic point clouds via more physically realistic rendering is a powerful alternative, as systematic non-uniform noise patterns can be captured. To this end, we propose an integrated scheme consisting of physically realistic synthesis of object point clouds, by rendering stereo images with speckle patterns projected onto CAD models, and a novel quasi-balanced self-training designed for more balanced data distribution by sparsity-driven selection of pseudo-labeled samples for long-tailed classes. Experimental results verify the effectiveness of our method as well as both of its modules for unsupervised domain adaptation on point cloud classification, achieving state-of-the-art performance.

Universality of parametric Coupling Flows over parametric diffeomorphisms

Feb 08, 2022

Junlong Lyu, Zhitang Chen, Chang Feng, Wenjing Cun, Shengyu Zhu, Yanhui Geng, Zhijie Xu, Yongwei Chen

Abstract:Invertible neural networks based on Coupling Flows (CFlows) have various applications such as image synthesis and data compression. The approximation universality for CFlows is of paramount importance to ensure the model expressiveness. In this paper, we prove that CFlows can approximate any diffeomorphism in C^k-norm if its layers can approximate certain single-coordinate transforms. Specifically, we derive that a composition of affine coupling layers and invertible linear transforms achieves this universality. Furthermore, in parametric cases where the diffeomorphism depends on some extra parameters, we prove the corresponding approximation theorems for our proposed parametric coupling flows named Para-CFlows. In practice, we apply Para-CFlows as a neural surrogate model in contextual Bayesian optimization tasks, to demonstrate its superiority over other neural surrogate models in terms of optimization performance.

* 22 pages, 6 figures
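To make the "affine coupling layer" in the abstract above concrete, here is a minimal sketch of the standard construction: the first `d` coordinates pass through unchanged and parameterize an elementwise scale and shift of the remaining coordinates, which makes the map invertible in closed form. The functions `toy_scale` and `toy_shift` are hypothetical stand-ins for learned networks, and nothing here reflects the specific Para-CFlows architecture, which the abstract does not detail.

```python
import math
from typing import Callable, List, Sequence

Net = Callable[[Sequence[float]], List[float]]


def coupling_forward(x: Sequence[float], d: int, scale_net: Net, shift_net: Net) -> List[float]:
    """y1 = x1, y2 = x2 * exp(s(x1)) + t(x1), with x1 the first d coordinates."""
    x1, x2 = list(x[:d]), list(x[d:])
    s, t = scale_net(x1), shift_net(x1)
    y2 = [b * math.exp(si) + ti for b, si, ti in zip(x2, s, t)]
    return x1 + y2


def coupling_inverse(y: Sequence[float], d: int, scale_net: Net, shift_net: Net) -> List[float]:
    """Exact inverse: x2 = (y2 - t(y1)) * exp(-s(y1)), since y1 equals x1."""
    y1, y2 = list(y[:d]), list(y[d:])
    s, t = scale_net(y1), shift_net(y1)
    x2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(y2, s, t)]
    return y1 + x2


def toy_scale(x1: Sequence[float]) -> List[float]:
    # Hypothetical scale function; must return one value per coordinate of x2.
    return [math.tanh(sum(x1))]


def toy_shift(x1: Sequence[float]) -> List[float]:
    # Hypothetical shift function; must return one value per coordinate of x2.
    return [0.5 * sum(x1)]


x = [0.3, -1.2, 2.0]
y = coupling_forward(x, 2, toy_scale, toy_shift)
x_back = coupling_inverse(y, 2, toy_scale, toy_shift)
```

Because the scale and shift depend only on the untouched coordinates, the inverse never needs to invert `scale_net` or `shift_net` themselves, so they can be arbitrary functions.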

Deep Optimized Priors for 3D Shape Modeling and Reconstruction

Dec 14, 2020

Mingyue Yang, Yuxin Wen, Weikai Chen, Yongwei Chen, Kui Jia

Abstract:Many learning-based approaches have difficulty scaling to unseen data, as the generality of their learned priors is limited to the scale and variations of the training samples. This holds particularly true with 3D learning tasks, given the sparsity of 3D datasets available. We introduce a new learning framework for 3D modeling and reconstruction that greatly improves the generalization ability of a deep generator. Our approach strives to connect the good ends of both learning-based and optimization-based methods. In particular, unlike the common practice that fixes the pre-trained priors at test time, we propose to further optimize the learned prior and latent code according to the input physical measurements after the training. We show that the proposed strategy effectively breaks the barriers constrained by the pre-trained priors and could lead to high-quality adaptation to unseen data. We realize our framework using the implicit surface representation and validate the efficacy of our approach in a variety of challenging tasks that take highly sparse or collapsed observations as input. Experimental results show that our approach compares favorably with the state-of-the-art methods in terms of both generality and accuracy.
