Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jiaxiang Tang

SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation

Mar 12, 2026

Jun Luo, Jiaxiang Tang, Ruijie Lu, Gang Zeng

Abstract:Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages modern 3D object generation model along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results demonstrate that our method can generate diverse, open-vocabulary, and high-quality 3D scenes. Both qualitative analysis and quantitative human evaluations demonstrate the superiority of our approach over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes based on natural language commands. Our code is available at https://github.com/ROUJINN/SceneAssistant

* Code: https://github.com/ROUJINN/SceneAssistant

Via

Access Paper or Ask Questions

Efficient Part-level 3D Object Generation via Dual Volume Packing

Jun 11, 2025

Jiaxiang Tang, Ruijie Lu, Zhaoshuo Li, Zekun Hao, Xuan Li, Fangyin Wei, Shuran Song, Gang Zeng, Ming-Yu Liu, Tsung-Yi Lin

Abstract:Recent progress in 3D object generation has greatly improved both the quality and efficiency. However, most existing methods generate a single mesh with all parts fused together, which limits the ability to edit or manipulate individual parts. A key challenge is that different objects may have a varying number of parts. To address this, we propose a new end-to-end framework for part-level 3D object generation. Given a single input image, our method generates high-quality 3D objects with an arbitrary number of complete and semantically meaningful parts. We introduce a dual volume packing strategy that organizes all parts into two complementary volumes, allowing for the creation of complete and interleaved parts that assemble into the final object. Experiments show that our model achieves better quality, diversity, and generalization than previous image-based part-level generation methods.

* Code: https://github.com/NVlabs/PartPacker Project Page: https://research.nvidia.com/labs/dir/partpacker/

Via

Access Paper or Ask Questions

TACO: Taming Diffusion for in-the-wild Video Amodal Completion

Mar 15, 2025

Ruijie Lu, Yixin Chen, Yu Liu, Jiaxiang Tang, Junfeng Ni, Diwen Wan, Gang Zeng, Siyuan Huang

Abstract:Humans can infer complete shapes and appearances of objects from limited visual cues, relying on extensive prior knowledge of the physical world. However, completing partially observable objects while ensuring consistency across video frames remains challenging for existing models, especially for unstructured, in-the-wild videos. This paper tackles the task of Video Amodal Completion (VAC), which aims to generate the complete object consistently throughout the video given a visual prompt specifying the object of interest. Leveraging the rich, consistent manifolds learned by pre-trained video diffusion models, we propose a conditional diffusion model, TACO, that repurposes these manifolds for VAC. To enable its effective and robust generalization to challenging in-the-wild scenarios, we curate a large-scale synthetic dataset with multiple difficulty levels by systematically imposing occlusions onto un-occluded videos. Building on this, we devise a progressive fine-tuning paradigm that starts with simpler recovery tasks and gradually advances to more complex ones. We demonstrate TACO's versatility on a wide range of in-the-wild videos from Internet, as well as on diverse, unseen datasets commonly used in autonomous driving, robotic manipulation, and scene understanding. Moreover, we show that TACO can be effectively applied to various downstream tasks like object reconstruction and pose estimation, highlighting its potential to facilitate physical world understanding and reasoning. Our project page is available at https://jason-aplp.github.io/TACO.

* Project page: https://jason-aplp.github.io/TACO

Via

Access Paper or Ask Questions

Robust Multiple Description Neural Video Codec with Masked Transformer for Dynamic and Noisy Networks

Dec 10, 2024

Xinyue Hu, Wei Ye, Jiaxiang Tang, Eman Ramadan, Zhi-Li Zhang

Abstract:Multiple Description Coding (MDC) is a promising error-resilient source coding method that is particularly suitable for dynamic networks with multiple (yet noisy and unreliable) paths. However, conventional MDC video codecs suffer from cumbersome architectures, poor scalability, limited loss resilience, and lower compression efficiency. As a result, MDC has never been widely adopted. Inspired by the potential of neural video codecs, this paper rethinks MDC design. We propose a novel MDC video codec, NeuralMDC, demonstrating how bidirectional transformers trained for masked token prediction can vastly simplify the design of MDC video codec. To compress a video, NeuralMDC starts by tokenizing each frame into its latent representation and then splits the latent tokens to create multiple descriptions containing correlated information. Instead of using motion prediction and warping operations, NeuralMDC trains a bidirectional masked transformer to model the spatial-temporal dependencies of latent representations and predict the distribution of the current representation based on the past. The predicted distribution is used to independently entropy code each description and infer any potentially lost tokens. Extensive experiments demonstrate NeuralMDC achieves state-of-the-art loss resilience with minimal sacrifices in compression efficiency, significantly outperforming the best existing residual-coding-based error-resilient neural video codec.

* 10 pages

Via

Access Paper or Ask Questions

EdgeRunner: Auto-regressive Auto-encoder for Artistic Mesh Generation

Sep 26, 2024

Jiaxiang Tang, Zhaoshuo Li, Zekun Hao, Xian Liu, Gang Zeng, Ming-Yu Liu, Qinsheng Zhang

Figure 1 for EdgeRunner: Auto-regressive Auto-encoder for Artistic Mesh Generation

Figure 2 for EdgeRunner: Auto-regressive Auto-encoder for Artistic Mesh Generation

Figure 3 for EdgeRunner: Auto-regressive Auto-encoder for Artistic Mesh Generation

Figure 4 for EdgeRunner: Auto-regressive Auto-encoder for Artistic Mesh Generation

Abstract:Current auto-regressive mesh generation methods suffer from issues such as incompleteness, insufficient detail, and poor generalization. In this paper, we propose an Auto-regressive Auto-encoder (ArAE) model capable of generating high-quality 3D meshes with up to 4,000 faces at a spatial resolution of $512^3$. We introduce a novel mesh tokenization algorithm that efficiently compresses triangular meshes into 1D token sequences, significantly enhancing training efficiency. Furthermore, our model compresses variable-length triangular meshes into a fixed-length latent space, enabling training latent diffusion models for better generalization. Extensive experiments demonstrate the superior quality, diversity, and generalization capabilities of our model in both point cloud and image-conditioned mesh generation tasks.

* Project Page: https://research.nvidia.com/labs/dir/edgerunner/

Via

Access Paper or Ask Questions

MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers

Jun 14, 2024

Yiwen Chen, Tong He, Di Huang, Weicai Ye, Sijin Chen, Jiaxiang Tang, Xin Chen, Zhongang Cai, Lei Yang, Gang Yu(+2 more)

Figure 1 for MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers

Figure 2 for MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers

Figure 3 for MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers

Figure 4 for MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers

Abstract:Recently, 3D assets created via reconstruction and generation have matched the quality of manually crafted assets, highlighting their potential for replacement. However, this potential is largely unrealized because these assets always need to be converted to meshes for 3D industry applications, and the meshes produced by current mesh extraction methods are significantly inferior to Artist-Created Meshes (AMs), i.e., meshes created by human artists. Specifically, current mesh extraction methods rely on dense faces and ignore geometric features, leading to inefficiencies, complicated post-processing, and lower representation quality. To address these issues, we introduce MeshAnything, a model that treats mesh extraction as a generation problem, producing AMs aligned with specified shapes. By converting 3D assets in any 3D representation into AMs, MeshAnything can be integrated with various 3D asset production methods, thereby enhancing their application across the 3D industry. The architecture of MeshAnything comprises a VQ-VAE and a shape-conditioned decoder-only transformer. We first learn a mesh vocabulary using the VQ-VAE, then train the shape-conditioned decoder-only transformer on this vocabulary for shape-conditioned autoregressive mesh generation. Our extensive experiments show that our method generates AMs with hundreds of times fewer faces, significantly improving storage, rendering, and simulation efficiencies, while achieving precision comparable to previous methods.

* Project Page: https://buaacyw.github.io/mesh-anything/ Code: https://github.com/buaacyw/MeshAnything

Via

Access Paper or Ask Questions

InTeX: Interactive Text-to-texture Synthesis via Unified Depth-aware Inpainting

Mar 18, 2024

Jiaxiang Tang, Ruijie Lu, Xiaokang Chen, Xiang Wen, Gang Zeng, Ziwei Liu

Abstract:Text-to-texture synthesis has become a new frontier in 3D content creation thanks to the recent advances in text-to-image models. Existing methods primarily adopt a combination of pretrained depth-aware diffusion and inpainting models, yet they exhibit shortcomings such as 3D inconsistency and limited controllability. To address these challenges, we introduce InteX, a novel framework for interactive text-to-texture synthesis. 1) InteX includes a user-friendly interface that facilitates interaction and control throughout the synthesis process, enabling region-specific repainting and precise texture editing. 2) Additionally, we develop a unified depth-aware inpainting model that integrates depth information with inpainting cues, effectively mitigating 3D inconsistencies and improving generation speed. Through extensive experiments, our framework has proven to be both practical and effective in text-to-texture synthesis, paving the way for high-quality 3D content creation.

* Project Page: https://me.kiui.moe/intex/

Via

Access Paper or Ask Questions

3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors

Mar 04, 2024

Fangzhou Hong, Jiaxiang Tang, Ziang Cao, Min Shi, Tong Wu, Zhaoxi Chen, Tengfei Wang, Liang Pan, Dahua Lin, Ziwei Liu

Figure 1 for 3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors

Figure 2 for 3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors

Figure 3 for 3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors

Figure 4 for 3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors

Abstract:We present a two-stage text-to-3D generation system, namely 3DTopia, which generates high-quality general 3D assets within 5 minutes using hybrid diffusion priors. The first stage samples from a 3D diffusion prior directly learned from 3D data. Specifically, it is powered by a text-conditioned tri-plane latent diffusion model, which quickly generates coarse 3D samples for fast prototyping. The second stage utilizes 2D diffusion priors to further refine the texture of coarse 3D models from the first stage. The refinement consists of both latent and pixel space optimization for high-quality texture generation. To facilitate the training of the proposed system, we clean and caption the largest open-source 3D dataset, Objaverse, by combining the power of vision language models and large language models. Experiment results are reported qualitatively and quantitatively to show the performance of the proposed system. Our codes and models are available at https://github.com/3DTopia/3DTopia

* Code available at https://github.com/3DTopia/3DTopia

Via

Access Paper or Ask Questions

LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation

Feb 07, 2024

Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, Ziwei Liu

Abstract:3D content creation has achieved significant progress in terms of both quality and speed. Although current feed-forward models can produce 3D objects in seconds, their resolution is constrained by the intensive computation required during training. In this paper, we introduce Large Multi-View Gaussian Model (LGM), a novel framework designed to generate high-resolution 3D models from text prompts or single-view images. Our key insights are two-fold: 1) 3D Representation: We propose multi-view Gaussian features as an efficient yet powerful representation, which can then be fused together for differentiable rendering. 2) 3D Backbone: We present an asymmetric U-Net as a high-throughput backbone operating on multi-view images, which can be produced from text or single-view image input by leveraging multi-view diffusion models. Extensive experiments demonstrate the high fidelity and efficiency of our approach. Notably, we maintain the fast speed to generate 3D objects within 5 seconds while boosting the training resolution to 512, thereby achieving high-resolution 3D content generation.

* Project page: https://me.kiui.moe/lgm/

Via

Access Paper or Ask Questions

DreamGaussian4D: Generative 4D Gaussian Splatting

Dec 29, 2023

Jiawei Ren, Liang Pan, Jiaxiang Tang, Chi Zhang, Ang Cao, Gang Zeng, Ziwei Liu

Abstract:Remarkable progress has been made in 4D content generation recently. However, existing methods suffer from long optimization time, lack of motion controllability, and a low level of detail. In this paper, we introduce DreamGaussian4D, an efficient 4D generation framework that builds on 4D Gaussian Splatting representation. Our key insight is that the explicit modeling of spatial transformations in Gaussian Splatting makes it more suitable for the 4D generation setting compared with implicit representations. DreamGaussian4D reduces the optimization time from several hours to just a few minutes, allows flexible control of the generated 3D motion, and produces animated meshes that can be efficiently rendered in 3D engines.

* Technical report. Project page is at https://jiawei-ren.github.io/projects/dreamgaussian4d Code is at https://github.com/jiawei-ren/dreamgaussian4d

Via

Access Paper or Ask Questions