Yiwei Hu

V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties

Dec 12, 2025

Ye Fang, Tong Wu, Valentin Deschaintre, Duygu Ceylan, Iliyan Georgiev, Chun-Hao Paul Huang, Yiwei Hu, Xuelin Chen, Tuanfeng Yang Wang

Abstract:Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.

* Project Page: https://aleafy.github.io/vrgbx

HiMat: DiT-based Ultra-High Resolution SVBRDF Generation

Aug 12, 2025

Zixiong Wang, Jian Yang, Yiwei Hu, Milos Hasan, Beibei Wang

Abstract:Creating highly detailed SVBRDFs is essential for 3D content creation. The rise of high-resolution text-to-image generative models, based on diffusion transformers (DiT), suggests an opportunity to finetune them for this task. However, retargeting the models to produce multiple aligned SVBRDF maps instead of just RGB images, while achieving high efficiency and ensuring consistency across different maps, remains a challenge. In this paper, we introduce HiMat: a memory- and computation-efficient diffusion-based framework capable of generating native 4K-resolution SVBRDFs. A key challenge we address is maintaining consistency across different maps in a lightweight manner, without relying on training new VAEs or significantly altering the DiT backbone (which would damage its prior capabilities). To tackle this, we introduce the CrossStitch module, a lightweight convolutional module that captures inter-map dependencies through localized operations. Its weights are initialized such that the DiT backbone operation is unchanged before finetuning starts. HiMat enables generation with strong structural coherence and high-frequency details. Results with a large set of text prompts demonstrate the effectiveness of our approach for 4K SVBRDF generation. Further experiments suggest generalization to tasks such as intrinsic decomposition.
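
The initialization property described above (the added module leaves the backbone's output unchanged before finetuning) can be illustrated with a minimal sketch. This is not the HiMat CrossStitch implementation: it assumes a residual form `output = features + mix(features)` in which `mix` is a per-pixel linear mixing across maps with all weights starting at zero. The class name `ZeroInitCrossMapMixer` and the tensor layout are hypothetical.

```python
import numpy as np


class ZeroInitCrossMapMixer:
    """Hypothetical residual 1x1 mixing across material maps (illustration only)."""

    def __init__(self, num_maps):
        # Zero weights: the residual branch contributes nothing at the start,
        # so the module behaves as the identity before any finetuning.
        self.weight = np.zeros((num_maps, num_maps))

    def __call__(self, features):
        # features has shape (num_maps, height, width).
        # Each output map receives a weighted sum of all input maps per pixel.
        mixed = np.einsum("om,mhw->ohw", self.weight, features)
        return features + mixed


features = np.random.default_rng(0).normal(size=(4, 8, 8))
mixer = ZeroInitCrossMapMixer(num_maps=4)
unchanged = np.allclose(mixer(features), features)
```

Once training moves the weights away from zero, each map starts to receive information from the others, which is one simple way inter-map dependencies could be learned without disturbing the pretrained behavior at the start.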

WorldPrompter: Traversable Text-to-Scene Generation

Apr 02, 2025

Zhaoyang Zhang, Yannick Hold-Geoffroy, Miloš Hašan, Chen Ziwen, Fujun Luan, Julie Dorsey, Yiwei Hu

Abstract:Scene-level 3D generation is a challenging research topic, with most existing methods generating only partial scenes and offering limited navigational freedom. We introduce WorldPrompter, a novel generative pipeline for synthesizing traversable 3D scenes from text prompts. We leverage panoramic videos as an intermediate representation to model the 360° details of a scene. WorldPrompter incorporates a conditional 360° panoramic video generator, capable of producing a 128-frame video that simulates a person walking through and capturing a virtual environment. The resulting video is then reconstructed as Gaussian splats by a fast feedforward 3D reconstructor, enabling a true walkable experience within the 3D scene. Experiments demonstrate that our panoramic video generation model achieves convincing view consistency across frames, enabling high-quality panoramic Gaussian splat reconstruction and facilitating traversal over an area of the scene. Qualitative and quantitative results also show it outperforms the state-of-the-art 360° video generators and 3D scene generation models.

Turbo3D: Ultra-fast Text-to-3D Generation

Dec 05, 2024

Hanzhe Hu, Tianwei Yin, Fujun Luan, Yiwei Hu, Hao Tan, Zexiang Xu, Sai Bi, Shubham Tulsiani, Kai Zhang

Abstract:We present Turbo3D, an ultra-fast text-to-3D system capable of generating high-quality Gaussian splatting assets in under one second. Turbo3D employs a rapid 4-step, 4-view diffusion generator and an efficient feed-forward Gaussian reconstructor, both operating in latent space. The 4-step, 4-view generator is a student model distilled through a novel Dual-Teacher approach, which encourages the student to learn view consistency from a multi-view teacher and photo-realism from a single-view teacher. By shifting the Gaussian reconstructor's inputs from pixel space to latent space, we eliminate the extra image decoding time and halve the transformer sequence length for maximum efficiency. Our method demonstrates superior 3D generation results compared to previous baselines, while operating in a fraction of their runtime.

* project page: https://turbo-3d.github.io/

MaterialPicker: Multi-Modal Material Generation with Diffusion Transformers

Dec 04, 2024

Xiaohe Ma, Valentin Deschaintre, Miloš Hašan, Fujun Luan, Kun Zhou, Hongzhi Wu, Yiwei Hu

Abstract:High-quality material generation is key for virtual environment authoring and inverse rendering. We propose MaterialPicker, a multi-modal material generator leveraging a Diffusion Transformer (DiT) architecture, improving and simplifying the creation of high-quality materials from text prompts and/or photographs. Our method can generate a material based on an image crop of a material sample, even if the captured surface is distorted, viewed at an angle or partially occluded, as is often the case in photographs of natural scenes. We further allow the user to specify a text prompt to provide additional guidance for the generation. We finetune a pre-trained DiT-based video generator into a material generator, where each material map is treated as a frame in a video sequence. We evaluate our approach both quantitatively and qualitatively and show that it enables more diverse material generation and better distortion correction than previous work.
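
The idea of treating each material map as a frame in a video sequence can be sketched as a simple data-layout step. The map names (`albedo`, `normal`, `roughness`), their order, and the helper `maps_to_video` are assumptions made for illustration; the abstract does not specify which maps are used or how they are encoded.

```python
import numpy as np


def maps_to_video(maps, order):
    """Stack material maps into a (frames, height, width, 3) array.

    Single-channel maps are repeated across three channels so that every
    frame has the same shape, as a video model would expect.
    """
    frames = []
    for name in order:
        frame = np.asarray(maps[name], dtype=np.float32)
        if frame.ndim == 2:
            frame = np.repeat(frame[..., None], 3, axis=-1)
        frames.append(frame)
    return np.stack(frames, axis=0)


height, width = 16, 16
maps = {
    "albedo": np.zeros((height, width, 3)),
    "normal": np.zeros((height, width, 3)),
    "roughness": np.full((height, width), 0.5),  # single-channel map
}
video = maps_to_video(maps, order=["albedo", "normal", "roughness"])
```

Here `video` has one frame per material map, so a generator that produces a short frame sequence would, under this layout, produce one full set of aligned maps.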

Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors

Nov 26, 2024

Zhengfei Kuang, Tianyuan Zhang, Kai Zhang, Hao Tan, Sai Bi, Yiwei Hu, Zexiang Xu, Milos Hasan, Gordon Wetzstein, Fujun Luan

Abstract:We present Buffer Anytime, a framework for estimation of depth and normal maps (which we call geometric buffers) from video that eliminates the need for paired video-depth and video-normal training data. Instead of relying on large-scale annotated video datasets, we demonstrate high-quality video buffer estimation by leveraging single-image priors with temporal consistency constraints. Our zero-shot training strategy combines state-of-the-art image estimation models with optical flow smoothness through a hybrid loss function, implemented via a lightweight temporal attention architecture. Applied to leading image models like Depth Anything V2 and Marigold-E2E-FT, our approach significantly improves temporal consistency while maintaining accuracy. Experiments show that our method not only outperforms image-based approaches but also achieves results comparable to state-of-the-art video models trained on large-scale paired video datasets, despite using no such paired video data.
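
One common way to express a temporal consistency constraint based on optical flow is to warp the previous frame's prediction into the current frame and penalize the difference. The sketch below shows only that term, under several assumptions: nearest-neighbor backward warping, an L1 penalty, and a flow convention in which `flow[y, x] = (dx, dy)` points from a current-frame pixel to its source in the previous frame. It is not the paper's hybrid loss, and the function names are hypothetical.

```python
import numpy as np


def warp_with_flow(prev, flow):
    """Backward-warp prev (H, W) into the current frame using flow (H, W, 2)."""
    height, width = prev.shape
    ys, xs = np.mgrid[0:height, 0:width]
    # Round to the nearest source pixel and clamp to the image border.
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, width - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, height - 1)
    return prev[src_y, src_x]


def temporal_consistency_loss(curr, prev, flow):
    """Mean absolute difference between curr and the flow-warped prev."""
    return float(np.mean(np.abs(curr - warp_with_flow(prev, flow))))


prev = np.arange(16, dtype=np.float64).reshape(4, 4)
curr = np.roll(prev, shift=1, axis=1)  # content moved one pixel to the right

true_flow = np.zeros((4, 4, 2))
true_flow[..., 0] = -1.0  # each current pixel came from one pixel to its left
zero_flow = np.zeros((4, 4, 2))

loss_with_flow = temporal_consistency_loss(curr, prev, true_flow)
loss_without_flow = temporal_consistency_loss(curr, prev, zero_flow)
```

With the correct flow the residual comes only from the image border, where the wrapped-around column cannot be explained by the warp, so `loss_with_flow` is lower than `loss_without_flow`.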

RelitLRM: Generative Relightable Radiance for Large Reconstruction Models

Oct 10, 2024

Tianyuan Zhang, Zhengfei Kuang, Haian Jin, Zexiang Xu, Sai Bi, Hao Tan, He Zhang, Yiwei Hu, Milos Hasan, William T. Freeman (+2 more)

Abstract:We propose RelitLRM, a Large Reconstruction Model (LRM) for generating high-quality Gaussian splatting representations of 3D objects under novel illuminations from sparse (4-8) posed images captured under unknown static lighting. Unlike prior inverse rendering methods requiring dense captures and slow optimization, often causing artifacts like incorrect highlights or shadow baking, RelitLRM adopts a feed-forward transformer-based model with a novel combination of a geometry reconstructor and a relightable appearance generator based on diffusion. The model is trained end-to-end on synthetic multi-view renderings of objects under varying known illuminations. This architecture design enables the model to effectively decompose geometry and appearance, resolve the ambiguity between material and lighting, and capture the multi-modal distribution of shadows and specularity in the relit appearance. We show that our sparse-view feed-forward RelitLRM offers relighting results competitive with state-of-the-art dense-view optimization-based baselines while being significantly faster. Our project page is available at: https://relit-lrm.github.io/.

* webpage: https://relit-lrm.github.io/

TexSliders: Diffusion-Based Texture Editing in CLIP Space

May 01, 2024

Julia Guerrero-Viu, Milos Hasan, Arthur Roullier, Midhun Harikumar, Yiwei Hu, Paul Guerrero, Diego Gutierrez, Belen Masia, Valentin Deschaintre

Abstract:Generative models have enabled intuitive image creation and manipulation using natural language. In particular, diffusion models have recently shown remarkable results for natural image editing. In this work, we propose to apply diffusion techniques to edit textures, a specific class of images that are an essential part of 3D content creation pipelines. We analyze existing editing methods and show that they are not directly applicable to textures, since their common underlying approach, manipulating attention maps, is unsuitable for the texture domain. To address this, we propose a novel approach that instead manipulates CLIP image embeddings to condition the diffusion generation. We define editing directions using simple text prompts (e.g., "aged wood" to "new wood") and map these to CLIP image embedding space using a texture prior, with a sampling-based approach that gives us identity-preserving directions in CLIP space. To further improve identity preservation, we project these directions to a CLIP subspace that minimizes identity variations resulting from entangled texture attributes. Our editing pipeline facilitates the creation of arbitrary sliders using natural language prompts only, with no ground-truth annotated data necessary.
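
The notion of a slider along an editing direction in an embedding space can be sketched as follows. This is a simplified illustration, not the TexSliders pipeline: random vectors stand in for CLIP image embeddings, no CLIP model or texture prior is used, the direction is assumed to be a normalized difference of sample means, and the subspace projection step is omitted. The function names are hypothetical.

```python
import numpy as np


def editing_direction(source_embeddings, target_embeddings):
    """Unit-length difference between the mean target and mean source embedding."""
    direction = target_embeddings.mean(axis=0) - source_embeddings.mean(axis=0)
    return direction / np.linalg.norm(direction)


def apply_slider(embedding, direction, strength):
    """Move an embedding along the editing direction by the slider value."""
    return embedding + strength * direction


rng = np.random.default_rng(0)
dim = 8

# Stand-ins for embeddings sampled for two prompts (e.g. "aged wood", "new wood").
offset = np.zeros(dim)
offset[0] = 2.0
source = rng.normal(size=(32, dim))
target = source + offset  # same samples shifted along the first axis

direction = editing_direction(source, target)
texture = rng.normal(size=dim)
edited = apply_slider(texture, direction, strength=1.5)
```

In this toy setup the recovered direction is the first coordinate axis, and changing `strength` moves the embedding continuously along it, which is the behavior a slider control would expose before the embedding is passed on to condition generation.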

* SIGGRAPH 2024 Conference Proceedings

RGB$\leftrightarrow$X: Image decomposition and synthesis using material- and lighting-aware diffusion models

May 01, 2024

Zheng Zeng, Valentin Deschaintre, Iliyan Georgiev, Yannick Hold-Geoffroy, Yiwei Hu, Fujun Luan, Ling-Qi Yan, Miloš Hašan

Abstract:The three areas of realistic forward rendering, per-pixel inverse rendering, and generative image synthesis may seem like separate and unrelated sub-fields of graphics and vision. However, recent work has demonstrated improved estimation of per-pixel intrinsic channels (albedo, roughness, metallicity) based on a diffusion architecture; we call this the RGB$\rightarrow$X problem. We further show that the reverse problem of synthesizing realistic images given intrinsic channels, X$\rightarrow$RGB, can also be addressed in a diffusion framework. Focusing on the image domain of interior scenes, we introduce an improved diffusion model for RGB$\rightarrow$X, which also estimates lighting, as well as the first diffusion X$\rightarrow$RGB model capable of synthesizing realistic images from (full or partial) intrinsic channels. Our X$\rightarrow$RGB model explores a middle ground between traditional rendering and generative models: we can specify only certain appearance properties that should be followed, and give freedom to the model to hallucinate a plausible version of the rest. This flexibility makes it possible to use a mix of heterogeneous training datasets, which differ in the available channels. We use multiple existing datasets and extend them with our own synthetic and real data, resulting in a model capable of extracting scene properties better than previous work and of generating highly realistic images of interior scenes.

* SIGGRAPH Conference Papers '24, July 27-August 1, 2024, Denver, CO, USA
