Deheng Zhang

StateSpaceDiffuser: Bringing Long Context to Diffusion World Models

May 28, 2025

Nedko Savov, Naser Kazemi, Deheng Zhang, Danda Pani Paudel, Xi Wang, Luc Van Gool

Abstract: World models have recently become promising tools for predicting realistic visuals based on actions in complex environments. However, their reliance on a short sequence of observations causes them to quickly lose track of context. As a result, visual consistency breaks down after just a few steps, and generated scenes no longer reflect information seen earlier. This limitation of state-of-the-art diffusion-based world models comes from their lack of a lasting environment state. To address this problem, we introduce StateSpaceDiffuser, in which a diffusion model is enabled to perform long-context tasks by integrating a sequence representation from a state-space model (Mamba) that represents the entire interaction history. This design restores long-term memory without sacrificing the high-fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model's ability to reinstantiate seen content in extended rollouts. Comprehensive experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline, maintaining a coherent visual context for an order of magnitude more steps. It delivers consistent views in both a 2D maze navigation environment and a complex 3D environment. These results establish that bringing state-space representations into diffusion models is highly effective in delivering both visual detail and long-term memory.

RISE-SDF: a Relightable Information-Shared Signed Distance Field for Glossy Object Inverse Rendering

Sep 30, 2024

Deheng Zhang, Jingyu Wang, Shaofei Wang, Marko Mihajlovic, Sergey Prokudin, Hendrik P. A. Lensch, Siyu Tang

Abstract: In this paper, we propose a novel end-to-end relightable neural inverse rendering system that achieves high-quality reconstruction of geometry and material properties, thus enabling high-quality relighting. The cornerstone of our method is a two-stage approach for learning a better factorization of scene parameters. In the first stage, we develop a reflection-aware radiance field using a neural signed distance field (SDF) as the geometry representation and deploy an MLP (multilayer perceptron) to estimate indirect illumination. In the second stage, we introduce a novel information-sharing network structure to jointly learn the radiance field and the physically based factorization of the scene. For the physically based factorization, to reduce the noise caused by Monte Carlo sampling, we apply a split-sum approximation with a simplified Disney BRDF and a cube mipmap as the environment light representation. In the relighting phase, to enhance the quality of indirect illumination, we propose a second split-sum algorithm to trace secondary rays under the split-sum rendering framework. Furthermore, there is no dataset or protocol available to quantitatively evaluate the inverse rendering performance for glossy objects. To assess the quality of material reconstruction and relighting, we have created a new dataset with ground truth BRDF parameters and relighting results. Our experiments demonstrate that our algorithm achieves state-of-the-art performance in inverse rendering and relighting, with particularly strong results in the reconstruction of highly reflective objects.
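For readers unfamiliar with the split-sum approximation mentioned in the abstract, the block below writes out its commonly cited general form from real-time rendering. It is an illustration only: the symbols ($L_i$ for incoming radiance, $f$ for the BRDF, $p$ for the sampling density, $N$ for the sample count) are assumptions of this sketch, and the exact formulation used in the paper may differ.

```latex
\int_{\Omega} L_i(\mathbf{l})\, f(\mathbf{l}, \mathbf{v}) \cos\theta_{\mathbf{l}} \,\mathrm{d}\mathbf{l}
\;\approx\;
\left( \frac{1}{N} \sum_{k=1}^{N} L_i(\mathbf{l}_k) \right)
\left( \frac{1}{N} \sum_{k=1}^{N} \frac{f(\mathbf{l}_k, \mathbf{v}) \cos\theta_{\mathbf{l}_k}}{p(\mathbf{l}_k, \mathbf{v})} \right)
```

The first factor depends only on the lighting and can be prefiltered into an environment map (for example, the mip levels of a cube map), while the second depends only on the material, so neither needs noisy per-pixel Monte Carlo integration at render time.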

CoARF: Controllable 3D Artistic Style Transfer for Radiance Fields

Apr 23, 2024

Deheng Zhang, Clara Fernandez-Labrador, Christopher Schroers

Abstract: Creating artistic 3D scenes can be time-consuming and requires specialized knowledge. To address this, recent works such as ARF use a radiance field-based approach with style constraints to generate 3D scenes that resemble a style image provided by the user. However, these methods lack fine-grained control over the resulting scenes. In this paper, we introduce Controllable Artistic Radiance Fields (CoARF), a novel algorithm for controllable 3D scene stylization. CoARF enables style transfer for specified objects, compositional 3D style transfer, and semantic-aware style transfer. We achieve controllability using segmentation masks with different label-dependent loss functions. We also propose a semantic-aware nearest neighbor matching algorithm to improve the style transfer quality. Our extensive experiments demonstrate that CoARF provides user-specified controllability of style transfer and superior style transfer quality with more precise feature matching.

* International Conference on 3D Vision 2024

Point-Based Radiance Fields for Controllable Human Motion Synthesis

Oct 05, 2023

Haitao Yu, Deheng Zhang, Peiyuan Xie, Tianyi Zhang

Abstract: This paper proposes a novel controllable human motion synthesis method for fine-level deformation based on static point-based radiance fields. Although previous editable neural radiance field methods can generate impressive results on novel-view synthesis and allow naive deformation, few algorithms can achieve complex 3D human editing such as forward kinematics. Our method exploits the explicit point cloud to train the static 3D scene and applies the deformation by encoding the point cloud translation using a deformation MLP. To make sure the rendering result is consistent with the canonical space training, we estimate the local rotation using SVD and interpolate the per-point rotation to the query view direction of the pre-trained radiance field. Extensive experiments show that our approach can significantly outperform the state of the art on fine-level complex deformation and can be generalized to other 3D characters besides humans.
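The abstract mentions estimating a local rotation with SVD. A standard way to do this for two matched point sets is the Kabsch algorithm, sketched below with NumPy. This is a generic illustration, not the authors' implementation: the function name `estimate_local_rotation` and the choice of neighborhood are assumptions of this sketch.

```python
import numpy as np

def estimate_local_rotation(src, dst):
    """Least-squares rotation R such that R @ (src_i - mean_src) ~= dst_i - mean_dst.

    src, dst: (N, 3) arrays of matched points, e.g. a point's neighborhood
    before and after deformation (hypothetical usage).
    """
    src_c = src - src.mean(axis=0)          # remove translation
    dst_c = dst - dst.mean(axis=0)
    H = src_c.T @ dst_c                     # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

# Usage: recover a known 30-degree rotation about the z-axis.
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
rng = np.random.default_rng(0)
src = rng.normal(size=(20, 3))
dst = src @ R_true.T + np.array([0.5, -1.0, 2.0])
R_est = estimate_local_rotation(src, dst)
```

Under the assumption that the canonical radiance field expects view directions in its own frame, a direction observed in the deformed space could be mapped back with `R_est.T @ view_dir` before querying.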

NICE-SLAM with Adaptive Feature Grids

Jun 10, 2023

Ganlin Zhang, Deheng Zhang, Feichi Lu, Anqi Li

Abstract: NICE-SLAM is a dense visual SLAM system that combines the advantages of neural implicit representations and hierarchical grid-based scene representation. However, the hierarchical grid features are densely stored, leading to memory explosion problems when adapting the framework to large scenes. In our project, we present sparse NICE-SLAM, a sparse SLAM system incorporating the idea of Voxel Hashing into the NICE-SLAM framework. Instead of initializing feature grids in the whole space, voxel features near the surface are adaptively added and optimized. Experiments demonstrate that, compared to the NICE-SLAM algorithm, our approach uses much less memory and achieves comparable reconstruction quality on the same datasets. Our implementation is available at https://github.com/zhangganlin/NICE-SLAM-with-Adaptive-Feature-Grids.

* This is a course project, not suitable for a preprint platform
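The voxel-hashing idea in the abstract above (allocate features only for voxels near observed surfaces instead of over the whole volume) can be illustrated with a hash map keyed by integer voxel coordinates. The sketch below uses a plain Python `dict` as the hash table; the class name `SparseFeatureGrid`, its methods, and the zero initialization are hypothetical and are not taken from the linked repository.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

VoxelKey = Tuple[int, int, int]

class SparseFeatureGrid:
    """Feature grid that stores only voxels that have been explicitly allocated."""

    def __init__(self, voxel_size: float, feature_dim: int) -> None:
        self.voxel_size = voxel_size
        self.feature_dim = feature_dim
        self.features: Dict[VoxelKey, List[float]] = {}  # hash table: voxel -> feature

    def key(self, point: Sequence[float]) -> VoxelKey:
        # Integer voxel coordinates; floor keeps negative coordinates consistent.
        x, y, z = (math.floor(c / self.voxel_size) for c in point)
        return (x, y, z)

    def allocate(self, surface_point: Sequence[float]) -> List[float]:
        # Create the voxel's feature on first touch; reuse it afterwards.
        return self.features.setdefault(self.key(surface_point),
                                        [0.0] * self.feature_dim)

    def lookup(self, point: Sequence[float]) -> Optional[List[float]]:
        # None means the voxel was never observed near a surface.
        return self.features.get(self.key(point))

# Usage: three surface samples fall into only two distinct voxels.
grid = SparseFeatureGrid(voxel_size=0.5, feature_dim=4)
for p in [(0.1, 0.1, 0.1), (0.2, 0.3, 0.4), (1.2, 0.0, -0.1)]:
    grid.allocate(p)
```

Memory then grows with the number of allocated voxels rather than with the volume of the scene's bounding box, which is the trade-off the abstract describes.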

Accessible Robot Control in Mixed Reality

Jun 04, 2023

Ganlin Zhang, Deheng Zhang, Longteng Duan, Guo Han

Abstract: A novel method for controlling the Boston Dynamics Spot robot with the HoloLens 2 is proposed. The method is designed mainly for people with physical disabilities: users can control the robot's movement and the robot arm without using their hands. The eye gaze tracking and head motion tracking technologies of the HoloLens 2 are used to send control commands. The movement of the robot follows the user's eye gaze, and the robot arm mimics the pose of the user's head. Our experiments show that the method is comparable to traditional joystick control in both time efficiency and user experience. A demo can be found on our project webpage: https://zhangganlin.github.io/Holo-Spot-Page/index.html

* Course Project of Mixed Reality at ETH Zurich
