Martin Rünz

Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction

May 01, 2025

Simon Giebenhain, Tobias Kirschstein, Martin Rünz, Lourdes Agapito, Matthias Nießner

Abstract: We address the 3D reconstruction of human faces from a single RGB image. To this end, we propose Pixel3DMM, a set of highly generalized vision transformers which predict per-pixel geometric cues in order to constrain the optimization of a 3D morphable face model (3DMM). We exploit the latent features of the DINO foundation model, and introduce a tailored surface normal and uv-coordinate prediction head. We train our model by registering three high-quality 3D face datasets against the FLAME mesh topology, which results in a total of over 1,000 identities and 976K images. For 3D face reconstruction, we propose a FLAME fitting optimization that solves for the 3DMM parameters from the uv-coordinate and normal estimates. To evaluate our method, we introduce a new benchmark for single-image face reconstruction, which features highly diverse facial expressions, viewing angles, and ethnicities. Crucially, our benchmark is the first to evaluate both posed and neutral facial geometry. Ultimately, our method outperforms the most competitive baselines by over 15% in terms of geometric accuracy for posed facial expressions.

* Project Website: https://simongiebenhain.github.io/pixel3dmm/ ; Video: https://www.youtube.com/watch?v=BwxwEXJwUDc

NPGA: Neural Parametric Gaussian Avatars

May 29, 2024

Simon Giebenhain, Tobias Kirschstein, Martin Rünz, Lourdes Agapito, Matthias Nießner

Abstract: The creation of high-fidelity, digital versions of human heads is an important stepping stone in the process of further integrating virtual components into our everyday lives. Constructing such avatars is a challenging research problem, due to a high demand for photo-realism and real-time rendering performance. In this work, we propose Neural Parametric Gaussian Avatars (NPGA), a data-driven approach to create high-fidelity, controllable avatars from multi-view video recordings. We build our method around 3D Gaussian Splatting for its highly efficient rendering and to inherit the topological flexibility of point clouds. In contrast to previous work, we condition our avatars' dynamics on the rich expression space of neural parametric head models (NPHM), instead of mesh-based 3DMMs. To this end, we distill the backward deformation field of our underlying NPHM into forward deformations which are compatible with rasterization-based rendering. All remaining fine-scale, expression-dependent details are learned from the multi-view videos. To increase the representational capacity of our avatars, we augment the canonical Gaussian point cloud using per-primitive latent features which govern its dynamic behavior. To regularize this increased dynamic expressivity, we propose Laplacian terms on the latent features and predicted dynamics. We evaluate our method on the public NeRSemble dataset, demonstrating that NPGA significantly outperforms the previous state-of-the-art avatars on the self-reenactment task by 2.6 PSNR. Furthermore, we demonstrate accurate animation capabilities from real-world monocular videos.

* Project Page: see https://simongiebenhain.github.io/NPGA/ ; YouTube Video: see https://www.youtube.com/watch?v=NGRxAYbIkus
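
For readers unfamiliar with the metric quoted in the NPGA abstract, the following is a minimal sketch of how PSNR is commonly computed. It assumes pixel intensities in [0, 1] and that the reported gain is in decibels; the function name `psnr` and the toy pixel values are illustrative and not taken from the paper.

```python
import math

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy data (hypothetical): every pixel is off by 0.1, so the MSE is 0.01.
reference = [0.0, 0.5, 1.0, 0.25]
rendered = [0.1, 0.4, 0.9, 0.35]
print(round(psnr(reference, rendered), 2))
```

Because PSNR is logarithmic, a gain of 2.6 dB corresponds to reducing the mean squared error by a factor of 10^0.26, roughly 1.8.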

MonoNPHM: Dynamic Head Reconstruction from Monocular Videos

Dec 11, 2023

Simon Giebenhain, Tobias Kirschstein, Markos Georgopoulos, Martin Rünz, Lourdes Agapito, Matthias Nießner

Abstract: We present Monocular Neural Parametric Head Models (MonoNPHM) for dynamic 3D head reconstructions from monocular RGB videos. To this end, we propose a latent appearance space that parameterizes a texture field on top of a neural parametric model. We constrain predicted color values to be correlated with the underlying geometry such that gradients from RGB effectively influence latent geometry codes during inverse rendering. To increase the representational capacity of our expression space, we augment our backward deformation field with hyper-dimensions, thus improving color and geometry representation in topologically challenging expressions. Using MonoNPHM as a learned prior, we approach the task of 3D head reconstruction using signed distance field based volumetric rendering. By numerically inverting our backward deformation field, we incorporate a landmark loss using facial anchor points that are closely tied to our canonical geometry representation. To evaluate the task of dynamic face reconstruction from monocular RGB videos, we record 20 challenging Kinect sequences under casual conditions. MonoNPHM outperforms all baselines by a significant margin, and makes an important step towards easily accessible neural parametric face models through RGB tracking.

* Project Page: see https://simongiebenhain.github.io/MonoNPHM/ ; Video: see https://youtu.be/n-wjaC3UIeE
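
The MonoNPHM abstract mentions numerically inverting a backward deformation field but does not say how. One simple possibility, shown here purely as an assumption, is fixed-point iteration: if the backward field maps a posed point `p` to the canonical point `p + d(p)`, then the posed location of a canonical anchor `a` satisfies `p = a - d(p)`. The toy offset function `backward_offset` below is hypothetical and chosen to be a contraction so the iteration converges.

```python
import math

def backward_offset(p):
    """Toy backward deformation offset d(p); a hypothetical stand-in for a learned field."""
    return (0.2 * math.sin(p[1]), 0.1 * math.cos(p[0]), 0.0)

def to_canonical(p):
    """Backward warp: posed point -> canonical point, p + d(p)."""
    return tuple(c + o for c, o in zip(p, backward_offset(p)))

def invert_backward(anchor, iterations=100, tol=1e-12):
    """Find the posed point whose backward warp lands on the canonical anchor."""
    p = anchor  # initial guess: no deformation
    for _ in range(iterations):
        # Fixed-point update p <- a - d(p).
        nxt = tuple(a - o for a, o in zip(anchor, backward_offset(p)))
        if max(abs(n - c) for n, c in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

anchor = (0.3, -0.4, 1.0)  # canonical anchor point (illustrative values)
posed = invert_backward(anchor)
residual = max(abs(a - b) for a, b in zip(to_canonical(posed), anchor))
print(residual < 1e-9)
```

This scheme only converges when the offset field changes slowly relative to position; a learned field may need a more robust root finder.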

HumanRF: High-Fidelity Neural Radiance Fields for Humans in Motion

May 11, 2023

Mustafa Işık, Martin Rünz, Markos Georgopoulos, Taras Khakhulin, Jonathan Starck, Lourdes Agapito, Matthias Nießner

Abstract: Representing human performance at high fidelity is an essential building block in diverse applications, such as film production, computer games or videoconferencing. To close the gap to production-level quality, we introduce HumanRF, a 4D dynamic neural scene representation that captures full-body appearance in motion from multi-view video input, and enables playback from novel, unseen viewpoints. Our novel representation acts as a dynamic video encoding that captures fine details at high compression rates by factorizing space-time into a temporal matrix-vector decomposition. This allows us to obtain temporally coherent reconstructions of human actors for long sequences, while representing high-resolution details even in the context of challenging motion. While most research focuses on synthesizing at resolutions of 4MP or lower, we address the challenge of operating at 12MP. To this end, we introduce ActorsHQ, a novel multi-view dataset that provides 12MP footage from 160 cameras for 16 sequences with high-fidelity, per-frame mesh reconstructions. We demonstrate challenges that emerge from using such high-resolution data and show that our newly introduced HumanRF effectively leverages this data, making a significant step towards production-level quality novel view synthesis.

* Project webpage: https://synthesiaresearch.github.io/humanrf ; Dataset webpage: https://www.actors-hq.com/ ; Video: https://www.youtube.com/watch?v=OTnhiLLE7io ; Code: https://github.com/synthesiaresearch/humanrf

Learning Neural Parametric Head Models

Dec 06, 2022

Simon Giebenhain, Tobias Kirschstein, Markos Georgopoulos, Martin Rünz, Lourdes Agapito, Matthias Nießner

Abstract: We propose a novel 3D morphable model for complete human heads based on hybrid neural fields. At the core of our model lies a neural parametric representation which disentangles identity and expressions in disjoint latent spaces. To this end, we capture a person's identity in a canonical space as a signed distance field (SDF), and model facial expressions with a neural deformation field. In addition, our representation achieves high-fidelity local detail by introducing an ensemble of local fields centered around facial anchor points. To facilitate generalization, we train our model on a newly captured dataset of over 2200 head scans from 124 different identities using a custom high-end 3D scanning setup. Our dataset significantly exceeds comparable existing datasets, both with respect to quality and completeness of geometry, averaging around 3.5M mesh faces per scan. Finally, we demonstrate that our approach outperforms state-of-the-art methods by a significant margin in terms of fitting error and reconstruction quality.

* Project Page: https://simongiebenhain.github.io/NPHM ; Project Video: https://www.youtube.com/watch?v=0mDk2tFOJCg
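
The split between a canonical SDF for identity and a deformation field for expression, as described in the abstract above, can be illustrated with a toy sketch. Everything here is an assumption for illustration: a unit sphere stands in for the learned identity SDF, and a uniform shift controlled by a scalar `expression` stands in for the learned backward deformation field.

```python
import math

def canonical_sdf(p, radius=1.0):
    """Signed distance to a sphere at the origin (stand-in for a learned identity SDF)."""
    return math.sqrt(sum(c * c for c in p)) - radius

def backward_deformation(p, expression):
    """Toy offset from posed space into canonical space (hypothetical):
    a uniform shift along x scaled by a scalar expression code."""
    return (-expression, 0.0, 0.0)

def posed_sdf(p, expression):
    """Evaluate the posed shape by warping the query point back into canonical space."""
    offset = backward_deformation(p, expression)
    canonical_p = tuple(c + o for c, o in zip(p, offset))
    return canonical_sdf(canonical_p)

# With expression = 0.5 the posed sphere is centered at (0.5, 0, 0).
print(posed_sdf((1.5, 0.0, 0.0), 0.5))  # query point on the posed surface
print(posed_sdf((0.5, 0.0, 0.0), 0.5))  # query point at the posed center
```

The point of the design is that the identity shape is stored once, while each expression only changes how query points are warped before the lookup.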

DSP-SLAM: Object Oriented SLAM with Deep Shape Priors

Aug 21, 2021

Jingwen Wang, Martin Rünz, Lourdes Agapito

Abstract: We propose DSP-SLAM, an object-oriented SLAM system that builds a rich and accurate joint map of dense 3D models for foreground objects, and sparse landmark points to represent the background. DSP-SLAM takes as input the 3D point cloud reconstructed by a feature-based SLAM system and equips it with the ability to enhance its sparse map with dense reconstructions of detected objects. Objects are detected via semantic instance segmentation, and their shape and pose are estimated using category-specific deep shape embeddings as priors, via a novel second-order optimization. Our object-aware bundle adjustment builds a pose-graph to jointly optimize camera poses, object locations and feature points. DSP-SLAM can operate at 10 frames per second on 3 different input modalities: monocular, stereo, or stereo+LiDAR. We demonstrate DSP-SLAM operating at almost frame rate on monocular-RGB sequences from the Freiburg and Redwood-OS datasets, and on stereo+LiDAR sequences on the KITTI odometry dataset, showing that it achieves high-quality full object reconstructions, even from partial observations, while maintaining a consistent global map. Our evaluation shows improvements in object pose and shape reconstruction with respect to recent deep prior-based reconstruction methods and reductions in camera tracking drift on the KITTI dataset.

FroDO: From Detections to 3D Objects

May 11, 2020

Kejie Li, Martin Rünz, Meng Tang, Lingni Ma, Chen Kong, Tanner Schmidt, Ian Reid, Lourdes Agapito, Julian Straub, Steven Lovegrove (+1 more)

Abstract: Object-oriented maps are important for scene understanding since they jointly capture geometry and semantics, and allow individual instantiation and meaningful reasoning about objects. We introduce FroDO, a method for accurate 3D reconstruction of object instances from RGB video that infers object location, pose and shape in a coarse-to-fine manner. Key to FroDO is to embed object shapes in a novel learnt space that allows seamless switching between sparse point cloud and dense DeepSDF decoding. Given an input sequence of localized RGB frames, FroDO first aggregates 2D detections to instantiate a category-aware 3D bounding box per object. A shape code is regressed using an encoder network before optimizing shape and pose further under the learnt shape priors using sparse and dense shape representations. The optimization uses multi-view geometric, photometric and silhouette losses. We evaluate on real-world datasets, including Pix3D, Redwood-OS, and ScanNet, for single-view, multi-view, and multi-object reconstruction.

* To be published in CVPR 2020. The first two authors contributed equally.

MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects

Oct 22, 2018

Martin Rünz, Maud Buffier, Lourdes Agapito

Abstract: We present MaskFusion, a real-time, object-aware, semantic and dynamic RGB-D SLAM system that goes beyond traditional systems which output a purely geometric map of a static scene. MaskFusion recognizes, segments and assigns semantic class labels to different objects in the scene, while tracking and reconstructing them even when they move independently from the camera. As an RGB-D camera scans a cluttered scene, image-based instance-level semantic segmentation creates semantic object masks that enable real-time object recognition and the creation of an object-level representation for the world map. Unlike previous recognition-based SLAM systems, MaskFusion does not require known models of the objects it can recognize, and can deal with multiple independent motions. MaskFusion takes full advantage of using instance-level semantic segmentation to enable semantic labels to be fused into an object-aware map, unlike recent semantics-enabled SLAM systems that perform voxel-level semantic segmentation. We show augmented-reality applications that demonstrate the unique features of the map output by MaskFusion: instance-aware, semantic and dynamic.

* Presented at IEEE International Symposium on Mixed and Augmented Reality (ISMAR) 2018

Co-Fusion: Real-time Segmentation, Tracking and Fusion of Multiple Objects

Jun 20, 2017

Martin Rünz, Lourdes Agapito

Abstract: In this paper we introduce Co-Fusion, a dense SLAM system that takes a live stream of RGB-D images as input and segments the scene into different objects (using either motion or semantic cues) while simultaneously tracking and reconstructing their 3D shape in real time. We use a multiple model fitting approach where each object can move independently from the background and still be effectively tracked and its shape fused over time using only the information from pixels associated with that object label. Previous attempts to deal with dynamic scenes have typically considered moving regions as outliers, and consequently do not model their shape or track their motion over time. In contrast, we enable the robot to maintain 3D models for each of the segmented objects and to improve them over time through fusion. As a result, our system can enable a robot to maintain a scene description at the object level which has the potential to allow interactions with its working environment, even in the case of dynamic scenes.

* International Conference on Robotics and Automation (ICRA) 2017, http://visual.cs.ucl.ac.uk/pubs/cofusion, https://github.com/martinruenz/co-fusion
