Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Eldar Insafutdinov

Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction

Mar 20, 2025

Edgar Sucar, Zihang Lai, Eldar Insafutdinov, Andrea Vedaldi

Abstract:DUSt3R has recently shown that one can reduce many tasks in multi-view geometry, including estimating camera intrinsics and extrinsics, reconstructing the scene in 3D, and establishing image correspondences, to the prediction of a pair of viewpoint-invariant point maps, i.e., pixel-aligned point clouds defined in a common reference frame. This formulation is elegant and powerful, but unable to tackle dynamic scenes. To address this challenge, we introduce the concept of Dynamic Point Maps (DPM), extending standard point maps to support 4D tasks such as motion segmentation, scene flow estimation, 3D object tracking, and 2D correspondence. Our key intuition is that, when time is introduced, there are several possible spatial and time references that can be used to define the point maps. We identify a minimal subset of such combinations that can be regressed by a network to solve the sub tasks mentioned above. We train a DPM predictor on a mixture of synthetic and real data and evaluate it across diverse benchmarks for video depth prediction, dynamic point cloud reconstruction, 3D scene flow and object pose tracking, achieving state-of-the-art performance. Code, models and additional results are available at https://www.robots.ox.ac.uk/~vgg/research/dynamic-point-maps/.

* Web page: https://www.robots.ox.ac.uk/~vgg/research/dynamic-point-maps/

Via

Access Paper or Ask Questions

SEED4D: A Synthetic Ego--Exo Dynamic 4D Data Generator, Driving Dataset and Benchmark

Dec 01, 2024

Marius Kästingschäfer, Théo Gieruc, Sebastian Bernhard, Dylan Campbell, Eldar Insafutdinov, Eyvaz Najafli, Thomas Brox

Figure 1 for SEED4D: A Synthetic Ego--Exo Dynamic 4D Data Generator, Driving Dataset and Benchmark

Figure 2 for SEED4D: A Synthetic Ego--Exo Dynamic 4D Data Generator, Driving Dataset and Benchmark

Figure 3 for SEED4D: A Synthetic Ego--Exo Dynamic 4D Data Generator, Driving Dataset and Benchmark

Figure 4 for SEED4D: A Synthetic Ego--Exo Dynamic 4D Data Generator, Driving Dataset and Benchmark

Abstract:Models for egocentric 3D and 4D reconstruction, including few-shot interpolation and extrapolation settings, can benefit from having images from exocentric viewpoints as supervision signals. No existing dataset provides the necessary mixture of complex, dynamic, and multi-view data. To facilitate the development of 3D and 4D reconstruction methods in the autonomous driving context, we propose a Synthetic Ego--Exo Dynamic 4D (SEED4D) data generator and dataset. We present a customizable, easy-to-use data generator for spatio-temporal multi-view data creation. Our open-source data generator allows the creation of synthetic data for camera setups commonly used in the NuScenes, KITTI360, and Waymo datasets. Additionally, SEED4D encompasses two large-scale multi-view synthetic urban scene datasets. Our static (3D) dataset encompasses 212k inward- and outward-facing vehicle images from 2k scenes, while our dynamic (4D) dataset contains 16.8M images from 10k trajectories, each sampled at 100 points in time with egocentric images, exocentric images, and LiDAR data. The datasets and the data generator can be found at https://seed4d.github.io/.

* WACV 2025. Project page: https://seed4d.github.io/. Code: https://github.com/continental/seed4d

Via

Access Paper or Ask Questions

Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image

Jun 06, 2024

Stanislaw Szymanowicz, Eldar Insafutdinov, Chuanxia Zheng, Dylan Campbell, João F. Henriques, Christian Rupprecht, Andrea Vedaldi

Abstract:In this paper, we propose Flash3D, a method for scene reconstruction and novel view synthesis from a single image which is both very generalisable and efficient. For generalisability, we start from a "foundation" model for monocular depth estimation and extend it to a full 3D shape and appearance reconstructor. For efficiency, we base this extension on feed-forward Gaussian Splatting. Specifically, we predict a first layer of 3D Gaussians at the predicted depth, and then add additional layers of Gaussians that are offset in space, allowing the model to complete the reconstruction behind occlusions and truncations. Flash3D is very efficient, trainable on a single GPU in a day, and thus accessible to most researchers. It achieves state-of-the-art results when trained and tested on RealEstate10k. When transferred to unseen datasets like NYU it outperforms competitors by a large margin. More impressively, when transferred to KITTI, Flash3D achieves better PSNR than methods trained specifically on that dataset. In some instances, it even outperforms recent methods that use multiple views as input. Code, models, demo, and more results are available at https://www.robots.ox.ac.uk/~vgg/research/flash3d/.

* Project page: https://www.robots.ox.ac.uk/~vgg/research/flash3d/

Via

Access Paper or Ask Questions

SNeS: Learning Probably Symmetric Neural Surfaces from Incomplete Data

Jun 13, 2022

Eldar Insafutdinov, Dylan Campbell, João F. Henriques, Andrea Vedaldi

Figure 1 for SNeS: Learning Probably Symmetric Neural Surfaces from Incomplete Data

Figure 2 for SNeS: Learning Probably Symmetric Neural Surfaces from Incomplete Data

Abstract:We present a method for the accurate 3D reconstruction of partly-symmetric objects. We build on the strengths of recent advances in neural reconstruction and rendering such as Neural Radiance Fields (NeRF). A major shortcoming of such approaches is that they fail to reconstruct any part of the object which is not clearly visible in the training image, which is often the case for in-the-wild images and videos. When evidence is lacking, structural priors such as symmetry can be used to complete the missing information. However, exploiting such priors in neural rendering is highly non-trivial: while geometry and non-reflective materials may be symmetric, shadows and reflections from the ambient scene are not symmetric in general. To address this, we apply a soft symmetry constraint to the 3D geometry and material properties, having factored appearance into lighting, albedo colour and reflectivity. We evaluate our method on the recently introduced CO3D dataset, focusing on the car category due to the challenge of reconstructing highly-reflective materials. We show that it can reconstruct unobserved regions with high fidelity and render high-quality novel view images.

* First two authors contributed equally

Via

Access Paper or Ask Questions

Lifting 2D Object Locations to 3D by Discounting LiDAR Outliers across Objects and Views

Oct 09, 2021

Robert McCraith, Eldar Insafutdinov, Lukas Neumann, Andrea Vedaldi

Figure 1 for Lifting 2D Object Locations to 3D by Discounting LiDAR Outliers across Objects and Views

Figure 2 for Lifting 2D Object Locations to 3D by Discounting LiDAR Outliers across Objects and Views

Figure 3 for Lifting 2D Object Locations to 3D by Discounting LiDAR Outliers across Objects and Views

Figure 4 for Lifting 2D Object Locations to 3D by Discounting LiDAR Outliers across Objects and Views

Abstract:We present a system for automatic converting of 2D mask object predictions and raw LiDAR point clouds into full 3D bounding boxes of objects. Because the LiDAR point clouds are partial, directly fitting bounding boxes to the point clouds is meaningless. Instead, we suggest that obtaining good results requires sharing information between \emph{all} objects in the dataset jointly, over multiple frames. We then make three improvements to the baseline. First, we address ambiguities in predicting the object rotations via direct optimization in this space while still backpropagating rotation prediction through the model. Second, we explicitly model outliers and task the network with learning their typical patterns, thus better discounting them. Third, we enforce temporal consistency when video data is available. With these contributions, our method significantly outperforms previous work despite the fact that those methods use significantly more complex pipelines, 3D models and additional human-annotated external sources of prior information.

* ICRA 2022 submission

Via

Access Paper or Ask Questions

360-Degree Textures of People in Clothing from a Single Image

Aug 20, 2019

Verica Lazova, Eldar Insafutdinov, Gerard Pons-Moll

Figure 1 for 360-Degree Textures of People in Clothing from a Single Image

Figure 2 for 360-Degree Textures of People in Clothing from a Single Image

Figure 3 for 360-Degree Textures of People in Clothing from a Single Image

Figure 4 for 360-Degree Textures of People in Clothing from a Single Image

Abstract:In this paper we predict a full 3D avatar of a person from a single image. We infer texture and geometry in the UV-space of the SMPL model using an image-to-image translation method. Given partial texture and segmentation layout maps derived from the input view, our model predicts the complete segmentation map, the complete texture map, and a displacement map. The predicted maps can be applied to the SMPL model in order to naturally generalize to novel poses, shapes, and even new clothing. In order to learn our model in a common UV-space, we non-rigidly register the SMPL model to thousands of 3D scans, effectively encoding textures and geometries as images in correspondence. This turns a difficult 3D inference task into a simpler image-to-image translation one. Results on rendered scans of people and images from the DeepFashion dataset demonstrate that our method can reconstruct plausible 3D avatars from a single image. We further use our model to digitally change pose, shape, swap garments between people and edit clothing. To encourage research in this direction we will make the source code available for research purpose.

Via

Access Paper or Ask Questions

Unsupervised Learning of Shape and Pose with Differentiable Point Clouds

Oct 22, 2018

Eldar Insafutdinov, Alexey Dosovitskiy

Figure 1 for Unsupervised Learning of Shape and Pose with Differentiable Point Clouds

Figure 2 for Unsupervised Learning of Shape and Pose with Differentiable Point Clouds

Figure 3 for Unsupervised Learning of Shape and Pose with Differentiable Point Clouds

Figure 4 for Unsupervised Learning of Shape and Pose with Differentiable Point Clouds

Abstract:We address the problem of learning accurate 3D shape and camera pose from a collection of unlabeled category-specific images. We train a convolutional network to predict both the shape and the pose from a single image by minimizing the reprojection error: given several views of an object, the projections of the predicted shapes to the predicted camera poses should match the provided views. To deal with pose ambiguity, we introduce an ensemble of pose predictors which we then distill to a single "student" model. To allow for efficient learning of high-fidelity shapes, we represent the shapes by point clouds and devise a formulation allowing for differentiable projection of these. Our experiments show that the distilled ensemble of pose predictors learns to estimate the pose accurately, while the point cloud representation allows to predict detailed shape models. The supplementary video can be found at https://www.youtube.com/watch?v=LuIGovKeo60

Via

Access Paper or Ask Questions

PoseTrack: A Benchmark for Human Pose Estimation and Tracking

Apr 10, 2018

Mykhaylo Andriluka, Umar Iqbal, Eldar Insafutdinov, Leonid Pishchulin, Anton Milan, Juergen Gall, Bernt Schiele

Figure 1 for PoseTrack: A Benchmark for Human Pose Estimation and Tracking

Figure 2 for PoseTrack: A Benchmark for Human Pose Estimation and Tracking

Figure 3 for PoseTrack: A Benchmark for Human Pose Estimation and Tracking

Figure 4 for PoseTrack: A Benchmark for Human Pose Estimation and Tracking

Abstract:Human poses and motions are important cues for analysis of videos with people and there is strong evidence that representations based on body pose are highly effective for a variety of tasks such as activity recognition, content retrieval and social signal processing. In this work, we aim to further advance the state of the art by establishing "PoseTrack", a new large-scale benchmark for video-based human pose estimation and articulated tracking, and bringing together the community of researchers working on visual human analysis. The benchmark encompasses three competition tracks focusing on i) single-frame multi-person pose estimation, ii) multi-person pose estimation in videos, and iii) multi-person articulated tracking. To facilitate the benchmark and challenge we collect, annotate and release a new %large-scale benchmark dataset that features videos with multiple people labeled with person tracks and articulated pose. A centralized evaluation server is provided to allow participants to evaluate on a held-out test set. We envision that the proposed benchmark will stimulate productive research both by providing a large and representative training dataset as well as providing a platform to objectively evaluate and compare the proposed methods. The benchmark is freely accessible at https://posetrack.net.

* www.posetrack.net

Via

Access Paper or Ask Questions

ArtTrack: Articulated Multi-person Tracking in the Wild

May 09, 2017

Eldar Insafutdinov, Mykhaylo Andriluka, Leonid Pishchulin, Siyu Tang, Evgeny Levinkov, Bjoern Andres, Bernt Schiele

Figure 1 for ArtTrack: Articulated Multi-person Tracking in the Wild

Figure 2 for ArtTrack: Articulated Multi-person Tracking in the Wild

Figure 3 for ArtTrack: Articulated Multi-person Tracking in the Wild

Figure 4 for ArtTrack: Articulated Multi-person Tracking in the Wild

Abstract:In this paper we propose an approach for articulated tracking of multiple people in unconstrained videos. Our starting point is a model that resembles existing architectures for single-frame pose estimation but is substantially faster. We achieve this in two ways: (1) by simplifying and sparsifying the body-part relationship graph and leveraging recent methods for faster inference, and (2) by offloading a substantial share of computation onto a feed-forward convolutional architecture that is able to detect and associate body joints of the same person even in clutter. We use this model to generate proposals for body joint locations and formulate articulated tracking as spatio-temporal grouping of such proposals. This allows to jointly solve the association problem for all people in the scene by propagating evidence from strong detections through time and enforcing constraints that each proposal can be assigned to one person only. We report results on a public MPII Human Pose benchmark and on a new MPII Video Pose dataset of image sequences with multiple people. We demonstrate that our model achieves state-of-the-art results while using only a fraction of time and is able to leverage temporal information to improve state-of-the-art for crowded scenes.

* Accepted to CVPR 2017

Via

Access Paper or Ask Questions

Joint Graph Decomposition and Node Labeling: Problem, Algorithms, Applications

Feb 21, 2017

Evgeny Levinkov, Jonas Uhrig, Siyu Tang, Mohamed Omran, Eldar Insafutdinov, Alexander Kirillov, Carsten Rother, Thomas Brox, Bernt Schiele, Bjoern Andres

Figure 1 for Joint Graph Decomposition and Node Labeling: Problem, Algorithms, Applications

Figure 2 for Joint Graph Decomposition and Node Labeling: Problem, Algorithms, Applications

Figure 3 for Joint Graph Decomposition and Node Labeling: Problem, Algorithms, Applications

Figure 4 for Joint Graph Decomposition and Node Labeling: Problem, Algorithms, Applications

Abstract:We state a combinatorial optimization problem whose feasible solutions define both a decomposition and a node labeling of a given graph. This problem offers a common mathematical abstraction of seemingly unrelated computer vision tasks, including instance-separating semantic segmentation, articulated human body pose estimation and multiple object tracking. Conceptually, the problem we state generalizes the unconstrained integer quadratic program and the minimum cost lifted multicut problem, both of which are NP-hard. In order to find feasible solutions efficiently, we define two local search algorithms that converge monotonously to a local optimum, offering a feasible solution at any time. To demonstrate their effectiveness in tackling computer vision tasks, we apply these algorithms to instances of the problem that we construct from published data, using published algorithms. We report state-of-the-art application-specific accuracy for the three above-mentioned applications.

Via

Access Paper or Ask Questions