Jan Eric Lenssen

Spatial Reasoning with Denoising Models

Feb 28, 2025

Christopher Wewer, Bart Pogodzinski, Bernt Schiele, Jan Eric Lenssen

Abstract: We introduce Spatial Reasoning Models (SRMs), a framework to perform reasoning over sets of continuous variables via denoising generative models. SRMs infer continuous representations on a set of unobserved variables, given observations on observed variables. Current generative models on spatial domains, such as diffusion and flow matching models, often collapse to hallucination when the distributions are complex. To measure this, we introduce a set of benchmark tasks that test the quality of complex reasoning in generative models and can quantify hallucination. The SRM framework allows us to report key findings about the importance of sequentialization in generation, the associated order, and the sampling strategies during training. It demonstrates, for the first time, that the order of generation can successfully be predicted by the denoising network itself. Using these findings, we can increase the accuracy of specific reasoning tasks from <1% to >50%.

* Project website: https://geometric-rl.mpi-inf.mpg.de/srm/

MEt3R: Measuring Multi-View Consistency in Generated Images

Jan 10, 2025

Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, Jan Eric Lenssen

Abstract: We introduce MEt3R, a metric for multi-view consistency in generated images. Large-scale generative models for multi-view image generation are rapidly advancing the field of 3D inference from sparse observations. However, due to the nature of generative modeling, traditional reconstruction metrics are not suitable for measuring the quality of generated outputs, and metrics that are independent of the sampling procedure are desperately needed. In this work, we specifically address the aspect of consistency between generated multi-view images, which can be evaluated independently of the specific scene. Our approach uses DUSt3R to obtain dense 3D reconstructions from image pairs in a feed-forward manner, which are used to warp image contents from one view into the other. Then, feature maps of these images are compared to obtain a similarity score that is invariant to view-dependent effects. Using MEt3R, we evaluate the consistency of a large set of previous methods for novel view and video generation, including our open, multi-view latent diffusion model.

* Project website: https://geometric-rl.mpi-inf.mpg.de/met3r/
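
The feature comparison step described in the MEt3R abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the DUSt3R reconstruction and warping are not reproduced, the per-pixel feature vectors are made-up inputs, and the function names, the validity mask, and the choice of mean cosine similarity are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def feature_similarity(warped_feats, target_feats, valid_mask):
    """Mean cosine similarity over pixels where the warp is valid.

    warped_feats / target_feats: one feature vector per pixel (hypothetical).
    valid_mask: True where the warped view overlaps the target view
    (hypothetical; stands in for whatever visibility test a real
    pipeline would use).
    """
    sims = [
        cosine(w, t)
        for w, t, ok in zip(warped_feats, target_feats, valid_mask)
        if ok
    ]
    return sum(sims) / len(sims) if sims else 0.0


# Toy data: three pixels with 2-D features; the third pixel is masked out.
warped = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [[1.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
mask = [True, True, False]
score = feature_similarity(warped, target, mask)
```

In this toy case the first pixel matches exactly and the second is orthogonal, so the two valid pixels average to a score of 0.5.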

PersonaHOI: Effortlessly Improving Personalized Face with Human-Object Interaction Generation

Jan 10, 2025

Xinting Hu, Haoran Wang, Jan Eric Lenssen, Bernt Schiele

Abstract: We introduce PersonaHOI, a training- and tuning-free framework that fuses a general StableDiffusion model with a personalized face diffusion (PFD) model to generate identity-consistent human-object interaction (HOI) images. While existing PFD models have advanced significantly, they often overemphasize facial features at the expense of full-body coherence. PersonaHOI therefore introduces an additional StableDiffusion (SD) branch guided by HOI-oriented text inputs. By incorporating cross-attention constraints in the PFD branch and spatial merging at both latent and residual levels, PersonaHOI preserves personalized facial details while ensuring interactive non-facial regions. Experiments, validated by a novel interaction alignment metric, demonstrate the superior realism and scalability of PersonaHOI, establishing a new standard for practical personalized face with HOI generation. Our code will be available at https://github.com/JoyHuYY1412/PersonaHOI

SLayR: Scene Layout Generation with Rectified Flow

Dec 06, 2024

Cameron Braunstein, Hevra Petekkaya, Jan Eric Lenssen, Mariya Toneva, Eddy Ilg

Abstract: We introduce SLayR, Scene Layout Generation with Rectified flow. State-of-the-art text-to-image models achieve impressive results. However, they generate images end-to-end, exposing no fine-grained control over the process. SLayR presents a novel transformer-based rectified flow model for layout generation over a token space that can be decoded into bounding boxes and corresponding labels, which can then be transformed into images using existing models. We show that established metrics for generated images are inconclusive for evaluating their underlying scene layout, and introduce a new benchmark suite, including a carefully designed repeatable human-evaluation procedure that assesses the plausibility and variety of generated layouts. In contrast to previous works, which perform well in either variety or plausibility, we show that our approach performs well on both of these axes at the same time. It is also at least 5x smaller in the number of parameters and 37% faster than the baselines. Our complete text-to-image pipeline demonstrates the added benefits of an interpretable and editable intermediate representation.

* 34 pages, 29 figures, 5 tables

ContextGNN: Beyond Two-Tower Recommendation Systems

Nov 29, 2024

Yiwen Yuan, Zecheng Zhang, Xinwei He, Akihiro Nitta, Weihua Hu, Dong Wang, Manan Shah, Shenyang Huang, Blaž Stojanovič, Alan Krumholz(+3 more)

Abstract: Recommendation systems predominantly utilize two-tower architectures, which evaluate user-item rankings through the inner product of their respective embeddings. However, one key limitation of two-tower models is that they learn a pair-agnostic representation of users and items. In contrast, pair-wise representations either scale poorly due to their quadratic complexity or are too restrictive on the candidate pairs to rank. To address these issues, we introduce Context-based Graph Neural Networks (ContextGNNs), a novel deep learning architecture for link prediction in recommendation systems. The method employs a pair-wise representation technique for familiar items situated within a user's local subgraph, while leveraging two-tower representations to facilitate the recommendation of exploratory items. A final network then predicts how to fuse both pair-wise and two-tower recommendations into a single ranking of items. We demonstrate that ContextGNN is able to adapt to different data characteristics and outperforms existing methods, both traditional and GNN-based, on a diverse set of practical recommendation tasks, improving performance by 20% on average.

* 14 pages, 1 figure, 5 tables
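
The scoring scheme the ContextGNN abstract describes (inner-product two-tower scores for all items, pair-wise scores for items in the user's local subgraph, then a fusion into one ranking) can be sketched roughly as follows. This is a toy illustration, not the paper's architecture: the embeddings and pair-wise scores are made-up numbers, and `fusion_offset` is a hypothetical scalar standing in for the learned fusion network mentioned in the abstract.

```python
def two_tower_scores(user_vec, item_vecs):
    """Inner product of one user embedding with every item embedding."""
    return [sum(u * v for u, v in zip(user_vec, item)) for item in item_vecs]


def fuse_scores(tower_scores, pairwise_scores, fusion_offset=0.0):
    """Merge both score sources and return item indices, best first.

    pairwise_scores maps item index -> pair-wise score and only covers
    items inside the user's local subgraph ("familiar" items). All other
    items fall back to their two-tower score plus fusion_offset, a
    hypothetical stand-in for a learned fusion step.
    """
    fused = []
    for idx, tower in enumerate(tower_scores):
        if idx in pairwise_scores:
            fused.append(pairwise_scores[idx])
        else:
            fused.append(tower + fusion_offset)
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


# Toy data: one user, four items, item 2 lies in the user's local subgraph.
user = [1.0, 0.5]
items = [[0.2, 0.1], [0.9, 0.4], [0.1, 0.9], [0.5, 0.5]]
tower = two_tower_scores(user, items)
ranking = fuse_scores(tower, {2: 2.0}, fusion_offset=0.0)
```

Here item 2 ranks first because of its pair-wise score, while the remaining items keep their two-tower order. A larger `fusion_offset` would push exploratory items above familiar ones, which is the trade-off the abstract attributes to the final network.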

TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Oct 30, 2024

Haiyang Wang, Yue Fan, Muhammad Ferjad Naeem, Yongqin Xian, Jan Eric Lenssen, Liwei Wang, Federico Tombari, Bernt Schiele

Abstract: Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily from their dependence on a fixed number of parameters within linear projections. When architectural modifications (e.g., channel dimensions) are introduced, the entire model typically requires retraining from scratch. As model sizes continue growing, this strategy results in increasingly high computational costs and becomes unsustainable. To overcome this problem, we introduce TokenFormer, a natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This reformulation allows for progressive and efficient scaling without necessitating retraining from scratch. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are available at https://github.com/Haiyang-W/TokenFormer.
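
The token-parameter attention idea (input tokens as queries, model parameters as keys and values) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the released implementation: it uses a plain softmax, and the abstract does not specify the normalization or the initialization of newly added parameter pairs, so both are assumptions here. All names and numbers are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def token_parameter_attention(tokens, key_params, value_params):
    """Each input token (query) attends over learnable parameter tokens.

    key_params[i] and value_params[i] form one key-value parameter pair.
    The output width equals the width of the value parameters, so adding
    more pairs grows the parameter count without changing output shape.
    """
    out_dim = len(value_params[0])
    outputs = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) for k in key_params]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, value_params))
            for d in range(out_dim)
        ])
    return outputs


# Two input tokens (dim 2), two key-value parameter pairs, output dim 3.
tokens = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
out_small = token_parameter_attention(tokens, keys, values)

# "Scaling up": append one more key-value pair; input/output shapes stay fixed.
keys_big = keys + [[0.5, 0.5]]
values_big = values + [[0.0, 0.0, 0.0]]
out_big = token_parameter_attention(tokens, keys_big, values_big)
```

The point of the sketch is the shape behavior: the layer accepts the same tokens and returns the same output width before and after parameter pairs are appended. With the plain softmax used here the output values do shift when a pair is added, so this sketch does not show how existing behavior would be preserved during scaling.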

Spurfies: Sparse Surface Reconstruction using Local Geometry Priors

Aug 29, 2024

Kevin Raj, Christopher Wewer, Raza Yunus, Eddy Ilg, Jan Eric Lenssen

Abstract: We introduce Spurfies, a novel method for sparse-view surface reconstruction that disentangles appearance and geometry information to utilize local geometry priors trained on synthetic data. Recent research heavily focuses on 3D reconstruction using dense multi-view setups, typically requiring hundreds of images. However, these methods often struggle with few-view scenarios. Existing sparse-view reconstruction techniques often rely on multi-view stereo networks that need to learn joint priors for geometry and appearance from a large amount of data. In contrast, we introduce a neural point representation that disentangles geometry and appearance to train a local geometry prior using only a subset of the synthetic ShapeNet dataset. During inference, we utilize this surface prior as an additional constraint for surface and appearance reconstruction from sparse input views via differentiable volume rendering, restricting the space of possible solutions. We validate the effectiveness of our method on the DTU dataset and demonstrate that it outperforms the previous state of the art by 35% in surface quality while achieving competitive novel view synthesis quality. Moreover, in contrast to previous works, our method can be applied to larger, unbounded scenes, such as those in Mip-NeRF 360.

* Project website: https://geometric-rl.mpi-inf.mpg.de/spurfies/

InterTrack: Tracking Human Object Interaction without Object Templates

Aug 25, 2024

Xianghui Xie, Jan Eric Lenssen, Gerard Pons-Moll

Abstract: Tracking human object interaction from videos is important to understand human behavior from the rapidly growing stream of video data. Previous video-based methods require predefined object templates, while single-image-based methods are template-free but lack temporal consistency. In this paper, we present a method to track human object interaction without any object shape templates. We decompose the 4D tracking problem into per-frame pose tracking and canonical shape optimization. We first apply a single-view reconstruction method to obtain temporally inconsistent per-frame interaction reconstructions. Then, for the human, we propose an efficient autoencoder to predict SMPL vertices directly from the per-frame reconstructions, introducing temporally consistent correspondence. For the object, we introduce a pose estimator that leverages temporal information to predict smooth object rotations under occlusions. To train our model, we propose a method to generate synthetic interaction videos and synthesize a total of 10 hours of video across 8.5k sequences with full 3D ground truth. Experiments on BEHAVE and InterCap show that our method significantly outperforms previous template-based video tracking and single-frame reconstruction methods. Our proposed synthetic video dataset also allows training video-based methods that generalize to real-world videos. Our code and dataset will be publicly released.

* 17 pages, 13 figures and 6 tables. Project page: https://virtualhumans.mpi-inf.mpg.de/InterTrack/

Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets

Aug 22, 2024

Wolfgang Boettcher, Lukas Hoyer, Ozan Unal, Jan Eric Lenssen, Bernt Schiele

Abstract: In this work, we introduce Scribbles for All, a label and training data generation algorithm for semantic segmentation trained on scribble labels. Training or fine-tuning semantic segmentation models with weak supervision has recently become an important topic and has seen significant advances in model quality. In this setting, scribbles are a promising label type to achieve high-quality segmentation results while requiring a much lower annotation effort than the usual pixel-wise dense semantic segmentation annotations. The main limitation of scribbles as a source of weak supervision is the lack of challenging datasets for scribble segmentation, which hinders the development of novel methods and conclusive evaluations. To overcome this limitation, Scribbles for All provides scribble labels for several popular segmentation datasets and provides an algorithm to automatically generate scribble labels for any dataset with dense annotations, paving the way for new insights and model advancements in the field of weakly supervised segmentation. In addition to providing datasets and an algorithm, we evaluate state-of-the-art segmentation models on our datasets and show that models trained with our synthetic labels perform competitively with respect to models trained on manual labels. Thus, our datasets enable state-of-the-art research into methods for scribble-labeled semantic segmentation. The datasets, scribble generation algorithm, and baselines are publicly available at https://github.com/wbkit/Scribbles4All

* under review

Improving 2D Feature Representations by 3D-Aware Fine-Tuning

Jul 29, 2024

Yuanwen Yue, Anurag Das, Francis Engelmann, Siyu Tang, Jan Eric Lenssen

Abstract: Current visual foundation models are trained purely on unstructured 2D data, limiting their understanding of the 3D structure of objects and scenes. In this work, we show that fine-tuning on 3D-aware data improves the quality of emerging semantic features. We design a method to lift semantic 2D features into an efficient 3D Gaussian representation, which allows us to re-render them for arbitrary views. Using the rendered 3D-aware features, we design a fine-tuning strategy to transfer such 3D awareness into a 2D foundation model. We demonstrate that models fine-tuned in that way produce features that readily improve downstream task performance in semantic segmentation and depth estimation through simple linear probing. Notably, though fine-tuned on a single indoor dataset, the improvement is transferable to a variety of indoor datasets and out-of-domain datasets. We hope our study encourages the community to consider injecting 3D awareness when training 2D foundation models. Project page: https://ywyue.github.io/FiT3D.

* ECCV 2024. Project page: https://ywyue.github.io/FiT3D
