Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

David T. Hoffmann

Floxels: Fast Unsupervised Voxel Based Scene Flow Estimation

Mar 06, 2025

David T. Hoffmann, Syed Haseeb Raza, Hanqiu Jiang, Denis Tananaev, Steffen Klingenhoefer, Martin Meinke

Abstract:Scene flow estimation is a foundational task for many robotic applications, including robust dynamic object detection, automatic labeling, and sensor synchronization. Two types of approaches to the problem have evolved: 1) Supervised and 2) optimization-based methods. Supervised methods are fast during inference and achieve high-quality results, however, they are limited by the need for large amounts of labeled training data and are susceptible to domain gaps. In contrast, unsupervised test-time optimization methods do not face the problem of domain gaps but usually suffer from substantial runtime, exhibit artifacts, or fail to converge to the right solution. In this work, we mitigate several limitations of existing optimization-based methods. To this end, we 1) introduce a simple voxel grid-based model that improves over the standard MLP-based formulation in multiple dimensions and 2) introduce a new multiframe loss formulation. 3) We combine both contributions in our new method, termed Floxels. On the Argoverse 2 benchmark, Floxels is surpassed only by EulerFlow among unsupervised methods while achieving comparable performance at a fraction of the computational cost. Floxels achieves a massive speedup of more than ~60 - 140x over EulerFlow, reducing the runtime from a day to 10 minutes per sequence. Over the faster but low-quality baseline, NSFP, Floxels achieves a speedup of ~14x.

* Accepted at CVPR 2025

Via

Access Paper or Ask Questions

Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Representation Learning

Apr 11, 2024

Simon Schrodi, David T. Hoffmann, Max Argus, Volker Fischer, Thomas Brox

Figure 1 for Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Representation Learning

Figure 2 for Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Representation Learning

Figure 3 for Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Representation Learning

Figure 4 for Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Representation Learning

Abstract:Contrastive vision-language models like CLIP have gained popularity for their versatile applicable learned representations in various downstream tasks. Despite their successes in some tasks, like zero-shot image recognition, they also perform surprisingly poor on other tasks, like attribute detection. Previous work has attributed these challenges to the modality gap, a separation of image and text in the shared representation space, and a bias towards objects over other factors, such as attributes. In this work we investigate both phenomena. We find that only a few embedding dimensions drive the modality gap. Further, we propose a measure for object bias and find that object bias does not lead to worse performance on other concepts, such as attributes. But what leads to the emergence of the modality gap and object bias? To answer this question we carefully designed an experimental setting which allows us to control the amount of shared information between the modalities. This revealed that the driving factor behind both, the modality gap and the object bias, is the information imbalance between images and captions.

Via

Access Paper or Ask Questions

Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems

Oct 19, 2023

David T. Hoffmann, Simon Schrodi, Nadine Behrmann, Volker Fischer, Thomas Brox

Figure 1 for Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems

Figure 2 for Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems

Figure 3 for Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems

Figure 4 for Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems

Abstract:In this work, we study rapid, step-wise improvements of the loss in transformers when being confronted with multi-step decision tasks. We found that transformers struggle to learn the intermediate tasks, whereas CNNs have no such issue on the tasks we studied. When transformers learn the intermediate task, they do this rapidly and unexpectedly after both training and validation loss saturated for hundreds of epochs. We call these rapid improvements Eureka-moments, since the transformer appears to suddenly learn a previously incomprehensible task. Similar leaps in performance have become known as Grokking. In contrast to Grokking, for Eureka-moments, both the validation and the training loss saturate before rapidly improving. We trace the problem back to the Softmax function in the self-attention block of transformers and show ways to alleviate the problem. These fixes improve training speed. The improved models reach 95% of the baseline model in just 20% of training steps while having a much higher likelihood to learn the intermediate task, lead to higher final accuracy and are more robust to hyper-parameters.

Via

Access Paper or Ask Questions

Ranking Info Noise Contrastive Estimation: Boosting Contrastive Learning via Ranked Positives

Jan 27, 2022

David T. Hoffmann, Nadine Behrmann, Juergen Gall, Thomas Brox, Mehdi Noroozi

Figure 1 for Ranking Info Noise Contrastive Estimation: Boosting Contrastive Learning via Ranked Positives

Figure 2 for Ranking Info Noise Contrastive Estimation: Boosting Contrastive Learning via Ranked Positives

Figure 3 for Ranking Info Noise Contrastive Estimation: Boosting Contrastive Learning via Ranked Positives

Figure 4 for Ranking Info Noise Contrastive Estimation: Boosting Contrastive Learning via Ranked Positives

Abstract:This paper introduces Ranking Info Noise Contrastive Estimation (RINCE), a new member in the family of InfoNCE losses that preserves a ranked ordering of positive samples. In contrast to the standard InfoNCE loss, which requires a strict binary separation of the training pairs into similar and dissimilar samples, RINCE can exploit information about a similarity ranking for learning a corresponding embedding space. We show that the proposed loss function learns favorable embeddings compared to the standard InfoNCE whenever at least noisy ranking information can be obtained or when the definition of positives and negatives is blurry. We demonstrate this for a supervised classification task with additional superclass labels and noisy similarity scores. Furthermore, we show that RINCE can also be applied to unsupervised training with experiments on unsupervised representation learning from videos. In particular, the embedding yields higher classification accuracy, retrieval rates and performs better in out-of-distribution detection than the standard InfoNCE loss.

* AAAI 2022 (Main Track)

Via

Access Paper or Ask Questions

AGORA: Avatars in Geography Optimized for Regression Analysis

Apr 29, 2021

Priyanka Patel, Chun-Hao P. Huang, Joachim Tesch, David T. Hoffmann, Shashank Tripathi, Michael J. Black

Figure 1 for AGORA: Avatars in Geography Optimized for Regression Analysis

Figure 2 for AGORA: Avatars in Geography Optimized for Regression Analysis

Figure 3 for AGORA: Avatars in Geography Optimized for Regression Analysis

Figure 4 for AGORA: Avatars in Geography Optimized for Regression Analysis

Abstract:While the accuracy of 3D human pose estimation from images has steadily improved on benchmark datasets, the best methods still fail in many real-world scenarios. This suggests that there is a domain gap between current datasets and common scenes containing people. To obtain ground-truth 3D pose, current datasets limit the complexity of clothing, environmental conditions, number of subjects, and occlusion. Moreover, current datasets evaluate sparse 3D joint locations corresponding to the major joints of the body, ignoring the hand pose and the face shape. To evaluate the current state-of-the-art methods on more challenging images, and to drive the field to address new problems, we introduce AGORA, a synthetic dataset with high realism and highly accurate ground truth. Here we use 4240 commercially-available, high-quality, textured human scans in diverse poses and natural clothing; this includes 257 scans of children. We create reference 3D poses and body shapes by fitting the SMPL-X body model (with face and hands) to the 3D scans, taking into account clothing. We create around 14K training and 3K test images by rendering between 5 and 15 people per image using either image-based lighting or rendered 3D environments, taking care to make the images physically plausible and photoreal. In total, AGORA consists of 173K individual person crops. We evaluate existing state-of-the-art methods for 3D human pose estimation on this dataset and find that most methods perform poorly on images of children. Hence, we extend the SMPL-X model to better capture the shape of children. Additionally, we fine-tune methods on AGORA and show improved performance on both AGORA and 3DPW, confirming the realism of the dataset. We provide all the registered 3D reference training data, rendered images, and a web-based evaluation site at https://agora.is.tue.mpg.de/.

* CVPR 2021

Via

Access Paper or Ask Questions

Learning Multi-Human Optical Flow

Oct 24, 2019

Anurag Ranjan, David T. Hoffmann, Dimitrios Tzionas, Siyu Tang, Javier Romero, Michael J. Black

Figure 1 for Learning Multi-Human Optical Flow

Figure 2 for Learning Multi-Human Optical Flow

Figure 3 for Learning Multi-Human Optical Flow

Figure 4 for Learning Multi-Human Optical Flow

Abstract:The optical flow of humans is well known to be useful for the analysis of human action. Recent optical flow methods focus on training deep networks to approach the problem. However, the training data used by them does not cover the domain of human motion. Therefore, we develop a dataset of multi-human optical flow and train optical flow networks on this dataset. We use a 3D model of the human body and motion capture data to synthesize realistic flow fields in both single- and multi-person images. We then train optical flow networks to estimate human flow fields from pairs of images. We demonstrate that our trained networks are more accurate than a wide range of top methods on held-out test data and that they can generalize well to real image sequences. The code, trained models and the dataset are available for research.

* arXiv admin note: text overlap with arXiv:1806.05666

Via

Access Paper or Ask Questions

Learning to Train with Synthetic Humans

Aug 02, 2019

David T. Hoffmann, Dimitrios Tzionas, Micheal J. Black, Siyu Tang

Figure 1 for Learning to Train with Synthetic Humans

Figure 2 for Learning to Train with Synthetic Humans

Figure 3 for Learning to Train with Synthetic Humans

Figure 4 for Learning to Train with Synthetic Humans

Abstract:Neural networks need big annotated datasets for training. However, manual annotation can be too expensive or even unfeasible for certain tasks, like multi-person 2D pose estimation with severe occlusions. A remedy for this is synthetic data with perfect ground truth. Here we explore two variations of synthetic data for this challenging problem; a dataset with purely synthetic humans and a real dataset augmented with synthetic humans. We then study which approach better generalizes to real data, as well as the influence of virtual humans in the training loss. Using the augmented dataset, without considering synthetic humans in the loss, leads to the best results. We observe that not all synthetic samples are equally informative for training, while the informative samples are different for each training stage. To exploit this observation, we employ an adversarial student-teacher framework; the teacher improves the student by providing the hardest samples for its current state as a challenge. Experiments show that the student-teacher framework outperforms normal training on the purely synthetic dataset.

* In German Conference on Pattern Recognition (GCPR)

Via

Access Paper or Ask Questions