Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Rainer Lienhart

MMMS: Multi-Modal Multi-Surface Interactive Segmentation

Sep 16, 2025

Robin Schön, Julian Lorenz, Katja Ludwig, Daniel Kienzle, Rainer Lienhart

Abstract:In this paper, we present a method to interactively create segmentation masks on the basis of user clicks. We pay particular attention to the segmentation of multiple surfaces that are simultaneously present in the same image. Since these surfaces may be heavily entangled and adjacent, we also present a novel extended evaluation metric that accounts for the challenges of this scenario. Additionally, the presented method is able to use multi-modal inputs to facilitate the segmentation task. At the center of this method is a network architecture which takes as input an RGB image, a number of non-RGB modalities, an erroneous mask, and encoded clicks. Based on this input, the network predicts an improved segmentation mask. We design our architecture such that it adheres to two conditions: (1) The RGB backbone is only available as a black-box. (2) To reduce the response time, we want our model to integrate the interaction-specific information after the image feature extraction and the multi-modal fusion. We refer to the overall task as Multi-Modal Multi-Surface interactive segmentation (MMMS). We are able to show the effectiveness of our multi-modal fusion strategy. Using additional modalities, our system reduces the NoC@90 by up to 1.28 clicks per surface on average on DeLiVER and up to 1.19 on MFNet. On top of this, we are able to show that our RGB-only baseline achieves competitive, and in some cases even superior performance when tested in a classical, single-mask interactive segmentation scenario.

* 19 pages, 11 figures, 10 pages

Via

Access Paper or Ask Questions

Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models

Jul 03, 2025

Christian Jaumann, Annemarie Friedrich, Rainer Lienhart

Abstract:This paper describes our system for the SciVQA 2025 Shared Task on Scientific Visual Question Answering. Our system employs an ensemble of two Multimodal Large Language Models and various few-shot example retrieval strategies. The model and few-shot setting are selected based on the figure and question type. We also select answers based on the models' confidence levels. On the blind test data, our system ranks third out of seven with an average F1 score of 85.12 across ROUGE-1, ROUGE-L, and BERTS. Our code is publicly available.

* Accepted at 5th Workshop on Scholarly Document Processing @ ACL 2025

Via

Access Paper or Ask Questions

CoPa-SG: Dense Scene Graphs with Parametric and Proto-Relations

Jun 26, 2025

Julian Lorenz, Mrunmai Phatak, Robin Schön, Katja Ludwig, Nico Hörmann, Annemarie Friedrich, Rainer Lienhart

Abstract:2D scene graphs provide a structural and explainable framework for scene understanding. However, current work still struggles with the lack of accurate scene graph data. To overcome this data bottleneck, we present CoPa-SG, a synthetic scene graph dataset with highly precise ground truth and exhaustive relation annotations between all objects. Moreover, we introduce parametric and proto-relations, two new fundamental concepts for scene graphs. The former provides a much more fine-grained representation than its traditional counterpart by enriching relations with additional parameters such as angles or distances. The latter encodes hypothetical relations in a scene graph and describes how relations would form if new objects are placed in the scene. Using CoPa-SG, we compare the performance of various scene graph generation models. We demonstrate how our new relation types can be integrated in downstream applications to enhance planning and reasoning capabilities.

Via

Access Paper or Ask Questions

HOIverse: A Synthetic Scene Graph Dataset With Human Object Interactions

Jun 24, 2025

Mrunmai Vivek Phatak, Julian Lorenz, Nico Hörmann, Jörg Hähner, Rainer Lienhart

Abstract:When humans and robotic agents coexist in an environment, scene understanding becomes crucial for the agents to carry out various downstream tasks like navigation and planning. Hence, an agent must be capable of localizing and identifying actions performed by the human. Current research lacks reliable datasets for performing scene understanding within indoor environments where humans are also a part of the scene. Scene Graphs enable us to generate a structured representation of a scene or an image to perform visual scene understanding. To tackle this, we present HOIverse a synthetic dataset at the intersection of scene graph and human-object interaction, consisting of accurate and dense relationship ground truths between humans and surrounding objects along with corresponding RGB images, segmentation masks, depth images and human keypoints. We compute parametric relations between various pairs of objects and human-object pairs, resulting in an accurate and unambiguous relation definitions. In addition, we benchmark our dataset on state-of-the-art scene graph generation models to predict parametric relations and human-object interactions. Through this dataset, we aim to accelerate research in the field of scene understanding involving people.

Via

Access Paper or Ask Questions

Towards Ball Spin and Trajectory Analysis in Table Tennis Broadcast Videos via Physically Grounded Synthetic-to-Real Transfer

Apr 28, 2025

Daniel Kienzle, Robin Schön, Rainer Lienhart, Shin'Ichi Satoh

Abstract:Analyzing a player's technique in table tennis requires knowledge of the ball's 3D trajectory and spin. While, the spin is not directly observable in standard broadcasting videos, we show that it can be inferred from the ball's trajectory in the video. We present a novel method to infer the initial spin and 3D trajectory from the corresponding 2D trajectory in a video. Without ground truth labels for broadcast videos, we train a neural network solely on synthetic data. Due to the choice of our input data representation, physically correct synthetic training data, and using targeted augmentations, the network naturally generalizes to real data. Notably, these simple techniques are sufficient to achieve generalization. No real data at all is required for training. To the best of our knowledge, we are the first to present a method for spin and trajectory prediction in simple monocular broadcast videos, achieving an accuracy of 92.0% in spin classification and a 2D reprojection error of 0.19% of the image diagonal.

* To be published in 2025 IEEE/CVF International Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Via

Access Paper or Ask Questions

Efficient 2D to Full 3D Human Pose Uplifting including Joint Rotations

Apr 14, 2025

Katja Ludwig, Yuliia Oksymets, Robin Schön, Daniel Kienzle, Rainer Lienhart

Abstract:In sports analytics, accurately capturing both the 3D locations and rotations of body joints is essential for understanding an athlete's biomechanics. While Human Mesh Recovery (HMR) models can estimate joint rotations, they often exhibit lower accuracy in joint localization compared to 3D Human Pose Estimation (HPE) models. Recent work addressed this limitation by combining a 3D HPE model with inverse kinematics (IK) to estimate both joint locations and rotations. However, IK is computationally expensive. To overcome this, we propose a novel 2D-to-3D uplifting model that directly estimates 3D human poses, including joint rotations, in a single forward pass. We investigate multiple rotation representations, loss functions, and training strategies - both with and without access to ground truth rotations. Our models achieve state-of-the-art accuracy in rotation estimation, are 150 times faster than the IK-based approach, and surpass HMR models in joint localization precision.

* accepted at CVSports@CVPR'25

Via

Access Paper or Ask Questions

On the Importance of Conditioning for Privacy-Preserving Data Augmentation

Apr 08, 2025

Julian Lorenz, Katja Ludwig, Valentin Haug, Rainer Lienhart

Abstract:Latent diffusion models can be used as a powerful augmentation method to artificially extend datasets for enhanced training. To the human eye, these augmented images look very different to the originals. Previous work has suggested to use this data augmentation technique for data anonymization. However, we show that latent diffusion models that are conditioned on features like depth maps or edges to guide the diffusion process are not suitable as a privacy preserving method. We use a contrastive learning approach to train a model that can correctly identify people out of a pool of candidates. Moreover, we demonstrate that anonymization using conditioned diffusion models is susceptible to black box attacks. We attribute the success of the described methods to the conditioning of the latent diffusion model in the anonymization process. The diffusion model is instructed to produce similar edges for the anonymized images. Hence, a model can learn to recognize these patterns for identification.

Via

Access Paper or Ask Questions

SkipClick: Combining Quick Responses and Low-Level Features for Interactive Segmentation in Winter Sports Contexts

Jan 14, 2025

Robin Schön, Julian Lorenz, Daniel Kienzle, Rainer Lienhart

Abstract:In this paper, we present a novel architecture for interactive segmentation in winter sports contexts. The field of interactive segmentation deals with the prediction of high-quality segmentation masks by informing the network about the objects position with the help of user guidance. In our case the guidance consists of click prompts. For this task, we first present a baseline architecture which is specifically geared towards quickly responding after each click. Afterwards, we motivate and describe a number of architectural modifications which improve the performance when tasked with segmenting winter sports equipment on the WSESeg dataset. With regards to the average NoC@85 metric on the WSESeg classes, we outperform SAM and HQ-SAM by 2.336 and 7.946 clicks, respectively. When applied to the HQSeg-44k dataset, our system delivers state-of-the-art results with a NoC@90 of 6.00 and NoC@95 of 9.89. In addition to that, we test our model on a novel dataset containing masks for humans during skiing.

* 4 figures, 6 tables, 12 pages

Via

Access Paper or Ask Questions

Harnessing Event Sensory Data for Error Pattern Prediction in Vehicles: A Language Model Approach

Dec 17, 2024

Hugo Math, Rainer Lienhart, Robin Schön

Figure 1 for Harnessing Event Sensory Data for Error Pattern Prediction in Vehicles: A Language Model Approach

Figure 2 for Harnessing Event Sensory Data for Error Pattern Prediction in Vehicles: A Language Model Approach

Figure 3 for Harnessing Event Sensory Data for Error Pattern Prediction in Vehicles: A Language Model Approach

Figure 4 for Harnessing Event Sensory Data for Error Pattern Prediction in Vehicles: A Language Model Approach

Abstract:In this paper, we draw an analogy between processing natural languages and processing multivariate event streams from vehicles in order to predict $\textit{when}$ and $\textit{what}$ error pattern is most likely to occur in the future for a given car. Our approach leverages the temporal dynamics and contextual relationships of our event data from a fleet of cars. Event data is composed of discrete values of error codes as well as continuous values such as time and mileage. Modelled by two causal Transformers, we can anticipate vehicle failures and malfunctions before they happen. Thus, we introduce $\textit{CarFormer}$, a Transformer model trained via a new self-supervised learning strategy, and $\textit{EPredictor}$, an autoregressive Transformer decoder model capable of predicting $\textit{when}$ and $\textit{what}$ error pattern will most likely occur after some error code apparition. Despite the challenges of high cardinality of event types, their unbalanced frequency of appearance and limited labelled data, our experimental results demonstrate the excellent predictive ability of our novel model. Specifically, with sequences of $160$ error codes on average, our model is able with only half of the error codes to achieve $80\%$ F1 score for predicting $\textit{what}$ error pattern will occur and achieves an average absolute error of $58.4 \pm 13.2$h $\textit{when}$ forecasting the time of occurrence, thus enabling confident predictive maintenance and enhancing vehicle safety.

* 10 pages, 8 figures, accepted to AAAI 2025

Via

Access Paper or Ask Questions

Leveraging Anthropometric Measurements to Improve Human Mesh Estimation and Ensure Consistent Body Shapes

Sep 27, 2024

Katja Ludwig, Julian Lorenz, Daniel Kienzle, Tuan Bui, Rainer Lienhart

Figure 1 for Leveraging Anthropometric Measurements to Improve Human Mesh Estimation and Ensure Consistent Body Shapes

Figure 2 for Leveraging Anthropometric Measurements to Improve Human Mesh Estimation and Ensure Consistent Body Shapes

Figure 3 for Leveraging Anthropometric Measurements to Improve Human Mesh Estimation and Ensure Consistent Body Shapes

Figure 4 for Leveraging Anthropometric Measurements to Improve Human Mesh Estimation and Ensure Consistent Body Shapes

Abstract:The basic body shape of a person does not change within a single video. However, most SOTA human mesh estimation (HME) models output a slightly different body shape for each video frame, which results in inconsistent body shapes for the same person. In contrast, we leverage anthropometric measurements like tailors are already obtaining from humans for centuries. We create a model called A2B that converts such anthropometric measurements to body shape parameters of human mesh models. Moreover, we find that finetuned SOTA 3D human pose estimation (HPE) models outperform HME models regarding the precision of the estimated keypoints. We show that applying inverse kinematics (IK) to the results of such a 3D HPE model and combining the resulting body pose with the A2B body shape leads to superior and consistent human meshes for challenging datasets like ASPset or fit3D, where we can lower the MPJPE by over 30 mm compared to SOTA HME models. Further, replacing HME models estimates of the body shape parameters with A2B model results not only increases the performance of these HME models, but also leads to consistent body shapes.

Via

Access Paper or Ask Questions