Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Michael Heizmann

Fretting-Transformer: Encoder-Decoder Model for MIDI to Tablature Transcription

Jun 17, 2025

Anna Hamberger, Sebastian Murgul, Jochen Schmidt, Michael Heizmann

Abstract:Music transcription plays a pivotal role in Music Information Retrieval (MIR), particularly for stringed instruments like the guitar, where symbolic music notations such as MIDI lack crucial playability information. This contribution introduces the Fretting-Transformer, an encoderdecoder model that utilizes a T5 transformer architecture to automate the transcription of MIDI sequences into guitar tablature. By framing the task as a symbolic translation problem, the model addresses key challenges, including string-fret ambiguity and physical playability. The proposed system leverages diverse datasets, including DadaGP, GuitarToday, and Leduc, with novel data pre-processing and tokenization strategies. We have developed metrics for tablature accuracy and playability to quantitatively evaluate the performance. The experimental results demonstrate that the Fretting-Transformer surpasses baseline methods like A* and commercial applications like Guitar Pro. The integration of context-sensitive processing and tuning/capo conditioning further enhances the model's performance, laying a robust foundation for future developments in automated guitar transcription.

* Accepted to the 50th International Computer Music Conference (ICMC), 2025

Via

Access Paper or Ask Questions

6D Pose Estimation on Point Cloud Data through Prior Knowledge Integration: A Case Study in Autonomous Disassembly

May 30, 2025

Chengzhi Wu, Hao Fu, Jan-Philipp Kaiser, Erik Tabuchi Barczak, Julius Pfrommer, Gisela Lanza, Michael Heizmann, Jürgen Beyerer

Abstract:The accurate estimation of 6D pose remains a challenging task within the computer vision domain, even when utilizing 3D point cloud data. Conversely, in the manufacturing domain, instances arise where leveraging prior knowledge can yield advancements in this endeavor. This study focuses on the disassembly of starter motors to augment the engineering of product life cycles. A pivotal objective in this context involves the identification and 6D pose estimation of bolts affixed to the motors, facilitating automated disassembly within the manufacturing workflow. Complicating matters, the presence of occlusions and the limitations of single-view data acquisition, notably when motors are placed in a clamping system, obscure certain portions and render some bolts imperceptible. Consequently, the development of a comprehensive pipeline capable of acquiring complete bolt information is imperative to avoid oversight in bolt detection. In this paper, employing the task of bolt detection within the scope of our project as a pertinent use case, we introduce a meticulously devised pipeline. This multi-stage pipeline effectively captures the 6D information with regard to all bolts on the motor, thereby showcasing the effective utilization of prior knowledge in handling this challenging task. The proposed methodology not only contributes to the field of 6D pose estimation but also underscores the viability of integrating domain-specific insights to tackle complex problems in manufacturing and automation.

Via

Access Paper or Ask Questions

Randomized 3D Scene Generation for Generalizable Self-supervised Pre-training

Jun 07, 2023

Lanxiao Li, Michael Heizmann

Abstract:Capturing and labeling real-world 3D data is laborious and time-consuming, which makes it costly to train strong 3D models. To address this issue, previous works generate randomized 3D scenes and pre-train models on generated data. Although the pre-trained models gain promising performance boosts, previous works have two major shortcomings. First, they focus on only one downstream task (i.e., object detection). Second, a fair comparison of generated data is still lacking. In this work, we systematically compare data generation methods using a unified setup. To clarify the generalization of the pre-trained models, we evaluate their performance in multiple tasks (e.g., object detection and semantic segmentation) and with different pre-training methods (e.g., masked autoencoder and contrastive learning). Moreover, we propose a new method to generate 3D scenes with spherical harmonics. It surpasses the previous formula-driven method with a clear margin and achieves on-par results with methods using real-world scans and CAD models.

Via

Access Paper or Ask Questions

Tinto: Multisensor Benchmark for 3D Hyperspectral Point Cloud Segmentation in the Geosciences

May 17, 2023

Ahmed J. Afifi, Samuel T. Thiele, Sandra Lorenz, Pedram Ghamisi, Raimon Tolosana-Delgado, Moritz Kirsch, Richard Gloaguen, Michael Heizmann

Abstract:The increasing use of deep learning techniques has reduced interpretation time and, ideally, reduced interpreter bias by automatically deriving geological maps from digital outcrop models. However, accurate validation of these automated mapping approaches is a significant challenge due to the subjective nature of geological mapping and the difficulty in collecting quantitative validation data. Additionally, many state-of-the-art deep learning methods are limited to 2D image data, which is insufficient for 3D digital outcrops, such as hyperclouds. To address these challenges, we present Tinto, a multi-sensor benchmark digital outcrop dataset designed to facilitate the development and validation of deep learning approaches for geological mapping, especially for non-structured 3D data like point clouds. Tinto comprises two complementary sets: 1) a real digital outcrop model from Corta Atalaya (Spain), with spectral attributes and ground-truth data, and 2) a synthetic twin that uses latent features in the original datasets to reconstruct realistic spectral data (including sensor noise and processing artifacts) from the ground-truth. The point cloud is dense and contains 3,242,964 labeled points. We used these datasets to explore the abilities of different deep learning approaches for automated geological mapping. By making Tinto publicly available, we hope to foster the development and adaptation of new deep learning tools for 3D applications in Earth sciences. The dataset can be accessed through this link: https://doi.org/10.14278/rodare.2256.

Via

Access Paper or Ask Questions

Applying Plain Transformers to Real-World Point Clouds

Mar 04, 2023

Lanxiao Li, Michael Heizmann

Abstract:Due to the lack of inductive bias, transformer-based models usually require a large amount of training data. The problem is especially concerning in 3D vision, as 3D data are harder to acquire and annotate. To overcome this problem, previous works modify the architecture of transformers to incorporate inductive biases by applying, e.g., local attention and down-sampling. Although they have achieved promising results, earlier works on transformers for point clouds have two issues. First, the power of plain transformers is still under-explored. Second, they focus on simple and small point clouds instead of complex real-world ones. This work revisits the plain transformers in real-world point cloud understanding. We first take a closer look at some fundamental components of plain transformers, e.g., patchifier and positional embedding, for both efficiency and performance. To close the performance gap due to the lack of inductive bias and annotated data, we investigate self-supervised pre-training with masked autoencoder (MAE). Specifically, we propose drop patch, which prevents information leakage and significantly improves the effectiveness of MAE. Our models achieve SOTA results in semantic segmentation on the S3DIS dataset and object detection on the ScanNet dataset with lower computational costs. Our work provides a new baseline for future research on transformers for point clouds.

Via

Access Paper or Ask Questions

MotorFactory: A Blender Add-on for Large Dataset Generation of Small Electric Motors

Jan 11, 2023

Chengzhi Wu, Kanran Zhou, Jan-Philipp Kaiser, Norbert Mitschke, Jan-Felix Klein, Julius Pfrommer, Jürgen Beyerer, Gisela Lanza, Michael Heizmann, Kai Furmans

Figure 1 for MotorFactory: A Blender Add-on for Large Dataset Generation of Small Electric Motors

Figure 2 for MotorFactory: A Blender Add-on for Large Dataset Generation of Small Electric Motors

Figure 3 for MotorFactory: A Blender Add-on for Large Dataset Generation of Small Electric Motors

Figure 4 for MotorFactory: A Blender Add-on for Large Dataset Generation of Small Electric Motors

Abstract:To enable automatic disassembly of different product types with uncertain conditions and degrees of wear in remanufacturing, agile production systems that can adapt dynamically to changing requirements are needed. Machine learning algorithms can be employed due to their generalization capabilities of learning from various types and variants of products. However, in reality, datasets with a diversity of samples that can be used to train models are difficult to obtain in the initial period. This may cause bad performances when the system tries to adapt to new unseen input data in the future. In order to generate large datasets for different learning purposes, in our project, we present a Blender add-on named MotorFactory to generate customized mesh models of various motor instances. MotorFactory allows to create mesh models which, complemented with additional add-ons, can be further used to create synthetic RGB images, depth images, normal images, segmentation ground truth masks, and 3D point cloud datasets with point-wise semantic labels. The created synthetic datasets may be used for various tasks including motor type classification, object detection for decentralized material transfer tasks, part segmentation for disassembly and handling tasks, or even reinforcement learning-based robotics control or view-planning.

Via

Access Paper or Ask Questions

Radar Odometry on SE(3) with Constant Acceleration Motion Prior and Polar Measurement Model

Sep 11, 2022

Kyle Retan, Frasher Loshaj, Michael Heizmann

Figure 1 for Radar Odometry on SE(3) with Constant Acceleration Motion Prior and Polar Measurement Model

Figure 2 for Radar Odometry on SE(3) with Constant Acceleration Motion Prior and Polar Measurement Model

Figure 3 for Radar Odometry on SE(3) with Constant Acceleration Motion Prior and Polar Measurement Model

Figure 4 for Radar Odometry on SE(3) with Constant Acceleration Motion Prior and Polar Measurement Model

Abstract:This paper presents an approach to radar odometry on $SE(3)$ which utilizes a constant acceleration motion prior. The motion prior is integrated into a sliding window optimization scheme. We use the Magnus expansion to accurately integrate the motion prior while maintaining real-time performance. In addition, we adopt a polar measurement model to better represent radar detection uncertainties. Our estimator is evaluated using a large real-world dataset from a prototype high-resolution radar sensor. The new motion prior and measurement model signifcantly improve odometry performance relative to the constant velocity motion prior and Cartesian measurement model from our previous work, particularly in roll, pitch and height.

Via

Access Paper or Ask Questions

A Closer Look at Invariances in Self-supervised Pre-training for 3D Vision

Jul 13, 2022

Lanxiao Li, Michael Heizmann

Figure 1 for A Closer Look at Invariances in Self-supervised Pre-training for 3D Vision

Figure 2 for A Closer Look at Invariances in Self-supervised Pre-training for 3D Vision

Figure 3 for A Closer Look at Invariances in Self-supervised Pre-training for 3D Vision

Figure 4 for A Closer Look at Invariances in Self-supervised Pre-training for 3D Vision

Abstract:Self-supervised pre-training for 3D vision has drawn increasing research interest in recent years. In order to learn informative representations, a lot of previous works exploit invariances of 3D features, e.g., perspective-invariance between views of the same scene, modality-invariance between depth and RGB images, format-invariance between point clouds and voxels. Although they have achieved promising results, previous researches lack a systematic and fair comparison of these invariances. To address this issue, our work, for the first time, introduces a unified framework, under which various pre-training methods can be investigated. We conduct extensive experiments and provide a closer look at the contributions of different invariances in 3D pre-training. Also, we propose a simple but effective method that jointly pre-trains a 3D encoder and a depth map encoder using contrastive learning. Models pre-trained with our method gain significant performance boost in downstream tasks. For instance, a pre-trained VoteNet outperforms previous methods on SUN RGB-D and ScanNet object detection benchmarks with a clear margin.

* Accepted on ECCV 2022

Via

Access Paper or Ask Questions

Spectral Reconstruction and Disparity from Spatio-Spectrally Coded Light Fields via Multi-Task Deep Learning

Mar 18, 2021

Maximilian Schambach, Jiayang Shi, Michael Heizmann

Figure 1 for Spectral Reconstruction and Disparity from Spatio-Spectrally Coded Light Fields via Multi-Task Deep Learning

Figure 2 for Spectral Reconstruction and Disparity from Spatio-Spectrally Coded Light Fields via Multi-Task Deep Learning

Figure 3 for Spectral Reconstruction and Disparity from Spatio-Spectrally Coded Light Fields via Multi-Task Deep Learning

Figure 4 for Spectral Reconstruction and Disparity from Spatio-Spectrally Coded Light Fields via Multi-Task Deep Learning

Abstract:We present a novel method to reconstruct a spectral central view and its aligned disparity map from spatio-spectrally coded light fields. Since we do not reconstruct an intermediate full light field from the coded measurement, we refer to this as principal reconstruction. The coded light fields correspond to those captured by a light field camera in the unfocused design with a spectrally coded microlens array. In this application, the spectrally coded light field camera can be interpreted as a single-shot spectral depth camera. We investigate several multi-task deep learning methods and propose a new auxiliary loss-based training strategy to enhance the reconstruction performance. The results are evaluated using a synthetic as well as a new real-world spectral light field dataset that we captured using a custom-built camera. The results are compared to state-of-the art compressed sensing reconstruction and disparity estimation. We achieve a high reconstruction quality for both synthetic and real-world coded light fields. The disparity estimation quality is on par with or even outperforms state-of-the-art disparity estimation from uncoded RGB light fields.

Via

Access Paper or Ask Questions