Abstract: Annotation errors are a challenge not only during the training of machine learning models, but also during their evaluation. Label variations and inaccuracies in datasets often manifest as contradictory examples that deviate from established labeling conventions. Such inconsistencies, when significant, prevent models from achieving optimal performance on metrics such as mean Average Precision (mAP). We introduce the notion of "label convergence" to describe the highest achievable performance under the constraint of contradictory test annotations, essentially defining an upper bound on model accuracy. Recognizing that noise is an inherent characteristic of all data, our study analyzes five real-world datasets, including the LVIS dataset, to investigate the phenomenon of label convergence. We estimate that label convergence lies between 62.63 and 67.52 mAP@[0.5:0.95:0.05] for LVIS with 95% confidence, attributing these bounds to the presence of real annotation errors. With current state-of-the-art (SOTA) models at the upper end of the label convergence interval for the well-studied LVIS dataset, we conclude that model capacity is sufficient to solve current object detection problems. Therefore, future efforts should focus on three key aspects: (1) updating the problem specification and adjusting evaluation practices to account for unavoidable label noise, (2) creating cleaner data, especially test data, and (3) including multi-annotated data to investigate annotation variation and make these issues visible from the outset.
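The notation mAP@[0.5:0.95:0.05] used above refers to averaging AP over ten IoU thresholds from 0.5 to 0.95 in steps of 0.05. The following is a minimal sketch of that averaging step only, assuming axis-aligned boxes given as (x1, y1, x2, y2) and per-threshold AP values computed elsewhere; the function names are illustrative and are not taken from the paper.

```python
def iou_thresholds(start=0.5, stop=0.95, step=0.05):
    """Return the IoU thresholds start, start+step, ..., stop (inclusive)."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_over_thresholds(ap_per_threshold):
    """Average AP values that were computed separately at each IoU threshold.

    ap_per_threshold maps an IoU threshold to the AP obtained when a
    prediction counts as correct only if its IoU with a ground-truth
    box reaches that threshold (AP computation itself is not shown).
    """
    return sum(ap_per_threshold.values()) / len(ap_per_threshold)
```

A prediction with IoU 0.6 against its ground-truth box counts as a true positive at thresholds 0.5, 0.55, and 0.6 but as a false positive at the remaining seven, which is why small localization disagreements between annotators can lower the achievable score.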
Abstract: For effective structural damage assessment, instances of damage need to be localized in the world of a 3D model. Due to a lack of data, the detection of structural anomalies cannot currently be learned and performed directly in 3D space. In this work, a three-stage approach is presented that leverages the strong performance of image-level detection models to segment instances of anomalies in 3D space. In the detection stage, semantic segmentation predictions are produced at the image level. The mapping stage transfers the image-level predictions onto the respective point cloud. In the extraction stage, 3D anomaly instances are extracted from the segmented point cloud. Cloud contraction is used to transform cracks into their medial axis representation. For areal anomalies, the bounding polygon is extracted by means of alpha shapes. The approach covers the classes crack, spalling, and corrosion, and the three image-level segmentation models TopoCrack, nnU-Net, and DetectionHMA are compared. Given a localization tolerance of 4 cm, IoUs of over 90% can be achieved for crack and corrosion and 41% for spalling, which appears to be a particularly challenging class. Instance-level detection, measured in AP, is about 45% for crack and spalling and 73% for corrosion.
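The mapping stage described above can be illustrated with a simple pinhole projection. This is only a sketch under strong assumptions that the abstract does not state: points are already expressed in the camera frame, the intrinsics (fx, fy, cx, cy) are known, lens distortion and occlusion are ignored, and only one image is used. The function name and parameters are hypothetical.

```python
def map_labels_to_points(points_cam, fx, fy, cx, cy, label_mask, background=0):
    """Assign each 3D point the class of the pixel it projects onto.

    points_cam: iterable of (x, y, z) in the camera frame, z pointing forward.
    label_mask: 2D list indexed as label_mask[row][col] holding class ids
                from an image-level semantic segmentation.
    Points behind the camera or outside the image receive `background`.
    """
    height = len(label_mask)
    width = len(label_mask[0])
    labels = []
    for x, y, z in points_cam:
        if z <= 0:
            # Point lies behind the camera and cannot be seen.
            labels.append(background)
            continue
        # Pinhole projection to pixel coordinates (column u, row v).
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            labels.append(label_mask[v][u])
        else:
            labels.append(background)
    return labels
```

In practice a point is usually visible in several images, so per-image labels would need to be fused (for example by voting) and hidden points filtered out; neither step is shown here.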
Abstract: The reliability of supervised machine learning systems depends on the accuracy and availability of ground truth labels. However, the process of human annotation, being prone to error, introduces the potential for noisy labels, which can impede the practicality of these systems. While training with noisy labels is a significant consideration, the reliability of test data is also crucial to ascertain the dependability of the results. A common approach to addressing this issue is repeated labeling, where multiple annotators label the same example, and their labels are combined to provide a better estimate of the true label. In this paper, we propose a novel localization algorithm that adapts well-established ground truth estimation methods for object detection and instance segmentation tasks. The key innovation of our method lies in its ability to transform combined localization and classification tasks into classification-only problems, thus enabling the application of techniques such as Expectation-Maximization (EM) or Majority Voting (MJV). Although our main focus is the aggregation of unique ground truth for test data, our algorithm also shows superior performance during training on the TexBiG dataset, surpassing both noisy label training and label aggregation using Weighted Boxes Fusion (WBF). Our experiments indicate that the benefits of repeated labels emerge under specific dataset and annotation configurations. The key factors appear to be (1) dataset complexity, (2) annotator consistency, and (3) the given annotation budget constraints.
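To make the idea of reducing localization plus classification to a classification-only problem more concrete, the following sketch groups overlapping boxes from different annotators and then applies majority voting per group. It is an illustration, not the algorithm proposed in the paper: the greedy IoU grouping, the 0.5 threshold, the treatment of a missing annotation as a vote for a hypothetical "background" class, and the choice of the first box as the group's box are all assumptions.

```python
from collections import Counter


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def group_annotations(annotations, iou_thr=0.5):
    """Greedily group (annotator_id, box, label) tuples by box overlap.

    An annotation joins the first group whose first box overlaps it with
    IoU >= iou_thr, provided that annotator is not already in the group.
    """
    groups = []
    for annotator, box, label in annotations:
        for group in groups:
            members = {a for a, _, _ in group}
            if annotator not in members and box_iou(group[0][1], box) >= iou_thr:
                group.append((annotator, box, label))
                break
        else:
            groups.append([(annotator, box, label)])
    return groups


def majority_vote(groups, num_annotators, background="background"):
    """Pick one label per group; annotators who marked nothing vote background."""
    result = []
    for group in groups:
        votes = Counter(label for _, _, label in group)
        votes[background] += num_annotators - len(group)
        label, _ = votes.most_common(1)[0]
        if label != background:
            # Keep the first box of the group; no box fusion is attempted.
            result.append((group[0][1], label))
    return result
```

Once each group is a set of per-annotator class votes, the voting step could be swapped for another aggregation method such as EM, which is the kind of substitution the abstract describes.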