Abstract: Uncertainty quantification (UQ) is essential for assessing the reliability of Earth observation (EO) products. However, the extensive use of machine learning models in EO introduces an additional layer of complexity, as these models are themselves inherently uncertain. While various UQ methods exist for machine learning models, their performance on EO datasets remains largely unevaluated. A key challenge in the community is the absence of ground truth for uncertainty, i.e., how certain the uncertainty estimates are, beyond the labels available for the image/signal. This article fills this gap by introducing three benchmark datasets specifically designed for UQ in EO machine learning models. These datasets address three common problem types in EO: regression, image segmentation, and scene classification. They enable a transparent comparison of different UQ methods for EO machine learning models. We describe the creation and characteristics of each dataset, including data sources, preprocessing steps, and label generation, with a particular focus on calculating the reference uncertainty. We also showcase baseline performance of several machine learning models on each dataset, highlighting the utility of these benchmarks for model development and comparison. Overall, this article offers a valuable resource for researchers and practitioners working in artificial intelligence for EO, promoting a more accurate and reliable quality measure of the outputs of machine learning models. The dataset and code are accessible via https://gitlab.lrz.de/ai4eo/WG_Uncertainty.
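A minimal sketch, assuming a regression setting, of how a model's predicted uncertainties could be compared against such a reference uncertainty; the arrays, the rank correlation, and the mean absolute deviation below are illustrative choices, not the benchmark's prescribed evaluation protocol:

```python
import numpy as np
from scipy.stats import spearmanr

def uncertainty_agreement(sigma_pred, sigma_ref):
    """Compare predicted uncertainties against reference uncertainties.

    Returns the Spearman rank correlation (does the model rank uncertain
    samples the way the reference does?) and the mean absolute deviation.
    Both inputs are 1-D arrays with one standard deviation per sample.
    """
    rho, _ = spearmanr(sigma_pred, sigma_ref)
    mad = float(np.mean(np.abs(sigma_pred - sigma_ref)))
    return rho, mad

# Hypothetical example with five samples.
sigma_pred = np.array([0.2, 0.5, 0.1, 0.9, 0.4])
sigma_ref = np.array([0.3, 0.4, 0.1, 1.0, 0.5])
print(uncertainty_agreement(sigma_pred, sigma_ref))
```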
Abstract: We introduce NimbleD, an efficient self-supervised monocular depth estimation learning framework that incorporates supervision from pseudo-labels generated by a large vision model. This framework does not require camera intrinsics, enabling large-scale pre-training on publicly available videos. Our straightforward yet effective learning strategy significantly enhances the performance of fast and lightweight models without introducing any overhead, allowing them to achieve performance comparable to state-of-the-art self-supervised monocular depth estimation models. This advancement is particularly beneficial for virtual and augmented reality applications requiring low-latency inference. The source code, model weights, and acknowledgments are available at https://github.com/xapaxca/nimbled .
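The abstract does not state the exact pseudo-label loss; the following is a minimal sketch of one common way to supervise depth with pseudo-labels of unknown scale, a per-image scale-and-shift-invariant L1 loss in the style of MiDaS (tensor shapes and names are assumptions):

```python
import torch

def scale_shift_invariant_l1(pred, pseudo, eps=1e-6):
    """Align the prediction to the pseudo-label with a per-image least-squares
    scale and shift, then take an L1 loss. pred, pseudo: (B, H, W) tensors."""
    b = pred.shape[0]
    p = pred.reshape(b, -1)
    t = pseudo.reshape(b, -1)
    # Closed-form least-squares fit of s and c in s * p + c ~ t, per image.
    p_mean, t_mean = p.mean(dim=1, keepdim=True), t.mean(dim=1, keepdim=True)
    cov = ((p - p_mean) * (t - t_mean)).mean(dim=1, keepdim=True)
    var = ((p - p_mean) ** 2).mean(dim=1, keepdim=True)
    s = cov / (var + eps)
    c = t_mean - s * p_mean
    return (s * p + c - t).abs().mean()
```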
Abstract: This paper discusses the results of the third edition of the Monocular Depth Estimation Challenge (MDEC). The challenge focuses on zero-shot generalization to the challenging SYNS-Patches dataset, featuring complex scenes in natural and indoor settings. As with the previous edition, methods could use any form of supervision, i.e., supervised or self-supervised. The challenge received a total of 19 submissions outperforming the baseline on the test set; 10 of them submitted a report describing their approach, highlighting the widespread use of foundation models, such as Depth Anything, at the core of their methods. The challenge winners drastically improved the 3D F-Score performance, from 17.51% to 23.72%.
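For reference, the 3D F-Score reported above is conventionally the harmonic mean of precision and recall between predicted and ground-truth point clouds under a distance threshold; a minimal sketch (the threshold value is illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

def f_score_3d(pred_pts, gt_pts, threshold=0.1):
    """pred_pts, gt_pts: (N, 3) arrays of 3D points.

    Precision: fraction of predicted points within `threshold` of the ground
    truth; recall: fraction of ground-truth points within `threshold` of the
    prediction; F-score: their harmonic mean.
    """
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)
    precision = float(np.mean(d_pred_to_gt < threshold))
    recall = float(np.mean(d_gt_to_pred < threshold))
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```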
Abstract: Multi-modal regression is important for forecasting nonstationary processes or processes with a complex mixture of distributions. It can be tackled with multiple-hypotheses frameworks, but combining the hypotheses efficiently in a learning model is difficult. A Structured Radial Basis Function Network is presented as an ensemble of multiple hypothesis predictors for regression problems. The predictors are regression models of any type that can form centroidal Voronoi tessellations, which are a function of their losses during training. It is proved that this structured model can efficiently interpolate this tessellation and approximate the multiple-hypotheses target distribution, and that this is equivalent to interpolating the meta-loss of the predictors, the loss being a zero set of the interpolation error. The model has a fixed-point iteration algorithm between the predictors and the centers of the basis functions. Diversity in learning can be controlled parametrically by truncating the tessellation formation with the losses of individual predictors. A closed-form least-squares solution is presented, which, to the authors' knowledge, is the fastest solution in the literature for multiple hypotheses and structured predictions. Superior generalization performance and computational efficiency are achieved using only two-layer neural networks as predictors, with diversity control as a key component of success. A gradient-descent approach is introduced that is loss-agnostic with respect to the predictors. The expected value of the loss of the structured model with Gaussian basis functions is computed, showing that correlation between predictors is not an appropriate tool for diversification. The experiments show that the model outperforms the top competitors in the literature.
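As a rough illustration only (not the paper's structured model), Gaussian basis functions can combine several already-trained predictors in closed form with least squares; the centers, width, and predictor callables below are hypothetical stand-ins:

```python
import numpy as np

def gaussian_rbf(X, centers, width):
    """(N, d) inputs against (K, d) centers -> (N, K) Gaussian activations."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def fit_combiner(X, y, predictors, centers, width):
    """Gate each predictor's output with the basis functions and solve the
    combination weights in closed form by least squares."""
    P = np.stack([f(X) for f in predictors], axis=1)           # (N, M) hypotheses
    Phi = gaussian_rbf(X, centers, width)                      # (N, K) basis
    D = (Phi[:, :, None] * P[:, None, :]).reshape(len(X), -1)  # (N, K*M)
    w, *_ = np.linalg.lstsq(D, y, rcond=None)
    return w

def combine(X, predictors, centers, width, w):
    P = np.stack([f(X) for f in predictors], axis=1)
    Phi = gaussian_rbf(X, centers, width)
    D = (Phi[:, :, None] * P[:, None, :]).reshape(len(X), -1)
    return D @ w
```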
Abstract: Change detection from traditional optical images has limited capability to model changes in the height or shape of objects. Change detection using 3D point clouds from aerial LiDAR surveys can fill this gap by providing critical depth information. While most existing machine-learning-based 3D point cloud change detection methods are supervised, they severely depend on the availability of annotated training data, which is a critical limitation in practice. To circumvent this dependence, we propose an unsupervised 3D point cloud change detection method based mainly on self-supervised learning using deep clustering and contrastive learning. The proposed method also relies on an adaptation of deep change vector analysis to 3D point clouds via nearest point comparison. Experiments conducted on a publicly available real dataset show that the proposed method achieves higher performance than traditional unsupervised methods, with a gain of about 9% in mean accuracy (reaching more than 85%). Thus, it appears to be a relevant choice in scenarios where prior knowledge (labels) is not available.
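The nearest-point comparison used to adapt change vector analysis to point clouds can be sketched as follows; the feature extractor and any decision threshold are placeholders outside this snippet:

```python
import numpy as np
from scipy.spatial import cKDTree

def change_magnitude(points_t1, feats_t1, points_t2, feats_t2):
    """points_*: (N, 3) coordinates; feats_*: (N, F) per-point features,
    e.g. from a self-supervised encoder.

    For each point at time t2, find its nearest neighbour at t1 and return
    the norm of the feature difference (the change vector magnitude).
    """
    _, idx = cKDTree(points_t1).query(points_t2)
    return np.linalg.norm(feats_t2 - feats_t1[idx], axis=1)

# An unsupervised decision can then flag points whose magnitude exceeds an
# automatically chosen threshold (e.g. Otsu or a percentile).
```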
Abstract: Pansharpening enhances the spatial details of high-spectral-resolution multispectral images using the features of a high-spatial-resolution panchromatic image. There are a number of traditional pansharpening approaches, but producing an image exhibiting high spectral and spatial fidelity is still an open problem. Recently, deep learning has been used to produce promising pansharpened images; however, most of these approaches apply similar treatment to both multispectral and panchromatic images by using the same network for feature extraction. In this work, we present a novel dual attention-based two-stream network. It starts with feature extraction using two separate networks, one for each image, followed by an encoder with an attention mechanism to recalibrate the extracted features. This is followed by fusion of the features into a compact representation that is fed into an image reconstruction network to produce the pansharpened image. Experimental results on the Pl\'{e}iades dataset, using standard quantitative evaluation metrics and visual inspection, demonstrate that the proposed approach performs better than other approaches in terms of pansharpened image quality.
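A generic sketch of a dual-stream encoder with channel attention and feature fusion; channel sizes and layer counts are illustrative and do not reproduce the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style recalibration of feature channels."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x).view(x.size(0), -1, 1, 1)
        return x * w

class TwoStreamFusion(nn.Module):
    """Separate encoders for the multispectral (MS) and panchromatic (PAN)
    inputs, channel attention on each stream, then fusion and reconstruction."""
    def __init__(self, ms_bands=4, feat=32):
        super().__init__()
        self.ms_enc = nn.Sequential(nn.Conv2d(ms_bands, feat, 3, padding=1), nn.ReLU())
        self.pan_enc = nn.Sequential(nn.Conv2d(1, feat, 3, padding=1), nn.ReLU())
        self.ms_att, self.pan_att = ChannelAttention(feat), ChannelAttention(feat)
        self.reconstruct = nn.Conv2d(2 * feat, ms_bands, 3, padding=1)

    def forward(self, ms_up, pan):
        # ms_up: MS image upsampled to PAN resolution; pan: single-band PAN.
        fused = torch.cat([self.ms_att(self.ms_enc(ms_up)),
                           self.pan_att(self.pan_enc(pan))], dim=1)
        return self.reconstruct(fused)
```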
Abstract: Person re-identification (Re-ID) is one of the primary components of an automated visual surveillance system. It aims to automatically identify/search for persons in a multi-camera network with non-overlapping fields of view. Owing to its potential in various applications and its research significance, a plethora of deep learning based Re-ID approaches have been proposed in recent years. However, there exist several vision-related challenges, e.g., occlusion, pose, scale and viewpoint variance, background clutter, person misalignment, and cross-domain generalization across camera modalities, which leave the problem of Re-ID still far from being solved. The majority of the proposed approaches directly or indirectly aim to solve one or more of these existing challenges. In this context, a comprehensive review of how current Re-ID approaches address these challenges is needed to analyze and focus on particular aspects for further advancement. At present, such a focused review does not exist, and hence in this paper we present a systematic, challenge-specific literature survey of more than 230 papers published between 2015 and 2021. For the first time, a survey of this type is presented in which person Re-ID approaches are reviewed from such a solution-oriented perspective. Moreover, we present several prominent developing trends in the respective research domain, which provide a visionary perspective on ongoing person Re-ID research and will eventually help to develop practical real-world solutions.
Abstract: Pixel-level analysis of blood images plays a pivotal role in diagnosing blood-related diseases, especially anaemia. These analyses mainly rely on an accurate diagnosis of morphological deformities in shape and size and on precise pixel counting. Traditional segmentation approaches have adopted instance- or object-based methods, which are not feasible for pixel-level analysis. In the deep learning domain, convolutional neural network (CNN) models require large datasets with detailed pixel-level information for the semantic segmentation of red blood cells. In this work, we address these problems by proposing a multi-level deep convolutional encoder-decoder network along with two state-of-the-art healthy and anaemic RBC datasets. The proposed multi-level CNN model preserves the pixel-level semantic information extracted in one layer and passes it to the next layer to choose relevant features. This enables precise pixel-level counting of healthy and anaemic RBC elements along with morphological analysis. For experimental purposes, we provide the two datasets, Healthy-RBC and Anaemic-RBC. Each dataset contains 1000 images, ground-truth masks, the relevant complete blood count (CBC), and morphology reports for performance evaluation. The model was evaluated by crossmatching with the ground-truth masks, reporting IoU, individual training, validation, and testing accuracies, and global accuracies using a 5-fold training procedure. The model achieved training, validation, and testing accuracies of 0.9856, 0.9760, and 0.9720 on the Healthy-RBC dataset and 0.9736, 0.9696, and 0.9591 on the Anaemic-RBC dataset. The IoU and BFScore of the proposed model were 0.9311 and 0.9138 on the healthy dataset and 0.9032 and 0.8978 on the anaemic dataset, respectively.
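For reference, the reported IoU can be computed per class from predicted and ground-truth label masks; a minimal sketch:

```python
import numpy as np

def mean_iou(pred_mask, gt_mask, num_classes):
    """Mean intersection-over-union between integer label masks of equal shape."""
    scores = []
    for c in range(num_classes):
        inter = np.logical_and(pred_mask == c, gt_mask == c).sum()
        union = np.logical_or(pred_mask == c, gt_mask == c).sum()
        if union > 0:
            scores.append(inter / union)
    return float(np.mean(scores))
```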
Abstract: Automatic detection of weapons is significant for improving the security and well-being of individuals; nonetheless, it is a difficult task due to the large variety in the size, shape, and appearance of weapons. Viewpoint variations and occlusion make the task even more difficult. Further, current object detection algorithms process rectangular areas, whereas a long, slender rifle may cover only a small portion of such an area, with the rest containing irrelevant details. To overcome these problems, we propose a CNN architecture for orientation-aware weapon detection, which provides oriented bounding boxes with improved weapon detection performance. The proposed model predicts orientation both by treating the angle as a classification problem, dividing it into eight classes, and by treating it as a regression problem. To train our model for weapon detection, a new dataset comprising a total of 6400 weapon images was gathered from the web and then manually annotated with position-oriented bounding boxes. Our dataset provides not only oriented bounding boxes as ground truth but also horizontal bounding boxes. We also provide our dataset in the formats of several modern object detectors for further research in this area. The proposed model is evaluated on this dataset, and comparative analysis with off-the-shelf object detectors, measured with standard evaluation strategies, shows the superior performance of the proposed model. The dataset and the model implementation are made publicly available at this link: https://bit.ly/2TyZICF.
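The dual treatment of orientation, a discrete eight-way class plus a continuous angle, can be illustrated with a simple target encoding; the bin layout below is an assumption and not necessarily the paper's:

```python
def encode_orientation(angle_deg, num_bins=8):
    """Encode a box orientation as a discrete bin (classification target)
    plus a normalized offset within that bin (regression target)."""
    angle = angle_deg % 360.0
    bin_size = 360.0 / num_bins
    cls = int(angle // bin_size)                   # classification label: 0..7
    offset = (angle - cls * bin_size) / bin_size   # regression label in [0, 1)
    return cls, offset

def decode_orientation(cls, offset, num_bins=8):
    return (cls + offset) * (360.0 / num_bins)

print(encode_orientation(100.0))   # -> (2, 0.222...)
```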
Abstract: Semantic segmentation is a crucial step in many Earth observation tasks. A large quantity of pixel-level annotation is required to train deep networks for semantic segmentation. Earth observation techniques are applied to a wide variety of applications, and since classes vary widely depending on the application, domain knowledge is often required to label Earth observation images, impeding the availability of labeled training data in many Earth observation applications. To tackle these challenges, in this paper we propose an unsupervised semantic segmentation method that can be trained using just a single unlabeled scene. Remote sensing scenes are generally large. The proposed method exploits this property to sample smaller patches from the larger scene and uses deep clustering and contrastive learning to refine the weights of a lightweight deep model composed of a series of convolution layers along with embedded channel attention. After unsupervised training on the target image/scene, the model automatically segregates the major classes present in the scene and produces the segmentation map. Experimental results on the Vaihingen dataset demonstrate the efficacy of the proposed method.
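As a rough sketch of the single-scene setup, patches can be sampled from the large scene and trained with a standard contrastive (NT-Xent) loss between two augmented views of each patch; the patch size, augmentations, and encoder are placeholders, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def sample_patches(scene, patch_size=64, n=16):
    """Randomly crop n square patches from a (C, H, W) scene tensor."""
    _, H, W = scene.shape
    ys = torch.randint(0, H - patch_size, (n,)).tolist()
    xs = torch.randint(0, W - patch_size, (n,)).tolist()
    return torch.stack([scene[:, y:y + patch_size, x:x + patch_size]
                        for y, x in zip(ys, xs)])

def nt_xent(z1, z2, temperature=0.5):
    """Contrastive loss between embeddings of two views, each of shape (N, D)."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)      # (2N, D)
    sim = z @ z.t() / temperature                    # (2N, 2N) similarities
    sim.fill_diagonal_(float('-inf'))                # exclude self-pairs
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```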