Department of Computer Science, University of Kaiserslautern-Landau, Kaiserslautern, Rhineland-Palatinate, Germany, German Research Center for Artificial Intelligence, DFKI GmbH, Kaiserslautern, Rhineland-Palatinate, Germany
Abstract:In recent years, the development of robust multi-source models has emerged in the Earth Observation (EO) field. These are models that leverage data from diverse sources to improve predictive accuracy when there is missing data. Despite these advancements, the factors influencing the varying effectiveness of such models remain poorly understood. In this study, we evaluate the predictive performance of six state-of-the-art multi-source models in predicting scenarios where either a single data source is missing or only a single source is available. Our analysis reveals that the efficacy of these models is intricately tied to the nature of the task, the complementarity among data sources, and the model design. Surprisingly, we observe instances where the removal of certain data sources leads to improved predictive performance, challenging the assumption that incorporating all available data is always beneficial. These findings prompt critical reflections on model complexity and the necessity of all collected data sources, potentially shaping the way for more streamlined approaches in EO applications.
Abstract:Diabetic retinopathy is a leading cause of blindness in diabetic patients and early detection plays a crucial role in preventing vision loss. Traditional diagnostic methods are often time-consuming and prone to errors. The emergence of deep learning techniques has provided innovative solutions to improve diagnostic efficiency. However, single deep learning models frequently face issues related to extracting key features from complex retinal images. To handle this problem, we present an effective ensemble method for DR diagnosis comprising four main phases: image pre-processing, selection of backbone pre-trained models, feature enhancement, and optimization. Our methodology initiates with the pre-processing phase, where we apply CLAHE to enhance image contrast and Gamma correction is then used to adjust the brightness for better feature recognition. We then apply Discrete Wavelet Transform (DWT) for image fusion by combining multi-resolution details to create a richer dataset. Then, we selected three pre-trained models with the best performance named DenseNet169, MobileNetV1, and Xception for diverse feature extraction. To further improve feature extraction, an improved residual block is integrated into each model. Finally, the predictions from these base models are then aggregated using weighted ensemble approach, with the weights optimized by using Salp Swarm Algorithm (SSA).SSA intelligently explores the weight space and finds the optimal configuration of base architectures to maximize the performance of the ensemble model. The proposed model is evaluated on the multiclass Kaggle APTOS 2019 dataset and obtained 88.52% accuracy.
Abstract:Transformers, particularly Vision Transformers (ViTs), have achieved state-of-the-art performance in large-scale image classification. However, they often require large amounts of data and can exhibit biases that limit their robustness and generalizability. This paper introduces ForAug, a novel data augmentation scheme that addresses these challenges and explicitly includes inductive biases, which commonly are part of the neural network architecture, into the training data. ForAug is constructed by using pretrained foundation models to separate and recombine foreground objects with different backgrounds, enabling fine-grained control over image composition during training. It thus increases the data diversity and effective number of training samples. We demonstrate that training on ForNet, the application of ForAug to ImageNet, significantly improves the accuracy of ViTs and other architectures by up to 4.5 percentage points (p.p.) on ImageNet and 7.3 p.p. on downstream tasks. Importantly, ForAug enables novel ways of analyzing model behavior and quantifying biases. Namely, we introduce metrics for background robustness, foreground focus, center bias, and size bias and show that training on ForNet substantially reduces these biases compared to training on ImageNet. In summary, ForAug provides a valuable tool for analyzing and mitigating biases, enabling the development of more robust and reliable computer vision models. Our code and dataset are publicly available at https://github.com/tobna/ForAug.
Abstract:Recent advancements in text-to-image (T2I) have improved synthesis results, but challenges remain in layout control and generating omnidirectional panoramic images. Dense T2I (DT2I) and spherical T2I (ST2I) models address these issues, but so far no unified approach exists. Trivial approaches, like prompting a DT2I model to generate panoramas can not generate proper spherical distortions and seamless transitions at the borders. Our work shows that spherical dense text-to-image (SDT2I) can be achieved by integrating training-free DT2I approaches into finetuned panorama models. Specifically, we propose MultiStitchDiffusion (MSTD) and MultiPanFusion (MPF) by integrating MultiDiffusion into StitchDiffusion and PanFusion, respectively. Since no benchmark for SDT2I exists, we further construct Dense-Synthetic-View (DSynView), a new synthetic dataset containing spherical layouts to evaluate our models. Our results show that MSTD outperforms MPF across image quality as well as prompt- and layout adherence. MultiPanFusion generates more diverse images but struggles to synthesize flawless foreground objects. We propose bootstrap-coupling and turning off equirectangular perspective-projection attention in the foreground as an improvement of MPF.
Abstract:Human-in-the-loop (HITL) frameworks are increasingly recognized for their potential to improve annotation accuracy in emotion estimation systems by combining machine predictions with human expertise. This study focuses on integrating a high-performing image-based emotion model into a HITL annotation framework to evaluate the collaborative potential of human-machine interaction and identify the psychological and practical factors critical to successful collaboration. Specifically, we investigate how varying model reliability and cognitive framing influence human trust, cognitive load, and annotation behavior in HITL systems. We demonstrate that model reliability and psychological framing significantly impact annotators' trust, engagement, and consistency, offering insights into optimizing HITL frameworks. Through three experimental scenarios with 29 participants--baseline model reliability (S1), fabricated errors (S2), and cognitive bias introduced by negative framing (S3)--we analyzed behavioral and qualitative data. Reliable predictions in S1 yielded high trust and annotation consistency, while unreliable outputs in S2 led to increased critical evaluations but also heightened frustration and response variability. Negative framing in S3 revealed how cognitive bias influenced participants to perceive the model as more relatable and accurate, despite misinformation regarding its reliability. These findings highlight the importance of both reliable machine outputs and psychological factors in shaping effective human-machine collaboration. By leveraging the strengths of both human oversight and automated systems, this study establishes a scalable HITL framework for emotion annotation and lays the foundation for broader applications in adaptive learning and human-computer interaction.
Abstract:Dataset distillation is the concept of condensing large datasets into smaller but highly representative synthetic samples. While previous research has primarily focused on image classification, its application to image Super-Resolution (SR) remains underexplored. This exploratory work studies multiple dataset distillation techniques applied to SR, including pixel- and latent-space approaches under different aspects. Our experiments demonstrate that a 91.12% dataset size reduction can be achieved while maintaining comparable SR performance to the full dataset. We further analyze initialization strategies and distillation methods to optimize memory efficiency and computational costs. Our findings provide new insights into dataset distillation for SR and set the stage for future advancements.
Abstract:Satellite imagery is a cornerstone for numerous Remote Sensing (RS) applications; however, limited spatial resolution frequently hinders the precision of such systems, especially in multi-label scene classification tasks as it requires a higher level of detail and feature differentiation. In this study, we explore the efficacy of image Super-Resolution (SR) as a pre-processing step to enhance the quality of satellite images and thus improve downstream classification performance. We investigate four SR models - SRResNet, HAT, SeeSR, and RealESRGAN - and evaluate their impact on multi-label scene classification across various CNN architectures, including ResNet-50, ResNet-101, ResNet-152, and Inception-v4. Our results show that applying SR significantly improves downstream classification performance across various metrics, demonstrating its ability to preserve spatial details critical for multi-label tasks. Overall, this work offers valuable insights into the selection of SR techniques for multi-label prediction in remote sensing and presents an easy-to-integrate framework to improve existing RS systems.
Abstract:Multi-view learning (MVL) leverages multiple sources or views of data to enhance machine learning model performance and robustness. This approach has been successfully used in the Earth Observation (EO) domain, where views have a heterogeneous nature and can be affected by missing data. Despite the negative effect that missing data has on model predictions, the ML literature has used it as an augmentation technique to improve model generalization, like masking the input data. Inspired by this, we introduce novel methods for EO applications tailored to MVL with missing views. Our methods integrate the combination of a set to simulate all combinations of missing views as different training samples. Instead of replacing missing data with a numerical value, we use dynamic merge functions, like average, and more complex ones like Transformer. This allows the MVL model to entirely ignore the missing views, enhancing its predictive robustness. We experiment on four EO datasets with temporal and static views, including state-of-the-art methods from the EO domain. The results indicate that our methods improve model robustness under conditions of moderate missingness, and improve the predictive performance when all views are present. The proposed methods offer a single adaptive solution to operate effectively with any combination of available views.
Abstract:In response to climate change, assessing crop productivity under extreme weather conditions is essential to enhance food security. Crop simulation models, which align with physical processes, offer explainability but often perform poorly. Conversely, machine learning (ML) models for crop modeling are powerful and scalable yet operate as black boxes and lack adherence to crop growths physical principles. To bridge this gap, we propose a novel method that combines the strengths of both approaches by estimating the water use and the crop sensitivity to water scarcity at the pixel level. This approach enables yield loss estimation grounded in physical principles by sequentially solving the equation for crop yield response to water scarcity, using an enhanced loss function. Leveraging Sentinel-2 satellite imagery, climate data, simulated water use data, and pixel-level yield data, our model demonstrates high accuracy, achieving an R2 of up to 0.77, matching or surpassing state-of-the-art models like RNNs and Transformers. Additionally, it provides interpretable and physical consistent outputs, supporting industry, policymakers, and farmers in adapting to extreme weather conditions.
Abstract:Document classification is considered a critical element in automated document processing systems. In recent years multi-modal approaches have become increasingly popular for document classification. Despite their improvements, these approaches are underutilized in the industry due to their requirement for a tremendous volume of training data and extensive computational power. In this paper, we attempt to address these issues by embedding textual features directly into the visual space, allowing lightweight image-based classifiers to achieve state-of-the-art results using small-scale datasets in document classification. To evaluate the efficacy of the visual features generated from our approach on limited data, we tested on the standard dataset Tobacco-3482. Our experiments show a tremendous improvement in image-based classifiers, achieving an improvement of 4.64% using ResNet50 with no document pre-training. It also sets a new record for the best accuracy of the Tobacco-3482 dataset with a score of 91.14% using the image-based DocXClassifier with no document pre-training. The simplicity of the approach, its resource requirements, and subsequent results provide a good prospect for its use in industrial use cases.