Abstract:We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On the token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on the pixel level, it leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early-fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves performance beyond the state of the art on community-standard EO benchmarks such as PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.
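As a rough illustration only: the abstract does not specify TerraMind's interfaces, so the sketch below uses purely hypothetical stand-in modules to convey the "Thinking-in-Modalities" idea of first synthesizing an auxiliary modality and then conditioning the final prediction on both observed and generated inputs.

```python
# Illustrative sketch only: the real TerraMind interfaces are not described in the
# abstract, so every class and function name here is a hypothetical stand-in.
import torch
import torch.nn as nn

class DummyGenerator(nn.Module):
    """Stand-in for a cross-modal generator (e.g., S-2 reflectance -> land-cover features)."""
    def __init__(self, channels_out: int):
        super().__init__()
        self.proj = nn.Conv2d(4, channels_out, kernel_size=1)
    def forward(self, x):
        return self.proj(x)

class DummySegmenter(nn.Module):
    """Stand-in for the downstream head that consumes observed + generated inputs."""
    def __init__(self, channels_in: int, num_classes: int):
        super().__init__()
        self.head = nn.Conv2d(channels_in, num_classes, kernel_size=1)
    def forward(self, x):
        return self.head(x)

def thinking_in_modalities(s2_tile: torch.Tensor) -> torch.Tensor:
    """TiM idea in miniature: synthesize an auxiliary modality first, then make
    the final prediction from the observed and the generated data together."""
    generator = DummyGenerator(channels_out=8)
    segmenter = DummySegmenter(channels_in=4 + 8, num_classes=10)
    with torch.no_grad():
        generated_lulc = generator(s2_tile)            # intermediate "thought" modality
        fused = torch.cat([s2_tile, generated_lulc], dim=1)
        return segmenter(fused)                        # final prediction uses both

logits = thinking_in_modalities(torch.randn(1, 4, 64, 64))
print(logits.shape)  # torch.Size([1, 10, 64, 64])
```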
Abstract:Large-scale foundation models in Earth Observation can learn versatile, label-efficient representations by leveraging massive amounts of unlabeled data. However, existing public datasets are often limited in scale, geographic coverage, or sensor variety. We introduce TerraMesh, a new, globally diverse multimodal dataset combining optical, synthetic aperture radar, elevation, and land-cover modalities in an Analysis-Ready Data format. TerraMesh includes over 9 million samples with eight spatiotemporally aligned modalities, enabling large-scale pre-training and fostering robust cross-modal correlation learning. We provide detailed data processing steps, comprehensive statistics, and empirical evidence demonstrating improved model performance when pre-training on TerraMesh. The dataset will be made publicly available under a permissive license.
Abstract:Super-resolution aims at reconstructing high-resolution images from low-resolution observations. State-of-the-art approaches underpinned by deep learning achieve outstanding results, generating images of high perceptual quality. However, it often remains unclear whether the reconstructed details are close to the actual ground-truth information and whether they constitute a more valuable source for image analysis algorithms. In this work, we address the latter problem and present our efforts toward learning super-resolution algorithms in a task-driven way, so that they generate high-resolution images that can be exploited for automated image analysis. In this initial research, we propose a methodological approach for assessing whether existing models that perform computer vision tasks can be used to evaluate super-resolution reconstruction algorithms, as well as to train them in a task-driven way. We support our analysis with an experimental study, and we expect it to establish a solid foundation for selecting appropriate computer vision tasks that will advance the capabilities of real-world super-resolution.
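The paper's concrete models and loss weighting are not given in the abstract; the following is a minimal sketch, under those stated assumptions, of what a task-driven super-resolution objective can look like: a reconstruction term plus the loss of a frozen downstream task model applied to the super-resolved output.

```python
# Hedged illustration of a task-driven SR objective; the actual networks and loss
# weights used in the paper are not specified, so everything below is a placeholder.
import torch
import torch.nn as nn
import torch.nn.functional as F

sr_net = nn.Sequential(                      # toy SR network (x2 upsampling)
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(16, 3, 3, padding=1),
)
task_net = nn.Conv2d(3, 5, 1)                # frozen stand-in for a CV task model
for p in task_net.parameters():
    p.requires_grad = False

def task_driven_loss(lr, hr, labels, alpha=0.5):
    """Pixel reconstruction loss + loss of the downstream task on the SR output."""
    sr = sr_net(lr)
    recon = F.l1_loss(sr, hr)
    task = F.cross_entropy(task_net(sr), labels)
    return recon + alpha * task

lr = torch.randn(2, 3, 32, 32)
hr = torch.randn(2, 3, 64, 64)
labels = torch.randint(0, 5, (2, 64, 64))    # toy per-pixel task labels
loss = task_driven_loss(lr, hr, labels)
loss.backward()                              # gradients flow only into sr_net
```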
Abstract:Accurate confidence quantification and assessment is important for deep neural networks to predict their failures, improve their performance, and enable their practical deployment in real-world applications. For pixel-wise regression tasks, confidence quantification and assessment have not been well addressed in the literature, in contrast to classification tasks such as semantic segmentation, because deep neural networks that solve pixel-wise regression problems do not use a softmax output layer. In this paper, to address these problems, we develop, train, and evaluate the proposed model Confidence-Aware Regression Estimation (CARE), which computes and assigns confidence to regression output results. We focus on solving regression problems as downstream tasks of an AI Foundation Model for Earth Observation (EO). Experimental results on data from the Copernicus Sentinel-2 satellite constellation for estimating the density of buildings show that the proposed method can be successfully applied to regression problems and outperforms other methods.
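The abstract does not detail how CARE computes confidence, so the snippet below is only a generic sketch of confidence-aware pixel-wise regression: a head that predicts a per-pixel mean and log-variance, trained with a Gaussian negative log-likelihood, with a (hypothetical) confidence proxy taken as the inverse of the predicted variance.

```python
# Generic sketch of confidence-aware pixel-wise regression; this is NOT the CARE
# method itself, only an illustration of the problem setting described above.
import torch
import torch.nn as nn

class RegressionWithConfidence(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        self.mean_head = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.logvar_head = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, features):
        mean = self.mean_head(features)        # e.g., building density per pixel
        log_var = self.logvar_head(features)   # higher value -> lower confidence
        return mean, log_var

def gaussian_nll(mean, log_var, target):
    # Standard heteroscedastic regression loss.
    return (0.5 * (torch.exp(-log_var) * (target - mean) ** 2 + log_var)).mean()

model = RegressionWithConfidence(in_channels=64)
feats = torch.randn(2, 64, 32, 32)             # features from an EO backbone (toy data)
target = torch.rand(2, 1, 32, 32)
mean, log_var = model(feats)
loss = gaussian_nll(mean, log_var, target)
confidence = torch.exp(-log_var)               # simple per-pixel confidence proxy
```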
Abstract:Estimating the construction year of buildings is of great importance for sustainability. Sustainable buildings minimize energy consumption and are a key part of responsible, sustainable urban planning and development to effectively combat climate change. Using Artificial Intelligence (AI) and recently proposed Transformer models, we are able to estimate the construction epoch of buildings from a multi-modal dataset. In this paper, we introduce a new benchmark multi-modal dataset, the Map your City Dataset (MyCD), containing top-view Very High Resolution (VHR) images, Earth Observation (EO) multi-spectral data from the Copernicus Sentinel-2 satellite constellation, and street-view images in many different cities in Europe, co-localized with respect to the building under study and labelled with its construction epoch. We assess EO generalization performance on new, previously unseen cities that are held out from training and appear only during inference. We also present the community-based data challenge we organized based on MyCD: the ESA AI4EO Challenge MapYourCity, which was open for four months in 2024. Here, we present the top-4 performing models and the main evaluation results. During inference, we examine the performance of the models using either all three input modalities or only the two top-view modalities, i.e. without the street-view images. The evaluation results show that the models are effective and achieve good performance on this difficult real-world task of estimating the age of buildings, even on previously unseen cities and even when using only the two top-view modalities (i.e. VHR and Sentinel-2) during inference.
Abstract:Satellite-based onboard data processing is crucial for time-sensitive applications requiring rapid response. Advances in edge artificial intelligence are shifting computational power from ground-based centers to on-orbit platforms, transforming the "sensing-communication-decision-feedback" cycle and reducing latency from acquisition to delivery. The current research presents a framework addressing the strict bandwidth, energy, and latency constraints of small satellites, focusing on maritime monitoring. The study contributes three main innovations. Firstly, it investigates the application of deep learning techniques for direct ship detection and classification from raw satellite imagery. By simplifying the onboard processing chain, our approach enables direct analysis without computationally intensive steps such as calibration and ortho-rectification. Secondly, to address the scarcity of raw satellite data, we introduce two novel datasets, VDS2Raw and VDV2Raw, derived from raw Sentinel-2 and Vegetation and Environment Monitoring New Micro Satellite (VENuS) data, respectively, and enriched with Automatic Identification System (AIS) records. Thirdly, we characterize the optimal single- and multi-band spectral combinations for these tasks through statistical and feature-based analyses validated on both datasets. In sum, we demonstrate the feasibility of the proposed method through a proof of concept on CubeSat-like hardware, confirming the models' potential for operational satellite-based maritime monitoring.
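The statistical and feature-based band analyses are not specified in the abstract; as a hedged sketch of one common approach, the snippet below ranks spectral bands by mutual information with a ship/no-ship label on toy data and keeps the top-scoring bands.

```python
# Minimal sketch of spectral band ranking; the paper's actual analysis pipeline is
# not restated here, and the data below are synthetic placeholders.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n_samples, n_bands = 1000, 13                  # e.g., 13 Sentinel-2 bands
X = rng.normal(size=(n_samples, n_bands))      # per-sample mean band values (toy data)
y = rng.integers(0, 2, size=n_samples)         # 1 = ship present, 0 = background

scores = mutual_info_classif(X, y, random_state=0)
top_k = 3
best_bands = np.argsort(scores)[::-1][:top_k]  # highest-scoring band indices
print("Top bands by mutual information:", best_bands, scores[best_bands])
```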
Abstract:Floods are among the most common and devastating natural hazards, imposing immense costs on society and the economy due to their disastrous consequences. Recent progress in weather prediction and spaceborne flood mapping has demonstrated the feasibility of anticipating extreme events and of reliably detecting their catastrophic effects afterwards. However, these efforts are rarely linked to one another, and there is a critical lack of datasets and benchmarks to enable the direct forecasting of flood extent. To resolve this issue, we curate a novel dataset enabling timely prediction of flood extent. Furthermore, we provide a representative evaluation of state-of-the-art methods, structured into two benchmark tracks for forecasting flood inundation maps: i) in general and ii) focused on coastal regions. Altogether, our dataset and benchmark provide a comprehensive platform for evaluating flood forecasts, enabling future solutions for this critical challenge. Data, code & models are shared at https://github.com/Multihuntr/GFF under a CC0 license.
Abstract:In this paper, we address two critical challenges in the domain of flood detection: the computational expense of large-scale time series change detection and the lack of interpretable decision-making processes in explainable AI (XAI). To overcome these challenges, we propose IMAFD, an interpretable multi-stage approach to flood detection. It provides an automatic, efficient, and interpretable solution suitable for large-scale remote sensing tasks and offers insight into the decision-making process. The proposed IMAFD approach combines the analysis of dynamic time series image sequences, to identify images with possible flooding, with static, within-image semantic segmentation; that is, it combines anomaly detection (at both the image and pixel level) with semantic segmentation. The flood detection problem is addressed in four stages: (1) at the sequence level, identifying suspected images; (2) at the multi-image level, detecting change within suspected images; (3) at the image level, semantically segmenting images into the Land, Water, or Cloud class; and (4) decision making. Our contributions are twofold. First, we efficiently reduce the number of frames to be processed for dense change detection by providing a multi-stage holistic approach to flood detection. Second, the proposed semantic change detection method (stage 3) provides human users with an interpretable decision-making process, whereas most explainable AI (XAI) methods provide only post hoc explanations. The proposed IMAFD framework was evaluated on three datasets: WorldFloods, RavAEn, and MediaEval. On all of these datasets, the framework demonstrates competitive performance compared to other methods while also offering interpretability and insight.
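The concrete detectors and thresholds used by IMAFD are not given in the abstract; the sketch below only illustrates the four-stage control flow described above, with placeholder callables for each stage.

```python
# Schematic of the four-stage filtering idea; every callable here is a placeholder
# standing in for a component that the abstract names but does not specify.
from typing import Callable, List, Optional
import numpy as np

def multi_stage_flood_detection(
    sequence: List[np.ndarray],
    sequence_anomaly: Callable[[List[np.ndarray]], List[int]],   # stage 1: suspected frames
    change_detector: Callable[[np.ndarray, np.ndarray], bool],   # stage 2: change within them
    segmenter: Callable[[np.ndarray], np.ndarray],               # stage 3: Land/Water/Cloud mask
    decide: Callable[[np.ndarray], bool],                        # stage 4: decision making
) -> Optional[np.ndarray]:
    suspected = sequence_anomaly(sequence)                       # cheap screening first
    for idx in suspected:
        if idx == 0 or not change_detector(sequence[idx - 1], sequence[idx]):
            continue                                             # skip unchanged frames
        mask = segmenter(sequence[idx])                          # expensive step only here
        if decide(mask):
            return mask                                          # flood map for this frame
    return None

# Toy usage with dummy components, just to exercise the control flow.
dummy_seq = [np.zeros((4, 4)) for _ in range(3)]
result = multi_stage_flood_detection(
    dummy_seq,
    sequence_anomaly=lambda seq: [2],
    change_detector=lambda a, b: True,
    segmenter=lambda img: np.ones_like(img),
    decide=lambda mask: mask.mean() > 0.5,
)
```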
Abstract:Remote sensing (RS) applications in the space domain demand machine learning (ML) models that are reliable, robust, and quality-assured, making red teaming a vital approach for identifying and exposing potential flaws and biases. Since both fields advance independently, there is a notable gap in integrating red teaming strategies into RS. This paper introduces a methodology for examining ML models operating on hyperspectral images within the HYPERVIEW challenge, focusing on soil parameter estimation. We use post-hoc explanation methods from the Explainable AI (XAI) domain to critically assess the best-performing model that won the HYPERVIEW challenge and served as an inspiration for the model deployed on board the INTUITION-1 hyperspectral mission. Our approach effectively red teams the model by pinpointing and validating key shortcomings, and by constructing a model that achieves comparable performance using just 1% of the input features, with at most a 5% performance loss. Additionally, we propose a novel way of visualizing explanations that integrates domain-specific information about hyperspectral bands (wavelengths) and data transformations, to better support interpreting models for hyperspectral image analysis.
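The exact attribution method and model from the HYPERVIEW study are not restated here; the toy sketch below mirrors only the reported workflow: score feature importance, keep roughly 1% of the input features, and check how much performance degrades.

```python
# Hedged sketch of feature-reduction-based red teaming on synthetic data; the real
# study's model, explanation method, and metrics are not reproduced here.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2000))                          # toy hyperspectral-like features
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=500)     # only a few bands matter
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
keep = np.argsort(full.feature_importances_)[::-1][: X.shape[1] // 100]   # ~1% of features

small = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr[:, keep], y_tr)
print("full MSE :", mean_squared_error(y_te, full.predict(X_te)))
print("small MSE:", mean_squared_error(y_te, small.predict(X_te[:, keep])))
```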
Abstract:Artificial intelligence onboard satellites has the potential to reduce data transmission requirements, enable real-time decision-making, and support collaboration within constellations. This study deploys a lightweight foundational model called RaVAEn on D-Orbit's ION SCV004 satellite. RaVAEn is a variational auto-encoder (VAE) that generates compressed latent vectors from small image tiles, enabling several downstream tasks. In this work, we demonstrate the reliable use of RaVAEn onboard a satellite, achieving an encoding time of 0.110 s for tiles covering a 4.8x4.8 km$^2$ area. In addition, we showcase fast few-shot training onboard the satellite using the latent representation of the data. We compare deployment of the model on the on-board CPU and on the available Myriad vision processing unit (VPU) accelerator. To our knowledge, this work is the first to demonstrate the deployment of a multi-task model on board a CubeSat and the on-board training of a machine learning model.
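RaVAEn's actual architecture is not reproduced here; the toy sketch below only illustrates the workflow from the abstract: encode image tiles into compact latent vectors, then fit a lightweight few-shot classifier on those latents.

```python
# Toy illustration of the encode-then-few-shot workflow; the encoder below is a
# stand-in, not the RaVAEn model, and all shapes and labels are synthetic.
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

class TinyVAEEncoder(nn.Module):
    def __init__(self, latent_dim: int = 16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.mu = nn.Linear(16, latent_dim)        # use the mean as the compressed code

    def forward(self, x):
        return self.mu(self.conv(x))

encoder = TinyVAEEncoder()
tiles = torch.randn(20, 4, 32, 32)                 # e.g., 20 small multispectral tiles
labels = torch.randint(0, 2, (20,))                # few-shot labels (e.g., change / no change)

with torch.no_grad():
    latents = encoder(tiles).numpy()               # compact latent vectors per tile

clf = LogisticRegression(max_iter=200).fit(latents, labels.numpy())
print("train accuracy:", clf.score(latents, labels.numpy()))
```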