Abstract:We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code is open-sourced under a permissive license.
Abstract:Performing accurate confidence quantification and assessment is important for deep neural networks to predict their failures, improve their performance and enhance their capabilities in real-world applications, for their practical deployment in real life. For pixel-wise regression tasks, confidence quantification and assessment has not been well addressed in the literature, in contrast to classification tasks like semantic segmentation. The softmax output layer is not used in deep neural networks that solve pixel-wise regression problems. In this paper, to address these problems, we develop, train and evaluate the proposed model Confidence-Aware Regression Estimation (CARE). Our model CARE computes and assigns confidence to regression output results. We focus on solving regression problems as downstream tasks of an AI Foundation Model for Earth Observation (EO). We evaluate the proposed model CARE and experimental results on data from the Copernicus Sentinel-2 satellite constellation for estimating the density of buildings show that the proposed method can be successfully applied to regression problems. We also show that our approach outperforms other methods.
Abstract:When we are primarily interested in solving several problems jointly with a given prescribed high performance accuracy for each target application, then Foundation Models should for most cases be used rather than problem-specific models. We focus on the specific Computer Vision application of Foundation Models for Earth Observation (EO) and geospatial AI. These models can solve important problems we are tackling, including for example land cover classification, crop type mapping, flood segmentation, building density estimation, and road regression segmentation. In this paper, we show that for a limited number of labelled data, Foundation Models achieve improved performance compared to problem-specific models. In this work, we also present our proposed evaluation benchmark for Foundation Models for EO. Benchmarking the generalization performance of Foundation Models is important as it has become difficult to standardize a fair comparison across the many different models that have been proposed recently. We present the results using our evaluation benchmark for EO Foundation Models and show that Foundation Models are label efficient in the downstream tasks and help us solve problems we are tackling in EO and remote sensing.