CEDRIC - VERTIGO, CNAM
Abstract: Geospatial Foundation Models (GFMs) have emerged as powerful tools for extracting representations from Earth observation data, but their evaluation remains inconsistent and narrow. Existing works often evaluate on suboptimal downstream datasets and tasks that are too easy or too narrow, which limits how well these evaluations reflect the real-world applicability of GFMs. Additionally, current evaluation protocols lack diversity: they fail to account for the multiplicity of image resolutions, sensor types, and temporalities, which further complicates the assessment of GFM performance. In particular, most existing benchmarks are geographically biased towards North America and Europe, which calls into question the global applicability of GFMs. To overcome these challenges, we introduce PANGAEA, a standardized evaluation protocol that covers a diverse set of datasets, tasks, resolutions, sensor modalities, and temporalities. It establishes a robust and widely applicable benchmark for GFMs. We evaluate the most popular openly available GFMs on this benchmark and analyze their performance across several domains. In particular, we compare these models to supervised baselines (e.g., UNet and vanilla ViT) and assess their effectiveness when labeled data is limited. Our findings highlight the limitations of GFMs under different scenarios, showing that they do not consistently outperform supervised models. PANGAEA is designed to be highly extensible, allowing for the seamless inclusion of new datasets, models, and tasks in future research. By releasing the evaluation code and benchmark, we aim to enable other researchers to replicate our experiments and build upon our work, fostering a more principled evaluation protocol for large pre-trained geospatial models. The code is available at https://github.com/VMarsocci/pangaea-bench.
Abstract: Large-scale "foundation models" have gained traction as a way to leverage the vast amounts of unlabeled remote sensing data collected every day. However, due to the multiplicity of Earth Observation satellites, these models should learn "sensor-agnostic" representations that generalize across sensor characteristics with minimal fine-tuning. This is complicated by data availability: low-resolution imagery, such as Sentinel-2 and Landsat-8 data, is available in large amounts, while very high-resolution aerial or satellite data is less common. To tackle these challenges, we introduce cross-sensor self-supervised training and alignment for remote sensing (X-STARS). We design a self-supervised training loss, the Multi-Sensor Alignment Dense loss (MSAD), to align representations across sensors, even with vastly different resolutions. X-STARS can be applied to train models from scratch, or to adapt large models pretrained on, e.g., low-resolution EO data to new high-resolution sensors in a continual pretraining framework. We collect and release MSC-France, a new multi-sensor dataset, on which we train our X-STARS models, which we then evaluate on seven downstream classification and segmentation tasks. We demonstrate that X-STARS outperforms the state of the art by a significant margin with less data across various conditions of data availability and resolutions.
Abstract: This article summarizes principles and ideas from the emerging area of applying conditional computation methods to the design of neural networks. In particular, we focus on neural networks that can dynamically activate or de-activate parts of their computational graph conditionally on their input. Examples include the dynamic selection of, e.g., input tokens, layers (or sets of layers), and sub-modules inside each layer (e.g., channels in a convolutional filter). We first provide a general formalism to describe these techniques in a uniform way. Then, we introduce three notable implementations of these principles: mixture-of-experts (MoE) networks, token selection mechanisms, and early-exit neural networks. The paper aims to provide a tutorial-like introduction to this growing field. To this end, we analyze the benefits of these modular designs in terms of efficiency, explainability, and transfer learning, with a focus on emerging application areas ranging from automated scientific discovery to semantic communication.
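As an illustration of the input-conditional activation described in the abstract above, the following is a minimal sketch of generic top-k mixture-of-experts routing in plain Python. It is not the formalism of the article: the function name `top_k_moe`, the linear gate `gate_weights`, and the toy experts are hypothetical choices made only for this example, and the experts are assumed to return vectors of the same length as their input.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_moe(
    x: Sequence[float],
    gate_weights: Sequence[Sequence[float]],  # hypothetical linear gate, one row per expert
    experts: Sequence[Expert],
    k: int = 1,
) -> Tuple[Vector, List[int]]:
    """Route x to the k highest-scoring experts and mix their outputs.

    Only the selected experts are evaluated, which is the source of the
    computational saving in conditional computation.
    """
    # One gate logit per expert: dot product between the gate row and the input.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(logits)

    # Indices of the k experts with the largest gate probability.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]

    # Renormalise the gate probabilities over the selected experts only.
    norm = sum(probs[i] for i in chosen)
    output = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)  # experts outside `chosen` are never called
        for d in range(len(output)):
            output[d] += (probs[i] / norm) * y[d]
    return output, chosen


if __name__ == "__main__":
    # Two toy experts: one doubles the input, the other negates it.
    toy_experts = [lambda v: [2.0 * a for a in v], lambda v: [-a for a in v]]
    gate = [[1.0, 0.0], [0.0, 1.0]]
    print(top_k_moe([3.0, 1.0], gate, toy_experts, k=1))
```

With `k=1` only one expert runs per input, so the cost is independent of the number of experts; larger `k` trades compute for a smoother mixture.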
Abstract: Land cover maps are a pivotal element in a wide range of Earth Observation (EO) applications. However, annotating large datasets to develop supervised systems for remote sensing (RS) semantic segmentation is costly and time-consuming. Unsupervised Domain Adaptation (UDA) could tackle these issues by adapting a model trained on a source domain, where labels are available, to a target domain without annotations. UDA, while gaining importance in computer vision, is still under-investigated in RS. Thus, we propose a new lightweight model, GeoMultiTaskNet, based on two contributions: a GeoMultiTask module (GeoMT), which utilizes geographical coordinates to align the source and target domains, and a Dynamic Class Sampling (DCS) strategy, which adapts the semantic segmentation loss to the frequency of classes. This approach is the first to use geographical metadata for UDA in semantic segmentation. It reaches state-of-the-art performance (47.22% mIoU) while reducing the number of parameters (33M) on a subset of the FLAIR dataset, a recently proposed dataset well suited to RS UDA and used here for the first time for research purposes.
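For readers unfamiliar with the metric reported in the abstract above, the following is a minimal sketch of how mean Intersection over Union (mIoU) is commonly computed from per-pixel class labels. The function name `mean_iou` is hypothetical, and the choice to skip classes that appear in neither the ground truth nor the prediction is an assumption of this sketch; the evaluation code of the paper may handle such classes differently.

```python
from typing import List, Sequence


def mean_iou(y_true: Sequence[int], y_pred: Sequence[int], num_classes: int) -> float:
    """Mean IoU over classes, from flattened per-pixel integer labels."""
    # Confusion matrix: rows are true classes, columns are predicted classes.
    conf: List[List[int]] = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        conf[t][p] += 1

    ious: List[float] = []
    for c in range(num_classes):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp
        fp = sum(conf[r][c] for r in range(num_classes)) - tp
        denom = tp + fp + fn
        # Assumption: a class absent from both labels and predictions is skipped.
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Class 0: IoU = 1/2, class 1: IoU = 2/3, so the mean is 7/12.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```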
Abstract: Change detection (CD) is one of the most active research areas in Remote Sensing (RS). Most recently developed change detection methods are based on deep learning (DL) algorithms. These algorithms generally focus on generating two-dimensional (2D) change maps, thus identifying only planimetric changes in land use/land cover (LULC) and neither considering nor returning any information on the corresponding elevation changes. Our work goes one step further, proposing two novel networks able to solve the 2D and 3D CD tasks simultaneously, and the 3DCD dataset, a novel and freely available dataset designed precisely for this multitask setting. In particular, the aim of this work is to lay the foundations for the development of DL algorithms able to automatically infer an elevation (3D) CD map, together with a standard 2D CD map, starting only from a pair of bitemporal optical images. The proposed architectures are a transformer-based network, the MultiTask Bitemporal Images Transformer (MTBIT), and a deep convolutional network, the Siamese ResUNet (SUNet). MTBIT is built on a semantic tokenizer. SUNet instead combines skip connections and residual layers in a siamese encoder to learn rich features capable of solving the proposed task efficiently. These models are thus able to obtain 3D CD maps from two optical images taken at different times, without relying directly on elevation data at inference. Encouraging results obtained on the novel 3DCD dataset are shown. The code and the 3DCD dataset are available at https://sites.google.com/uniroma1.it/3dchangedetection/home-page.
Abstract: In the field of Earth Observation (EO), Continual Learning (CL) algorithms have been proposed to deal with large datasets by decomposing them into several subsets and processing them incrementally. The majority of these algorithms assume that data (a) comes from a single source and (b) is fully labeled. Real-world EO datasets are instead highly heterogeneous (e.g., coming from aerial, satellite, or drone scenarios) and mostly unlabeled, meaning they can be fully exploited only through the emerging Self-Supervised Learning (SSL) paradigm. For these reasons, in this paper we propose a new algorithm that merges SSL and CL for remote sensing applications, which we call Continual Barlow Twins (CBT). It combines the advantages of one of the simplest self-supervision techniques, Barlow Twins, with the Elastic Weight Consolidation method to avoid catastrophic forgetting. In addition, for the first time we evaluate SSL methods on a highly heterogeneous EO dataset, showing the effectiveness of these strategies on a novel combination of three datasets from almost non-overlapping domains (the airborne Potsdam dataset, the satellite US3D dataset, and the drone UAVid dataset) on a crucial downstream task in EO, semantic segmentation. Encouraging results show the superiority of SSL in this setting, and the effectiveness of incrementally building an effective pretrained feature extractor, based on ResNet50, without requiring all the data to be available at once, yielding a valuable saving of time and resources.
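The two ingredients named in the abstract above have standard published forms, sketched here for reference. The first two lines give the Barlow Twins objective on batch-centered embeddings $z^A, z^B$ of two augmented views, and the third gives the Elastic Weight Consolidation penalty around the parameters $\theta^*$ learned on the previous data subset, weighted by the diagonal Fisher information $F_k$. The last line, which simply sums the two terms with weights $\lambda$ and $\gamma$, is an assumption about how they could be combined and is not taken from the paper.

```latex
\begin{align}
C_{ij} &= \frac{\sum_b z^A_{b,i}\, z^B_{b,j}}
               {\sqrt{\sum_b \bigl(z^A_{b,i}\bigr)^2}\,\sqrt{\sum_b \bigl(z^B_{b,j}\bigr)^2}} \\
\mathcal{L}_{\mathrm{BT}} &= \sum_i \bigl(1 - C_{ii}\bigr)^2
               + \lambda \sum_i \sum_{j \neq i} C_{ij}^2 \\
\mathcal{L}_{\mathrm{EWC}} &= \frac{\gamma}{2} \sum_k F_k \bigl(\theta_k - \theta^*_k\bigr)^2 \\
\mathcal{L}_{\mathrm{total}} &= \mathcal{L}_{\mathrm{BT}} + \mathcal{L}_{\mathrm{EWC}}
               \quad \text{(assumed combination)}
\end{align}
```

Here $b$ indexes samples in the batch, $i, j$ index embedding dimensions, and $k$ indexes network parameters. The first term of $\mathcal{L}_{\mathrm{BT}}$ pulls the two views together, the second decorrelates embedding dimensions, and $\mathcal{L}_{\mathrm{EWC}}$ discourages drift in parameters that were important for earlier data.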