Abstract:Using images acquired by different satellite sensors has shown to improve classification performance in the framework of crop mapping from satellite image time series (SITS). Existing state-of-the-art architectures use self-attention mechanisms to process the temporal dimension and convolutions for the spatial dimension of SITS. Motivated by the success of purely attention-based architectures in crop mapping from single-modal SITS, we introduce several multi-modal multi-temporal transformer-based architectures. Specifically, we investigate the effectiveness of Early Fusion, Cross Attention Fusion and Synchronized Class Token Fusion within the Temporo-Spatial Vision Transformer (TSViT). Experimental results demonstrate significant improvements over state-of-the-art architectures with both convolutional and self-attention components.
Abstract:Transfer learning allows for resource-efficient geographic transfer of pre-trained field delineation models. However, the scarcity of labeled data for complex and dynamic smallholder landscapes, particularly in Sub-Saharan Africa, remains a major bottleneck for large-area field delineation. This study explores opportunities of using sparse field delineation pseudo labels for fine-tuning models across geographies and sensor characteristics. We build on a FracTAL ResUNet trained for crop field delineation in India (median field size of 0.24 ha) and use this pre-trained model to generate pseudo labels in Mozambique (median field size of 0.06 ha). We designed multiple pseudo label selection strategies and compared the quantities, area properties, seasonal distribution, and spatial agreement of the pseudo labels against human-annotated training labels (n = 1,512). We then used the human-annotated labels and the pseudo labels for model fine-tuning and compared predictions against human field annotations (n = 2,199). Our results indicate i) a good baseline performance of the pre-trained model in both field delineation and field size estimation, and ii) the added value of regional fine-tuning with performance improvements in nearly all experiments. Moreover, we found iii) substantial performance increases when using only pseudo labels (up to 77% of the IoU increases and 68% of the RMSE decreases obtained by human labels), and iv) additional performance increases when complementing human annotations with pseudo labels. Pseudo labels can be efficiently generated at scale and thus facilitate domain adaptation in label-scarce settings. The workflow presented here is a stepping stone for overcoming the persisting data gaps in heterogeneous smallholder agriculture of Sub-Saharan Africa, where labels are commonly scarce.