Abstract: Convolutional neural networks (CNNs) and vision transformers (ViTs) have become essential in computer vision for local and global feature extraction, respectively. However, existing methods that aggregate these architectures are often inefficient. To address this, the CNN-Transformer Aggregation Network (CTA-Net) was developed. CTA-Net combines CNNs and ViTs, with transformers capturing long-range dependencies and CNNs extracting localized features. This integration enables efficient processing of both detailed local information and broader contextual information. CTA-Net introduces the Light Weight Multi-Scale Feature Fusion Multi-Head Self-Attention (LMF-MHSA) module for effective multi-scale feature integration with reduced parameters. Additionally, the Reverse Reconstruction CNN-Variants (RRCV) module enhances the embedding of CNNs within the transformer architecture. Extensive experiments on small-scale datasets (fewer than 100,000 samples) show that CTA-Net achieves superior performance (Top-1 accuracy 86.76\%) with fewer parameters (20.32M) and greater efficiency (2.83B FLOPs), making it an efficient and lightweight solution for visual tasks on small-scale datasets.
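The division of labor described in the abstract above (convolution for localized features, self-attention for long-range dependencies) can be illustrated with a toy sketch. This is not CTA-Net's LMF-MHSA or RRCV module, whose details the abstract does not give. It assumes scalar tokens, a fixed smoothing kernel of odd length, identity query/key/value projections, and a simple additive merge of the two branches; all function names are hypothetical.

```python
import math


def local_conv1d(tokens, kernel):
    """Local branch: 1-D convolution with zero padding (output length == input length).

    Assumes an odd-length kernel so that it can be centered on each token.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(tokens)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(tokens):  # positions outside the sequence count as zero
                acc += w * tokens[j]
        out.append(acc)
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(tokens):
    """Global branch: single-head self-attention on scalar tokens.

    Illustrative assumption: query, key, and value are the token itself
    (identity projections), so the scaling factor 1/sqrt(d) equals 1.
    """
    out = []
    for q in tokens:
        weights = softmax([q * k for k in tokens])  # every token attends to all tokens
        out.append(sum(w * v for w, v in zip(weights, tokens)))
    return out


def hybrid_block(tokens, kernel=(0.25, 0.5, 0.25)):
    """Add the local and global branches element-wise (a hypothetical merge rule)."""
    local = local_conv1d(tokens, kernel)
    glob = global_attention(tokens)
    return [l + g for l, g in zip(local, glob)]
```

Calling `hybrid_block([1.0, 2.0, 3.0, 4.0])` returns one value per input token. Each convolution output depends only on a token's immediate neighbors, while each attention output depends on the whole sequence.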
Abstract: More accurate estimation of land surface temperature (LST) that is spatio-temporally and physically consistent has been a main interest in Earth system research. Physics-driven mechanism models and data-driven machine learning (ML) models are the two major paradigms for gapless LST estimation, each with its own advantages and disadvantages. In this paper, a physics-constrained ML model, which combines the strengths of the mechanism model and the ML model, is proposed to generate gapless LST with physical meaning and high accuracy. The hybrid model employs ML as the primary architecture, into which physical constraints are incorporated through the input variables to enhance the interpretability and extrapolation ability of the model. Specifically, the light gradient-boosting machine (LGBM) model, which uses only remote sensing data as input, serves as the pure ML model. Physical constraints (PCs) are coupled by further incorporating key Community Land Model (CLM) forcing data (cause) and CLM simulation data (effect) as inputs into the LGBM model. This integration forms the PC-LGBM model, which incorporates the surface energy balance (SEB) constraints underlying the data in CLM-LST modeling within a biophysical framework. Compared with a pure physical method and pure ML methods, the PC-LGBM model improves the prediction accuracy and physical interpretability of LST. It also demonstrates good extrapolation ability in its responses to extreme weather cases, suggesting that the PC-LGBM model not only learns empirically from data but is also rationally derived from theory. The proposed method represents an innovative way to map accurate and physically interpretable gapless LST, and could provide insights to accelerate knowledge discovery in land surface processes and data mining in geographical parameter estimation.
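For orientation, the surface energy balance (SEB) mentioned in the abstract above is commonly written in the textbook form below. This is a generic formulation, assumed here for illustration; the exact terms and parameterizations used in CLM may differ.

```latex
% Textbook surface energy balance; not necessarily the exact CLM formulation.
\begin{align}
R_n &= H + \lambda E + G, \\
R_n &= (1 - \alpha)\, S_{\downarrow} + \varepsilon L_{\downarrow} - \varepsilon \sigma T_s^{4}.
\end{align}
```

Here $R_n$ is net radiation, $H$ the sensible heat flux, $\lambda E$ the latent heat flux, $G$ the ground heat flux, $\alpha$ the surface albedo, $S_{\downarrow}$ and $L_{\downarrow}$ the downward shortwave and longwave radiation, $\varepsilon$ the surface emissivity, $\sigma$ the Stefan-Boltzmann constant, and $T_s$ the surface temperature (LST). In this form the forcing terms ($S_{\downarrow}$, $L_{\downarrow}$) act as causes, and $T_s$ is the effect that closes the balance.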
Abstract: Land surface temperature (LST) is a key parameter when monitoring land surface processes. However, cloud contamination and the tradeoff between spatial and temporal resolution greatly impede access to high-quality thermal infrared (TIR) remote sensing data. Despite the massive efforts made to resolve these problems, it is still difficult to generate LST estimates that are both spatially complete and of high spatio-temporal resolution. Land surface models (LSMs) can be used to simulate gapless LST with a high temporal resolution, but this usually comes with a low spatial resolution. In this paper, we present an integrated temperature fusion framework for satellite-observed and LSM-simulated LST data to map gapless LST at a 60-m spatial resolution and half-hourly temporal resolution. The global linear model (GloLM) and the diurnal land surface temperature cycle (DTC) model are applied as preprocessing steps for sensor normalization and temporal normalization, respectively, between the different LST data. The Landsat LST, Moderate Resolution Imaging Spectroradiometer (MODIS) LST, and Community Land Model Version 5.0 (CLM 5.0)-simulated LST are then fused using a filter-based spatio-temporal integrated fusion model. Evaluations were carried out in an urban-dominated region (the city of Wuhan in China) and a natural-dominated region (the Heihe River Basin in China), in terms of accuracy, spatial variability, and diurnal temporal dynamics. Results indicate that the fused LST is highly consistent with actual Landsat LST data (in situ LST measurements), with a Pearson correlation coefficient of 0.94 (0.97-0.99), a mean absolute error of 0.71-0.98 K (0.82-3.17 K), and a root-mean-square error of 0.97-1.26 K (1.09-3.97 K).
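The three accuracy measures reported in the abstract above (Pearson correlation coefficient, mean absolute error, and root-mean-square error) follow their standard definitions, sketched below for two equal-length series of temperatures in kelvin. The function names are hypothetical, and the sample values in the usage note are made up for illustration; they are not data from the paper.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def mae(x, y):
    """Mean absolute error, in the same unit as the inputs (here K)."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def rmse(x, y):
    """Root-mean-square error, in the same unit as the inputs (here K)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Made-up illustrative values, not taken from the paper.
fused_lst = [300.0, 301.0, 302.0]
reference_lst = [300.5, 301.5, 302.5]
```

With these sample series, the fused values track the reference with a constant 0.5 K offset, so the correlation is essentially 1 while both error measures equal 0.5 K. This shows why correlation alone does not capture bias and is reported alongside MAE and RMSE.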