Abstract: With the rise of electronic data, particularly Earth observation data, data-based geospatial modelling using machine learning (ML) has gained popularity in environmental research. Accurate geospatial predictions are vital for domain research based on ecosystem monitoring and quality assessment, and for policy-making and action planning aimed at the effective management of natural resources. The accuracy and computation speed of ML have generally proved sufficient for these tasks. However, many questions have yet to be addressed before results are precise and reproducible enough for further use in both research and practice. A better understanding of the ML concepts applicable to geospatial problems supports the development of data science tools that provide transparent information, which is crucial for decisions on global challenges such as biosphere degradation and climate change. This survey reviews common nuances in geospatial modelling, such as imbalanced data, spatial autocorrelation, prediction errors, model generalisation, domain specificity, and uncertainty estimation. We provide an overview of techniques and popular programming tools to overcome or account for these challenges. We also discuss prospects for geospatial Artificial Intelligence in environmental applications.
Abstract: In machine learning models, the estimation of errors is often complicated by distribution bias, particularly in spatial data such as those found in environmental studies. We introduce an approach based on the ideas of importance sampling to obtain an unbiased estimate of the target error. By taking into account the difference between the distribution on which the error is to be estimated and the distribution of the available data, our method reweights the error at each sample point and neutralizes the shift. Importance sampling and kernel density estimation are used for the reweighting. We validate the effectiveness of our approach using artificial data that resemble real-world spatial datasets. Our findings demonstrate the advantages of the proposed approach for estimating the target error, offering a solution to the distribution shift problem. The overall error of predictions dropped from 7% to 2%, and it decreases further for larger samples.
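The reweighting idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `importance_weighted_error`, the synthetic Gaussian data, the squared-distance error surface, and the use of the self-normalised form of importance sampling are all assumptions made for illustration. Densities of the available (source) and target locations are estimated with `scipy.stats.gaussian_kde`, and each observed error is weighted by the ratio of the two.

```python
import numpy as np
from scipy.stats import gaussian_kde


def importance_weighted_error(errors, x_source, x_target):
    """Self-normalised importance-sampling estimate of the mean target error.

    errors   : (n,) per-sample errors observed at the source locations
    x_source : (n, d) locations where the errors were measured
    x_target : (m, d) locations sampled from the target distribution
    (Hypothetical helper; names and signature are illustrative.)
    """
    errors = np.asarray(errors, dtype=float)
    xs = np.asarray(x_source, dtype=float).T  # gaussian_kde expects (d, n)
    xt = np.asarray(x_target, dtype=float).T

    p_source = gaussian_kde(xs)  # density of the available data
    p_target = gaussian_kde(xt)  # density of the region of interest

    # Weight = target density / source density, evaluated at source points.
    weights = p_target(xs) / np.maximum(p_source(xs), 1e-12)
    weights /= weights.sum()  # self-normalisation
    return float(np.sum(weights * errors))


rng = np.random.default_rng(0)

# Assumed synthetic setup: observations are shifted and spread out
# relative to the target region, which is a standard normal in 2-D.
x_source = rng.normal(loc=0.7, scale=1.3, size=(2000, 2))
x_target = rng.normal(loc=0.0, scale=1.0, size=(2000, 2))

# Assumed error surface: squared distance from the origin.
# Its exact mean under the target distribution is 2.
errors = np.sum(x_source**2, axis=1)

naive = float(errors.mean())  # ignores the distribution shift
weighted = importance_weighted_error(errors, x_source, x_target)
print(f"naive: {naive:.2f}, reweighted: {weighted:.2f}, target: 2.00")
```

In this toy setting the plain average overstates the target error because the source points lie farther from the origin, while the reweighted average moves toward the target value. Kernel density estimates degrade in higher dimensions and near domain boundaries, so the weights should be inspected (for example, via their effective sample size) before the estimate is trusted.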