Abstract:Machine learning (ML) classification models are increasingly being used in a wide range of applications where it is important that predictions are accompanied by uncertainties, including in climate and earth observation, medical diagnosis and bioaerosol monitoring. The output of an ML classification model is a type of categorical variable known as a nominal property in the International Vocabulary of Metrology (VIM). However, concepts related to uncertainty evaluation for nominal properties are not defined in the VIM, nor is such evaluation addressed by the Guide to the Expression of Uncertainty in Measurement (GUM). In this paper we propose a metrological conceptual uncertainty evaluation framework for ML classification, and illustrate its use in the context of two applications that exemplify the issues and have significant societal impact, namely, climate and earth observation and medical diagnosis. Our framework would enable an extension of the VIM and GUM to uncertainty for nominal properties, which would make both applicable to ML classification models.
Abstract:Land cover classification involves the production of land cover maps, which determine the type of land through remote sensing imagery. Over recent years, such classification is being performed by machine learning classification models, which can give highly accurate predictions on land cover per pixel using large quantities of input training data. However, such models do not currently take account of input measurement uncertainty, which is vital for traceability in metrology. In this work we propose a Bayesian classification framework using generative modelling to take account of input measurement uncertainty. We take the specific case of Bayesian quadratic discriminant analysis, and apply it to land cover datasets from Copernicus Sentinel-2 in 2020 and 2021. We benchmark the performance of the model against more popular classification models used in land cover maps such as random forests and neural networks. We find that such Bayesian models are more trustworthy, in the sense that they are more interpretable, explicitly model the input measurement uncertainty, and maintain predictive performance of class probability outputs across datasets of different years and sizes, whilst also being computationally efficient.