Abstract:This short study presents an opportunistic approach to a (more) reliable validation method for prediction uncertainty average calibration. Considering that variance-based calibration metrics (ZMS, NLL, RCE...) are quite sensitive to the presence of heavy tails in the uncertainty and error distributions, a shift is proposed to an interval-based metric, the Prediction Interval Coverage Probability (PICP). It is shown on a large ensemble of molecular property datasets that (1) sets of z-scores are well represented by Student's $t(\nu)$ distributions, $\nu$ being the number of degrees of freedom; (2) accurate estimation of 95 $\%$ prediction intervals can be obtained by the simple $2\sigma$ rule for $\nu>3$; and (3) the resulting PICPs are more quickly and reliably tested than variance-based calibration metrics. Overall, this method enables the testing of 20 $\%$ more datasets than ZMS testing. Conditional calibration is also assessed using the PICP approach.
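As an illustration of the interval-based approach described in this abstract, here is a minimal Python sketch (not the paper's code; the array names E for errors and uE for uncertainties are assumptions) that estimates the PICP with the $2\sigma$ rule and tests it against the 0.95 target with a binomial test:

```python
import numpy as np
from scipy import stats

def picp_2sigma(E, uE, target=0.95):
    """PICP for the 2*sigma prediction interval, with a binomial test against `target`."""
    E, uE = np.asarray(E), np.asarray(uE)
    hits = np.abs(E) <= 2.0 * uE              # error inside the +/- 2*uE interval?
    picp = hits.mean()
    pval = stats.binomtest(int(hits.sum()), n=hits.size, p=target).pvalue
    return picp, pval

# Synthetic calibrated example: errors drawn from N(0, uE)
rng = np.random.default_rng(0)
uE = rng.uniform(0.5, 2.0, size=1000)
E = rng.normal(0.0, uE)
print(picp_2sigma(E, uE))                     # PICP near 0.95, large p-value expected
```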
Abstract:Some popular Machine Learning Uncertainty Quantification (ML-UQ) calibration statistics do not have predefined reference values and are mostly used in comparative studies. In consequence, calibration is almost never validated and the diagnostic is left to the appreciation of the reader. Simulated reference values, based on synthetic calibrated datasets derived from actual uncertainties, have been proposed to palliate this problem. As the generative probability distribution for the simulation of synthetic errors is often not constrained, the sensitivity of simulated reference values to the choice of generative distribution might be problematic, casting doubt on the calibration diagnostic. This study explores various facets of this problem, and shows that, when the generative distribution is unknown, some statistics are excessively sensitive to the choice of generative distribution used for validation. This is the case, for instance, of the correlation coefficient between absolute errors and uncertainties (CC) and of the expected normalized calibration error (ENCE). A robust validation workflow to deal with simulated reference values is proposed.
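The simulation of reference values described above can be sketched as follows. This is an assumed, minimal implementation (the names cc_reference, uE, the lognormal uncertainty distribution and the $\nu=4$ Student's $t$ are my choices), showing how the CC reference value shifts with the generative distribution:

```python
import numpy as np

def cc_reference(uE, generator, n_mc=1000, seed=None):
    """Monte Carlo reference value of CC = corr(|E|, uE) for synthetic calibrated errors."""
    rng = np.random.default_rng(seed)
    uE = np.asarray(uE)
    vals = [np.corrcoef(np.abs(generator(uE, rng)), uE)[0, 1] for _ in range(n_mc)]
    return np.mean(vals)

def normal_gen(uE, rng):
    return rng.normal(0.0, uE)                       # Gaussian errors with std uE

def student_gen(uE, rng, nu=4):
    # Student's t errors rescaled so that their standard deviation equals uE
    return rng.standard_t(nu, size=uE.size) * uE / np.sqrt(nu / (nu - 2))

rng = np.random.default_rng(1)
uE = rng.lognormal(0.0, 0.5, size=500)               # heteroscedastic uncertainties
print(cc_reference(uE, normal_gen, seed=2), cc_reference(uE, student_gen, seed=2))
# The two reference values differ: the diagnostic depends on the generative distribution.
```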
Abstract:Average calibration of the uncertainties of machine learning regression tasks can be tested in two ways. One way is to estimate the calibration error (CE) as the difference between the mean squared error (MSE) and the mean variance (MV) or mean squared uncertainty. The alternative is to compare the mean squared z-scores or scaled errors (ZMS) to 1. Both approaches might lead to different conclusions, as illustrated on an ensemble of datasets from the recent machine learning uncertainty quantification literature. It is shown here that the CE is very sensitive to the distribution of uncertainties, and notably to the presence of outlying uncertainties, and that it cannot be used reliably for calibration testing. By contrast, the ZMS statistic does not present this sensitivity issue and offers the most reliable approach in this context. Implications for the validation of conditional calibration are discussed.
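A minimal sketch of the two average-calibration statistics compared in this abstract, assuming errors E and uncertainties uE are given as arrays (the variable names are mine):

```python
import numpy as np

def calibration_error(E, uE):
    """CE = MSE - MV: mean squared error minus mean squared uncertainty."""
    E, uE = np.asarray(E), np.asarray(uE)
    return np.mean(E**2) - np.mean(uE**2)

def zms(E, uE):
    """ZMS = < (E/uE)^2 >; equals 1 for a calibrated dataset."""
    E, uE = np.asarray(E), np.asarray(uE)
    return np.mean((E / uE)**2)

# For calibrated data, CE ~ 0 and ZMS ~ 1; a few outlying uncertainties can shift
# the MV term (and hence CE) strongly, while leaving the ZMS nearly unaffected.
```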
Abstract:Binwise Variance Scaling (BVS) has recently been proposed as a post hoc recalibration method for prediction uncertainties of machine learning regression problems that is capable of more efficient corrections than uniform variance (or temperature) scaling. The original version of BVS uses uncertainty-based binning, which aims to improve calibration conditionally on uncertainty, i.e. consistency. I explore here several adaptations of BVS, in particular with alternative loss functions and a binning scheme based on an input feature (X), in order to improve adaptivity, i.e. calibration conditional on X. The performance of BVS and its proposed variants is tested on a benchmark dataset for the prediction of atomization energies and compared to the results of isotonic regression.
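One possible reading of uncertainty-based BVS as described above is sketched below. This is an assumed implementation, not the original code: equal-count bins on the uncertainty, with a per-bin factor chosen so that the bin's mean squared z-score equals 1.

```python
import numpy as np

def bvs_factors(E_cal, uE_cal, n_bins=10):
    """Learn per-bin scaling factors on a calibration set (equal-count bins on uE)."""
    E_cal, uE_cal = np.asarray(E_cal), np.asarray(uE_cal)
    edges = np.quantile(uE_cal, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(uE_cal, edges[1:-1]), 0, n_bins - 1)
    s = np.ones(n_bins)
    for k in range(n_bins):
        m = idx == k
        if m.any():
            s[k] = np.sqrt(np.mean(E_cal[m]**2 / uE_cal[m]**2))  # sets bin ZMS to 1
    return edges, s

def bvs_apply(uE, edges, s):
    """Apply the learned per-bin factors to new uncertainties."""
    uE = np.asarray(uE)
    idx = np.clip(np.digitize(uE, edges[1:-1]), 0, len(s) - 1)
    return uE * s[idx]
```

The X-based variant mentioned in the abstract would use the same scheme with bins defined on an input feature instead of uE.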
Abstract:Reliable uncertainty quantification (UQ) in machine learning (ML) regression tasks is becoming the focus of many studies in materials and chemical science. It is now well understood that average calibration is insufficient, and most studies implement additional methods testing conditional calibration with respect to uncertainty, i.e. consistency. Consistency is assessed mostly by so-called reliability diagrams. There exists, however, another target beyond average calibration, namely conditional calibration with respect to input features, i.e. adaptivity. In practice, adaptivity is the main concern of the final users of an ML-UQ method, who seek reliable predictions and uncertainties for any point in feature space. This article aims to show that consistency and adaptivity are complementary validation targets, and that good consistency does not imply good adaptivity. Adapted validation methods are proposed and illustrated on a representative example.
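To make the distinction concrete, here is a minimal sketch (assumed variable names E, uE, X) of conditional-calibration checks using per-bin mean squared z-scores, binning either on the uncertainty (consistency) or on an input feature (adaptivity):

```python
import numpy as np

def binwise_zms(E, uE, by, n_bins=10):
    """Mean squared z-score per equal-count bin of the conditioning variable `by`."""
    E, uE, by = map(np.asarray, (E, uE, by))
    order = np.argsort(by)
    bins = np.array_split(order, n_bins)
    return np.array([np.mean((E[b] / uE[b])**2) for b in bins])

# consistency:  binwise_zms(E, uE, by=uE)  -> all values should be close to 1
# adaptivity:   binwise_zms(E, uE, by=X)   -> same target, conditioned on feature X
```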
Abstract:Post hoc recalibration of prediction uncertainties of machine learning regression problems by isotonic regression might present a problem for bin-based calibration error statistics (e.g. ENCE). Isotonic regression often produces stratified uncertainties, i.e. subsets of uncertainties with identical numerical values. Partitioning of the resulting data into equal-sized bins introduces an aleatoric component to the estimation of bin-based calibration statistics. The partitioning of stratified data into bins depends on the order of the data, which is typically an uncontrolled property of calibration test/validation sets. The tie-breaking method of the ordering algorithm used for binning might also introduce an aleatoric component. I show on an example how this might significantly affect the calibration diagnostics.
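The order dependence discussed above can be reproduced with a small sketch (assumptions, not the paper's code): with tied uncertainties, a stable sort breaks ties by the current data order, so reordering the same dataset changes the bin contents and an ENCE-like statistic computed on them.

```python
import numpy as np

rng = np.random.default_rng(0)
uE = np.repeat([0.5, 1.0, 2.0], 100)          # stratified uncertainties (many ties)
E = rng.normal(0.0, uE)                       # calibrated by construction

def ence(E, uE, n_bins=7):
    """ENCE-like statistic over equal-count bins of uE: mean of |RMV - RMSE| / RMV."""
    order = np.argsort(uE, kind="stable")     # stable sort: ties keep the current data order
    bins = np.array_split(order, n_bins)
    rmv  = np.array([np.sqrt(np.mean(uE[b]**2)) for b in bins])
    rmse = np.array([np.sqrt(np.mean(E[b]**2)) for b in bins])
    return np.mean(np.abs(rmv - rmse) / rmv)

perm = rng.permutation(uE.size)               # same dataset, different presentation order
print(ence(E, uE), ence(E[perm], uE[perm]))   # the diagnostic changes with the data order
```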
Abstract:The Expected Normalized Calibration Error (ENCE) is a popular calibration statistic used in Machine Learning to assess the quality of prediction uncertainties for regression problems. Estimation of the ENCE is based on the binning of calibration data. In this short note, I illustrate an annoying property of the ENCE, i.e. its proportionality to the square root of the number of bins for well calibrated or nearly calibrated datasets. A similar behavior affects the calibration error based on the variance of z-scores (ZVE), and in both cases this property is a consequence of the use of a Mean Absolute Deviation (MAD) statistic to estimate calibration errors. Hence, the question arises of which number of bins to choose for a reliable estimation of calibration error statistics. A solution is proposed to infer ENCE and ZVE values that do not depend on the number of bins for datasets assumed to be calibrated, simultaneously providing a statistical calibration test. It is also shown that the ZVE is less sensitive than the ENCE to outlying errors or uncertainties.
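The square-root dependence on the number of bins can be checked on a synthetic calibrated dataset with the following sketch (an assumed, minimal ENCE implementation; the variable names are mine):

```python
import numpy as np

def ence(E, uE, n_bins):
    """ENCE over equal-count bins sorted by uE: mean of |RMV - RMSE| / RMV."""
    order = np.argsort(uE)
    bins = np.array_split(order, n_bins)
    rmv  = np.array([np.sqrt(np.mean(uE[b]**2)) for b in bins])
    rmse = np.array([np.sqrt(np.mean(E[b]**2)) for b in bins])
    return np.mean(np.abs(rmv - rmse) / rmv)

rng = np.random.default_rng(0)
uE = rng.uniform(0.5, 2.0, size=10000)
E = rng.normal(0.0, uE)                      # calibrated by construction
for n_bins in (10, 40, 160):
    print(n_bins, ence(E, uE, n_bins))       # roughly doubles when n_bins quadruples
```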
Abstract:The practice of uncertainty quantification (UQ) validation, notably in machine learning for the physico-chemical sciences, rests on several graphical methods (scatter plots, calibration curves, reliability diagrams and confidence curves) which explore complementary aspects of calibration, without covering all the desirable ones. For instance, none of these methods deals with the reliability of UQ metrics across the range of input features (adaptivity). Based on the complementary concepts of consistency and adaptivity, the toolbox of common validation methods for variance- and interval-based UQ metrics is revisited with the aim of providing a better grasp of their capabilities. This study is conceived as an introduction to UQ validation, and all methods are derived from a few basic rules. The methods are illustrated and tested on synthetic datasets and representative examples extracted from the recent physico-chemical machine learning UQ literature.
Abstract:PURPOSE: To develop an automated algorithm allowing extraction of quantitative corneal transparency parameters from clinical spectral-domain OCT images. To establish a representative dataset of normative transparency values from healthy corneas. METHODS: SD-OCT images of 83 normal corneas (ages 22-50 years) from a standard clinical device (RTVue-XR Avanti, Optovue Inc.) were processed. A pre-processing procedure is applied first, including a derivative approach and a PCA-based correction mask, to eliminate common central artifacts (i.e., apex-centered column saturation artifact and posterior stromal artifact) and enable standardized analysis. The mean intensity stromal-depth profile is then extracted over a 6-mm-wide corneal area and analyzed according to our previously developed method deriving quantitative transparency parameters related to the physics of light propagation in tissues, notably tissular heterogeneity (Birge ratio; $B_r$), followed by the photon mean-free path ($l_s$) in homogeneous tissues (i.e., $B_r \sim 1$). RESULTS: After confirming stromal homogeneity ($B_r < 10$, IDR: 1.9-5.1), we measured a median $l_s$ of 570 $\mu$m (IDR: 270-2400 $\mu$m). Considering corneal thicknesses, this may be translated into a median fraction of transmitted (coherent) light $T_{coh(stroma)}$ of 51$\%$ (IDR: 22-83$\%$). No statistically significant correlation between transparency and age or thickness was found. CONCLUSIONS: Our algorithm provides robust and quantitative measurement of corneal transparency from standard clinical SD-OCT images. It yields lower transparency values than previously reported, which may be attributed to our method being exclusively sensitive to spatially coherent light. Excluding images with central artifacts wider than 300 $\mu$m also raises our median $T_{coh(stroma)}$ to 70$\%$ (IDR: 34-87$\%$).