Abstract: Probability forecasts for binary outcomes, often referred to as probabilistic classifiers or confidence scores, are ubiquitous in science and society, and methods for evaluating and comparing them are in great demand. We propose and study a triptych of diagnostic graphics that focus on distinct and complementary aspects of forecast performance: The reliability diagram addresses calibration, the receiver operating characteristic (ROC) curve diagnoses discrimination ability, and the Murphy diagram visualizes overall predictive performance and value. A Murphy curve shows a forecast's mean elementary scores, including the widely used misclassification rate, and the area under a Murphy curve equals the mean Brier score. For a calibrated forecast, the reliability curve lies on the diagonal, and for competing calibrated forecasts, the ROC and Murphy curves share the same number of crossing points. We invoke the recently developed CORP (Consistent, Optimally binned, Reproducible, and Pool-Adjacent-Violators (PAV) algorithm based) approach to craft reliability diagrams and decompose a mean score into miscalibration (MCB), discrimination (DSC), and uncertainty (UNC) components. Plots of the DSC measure of discrimination ability versus the calibration metric MCB visualize classifier performance across multiple competitors. The proposed tools are illustrated in empirical examples from astrophysics, economics, and social science.
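A minimal sketch of the MCB/DSC/UNC decomposition for the mean Brier score, assuming Python with numpy and scikit-learn (the helper name corp_decomposition is mine): the PAV-(re)calibrated probabilities and the constant climatological forecast yield components that satisfy mean score = MCB - DSC + UNC by construction.

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def corp_decomposition(x, y):
        """CORP decomposition of the mean Brier score: score = MCB - DSC + UNC."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        # PAV-(re)calibrated probabilities: isotonic regression of y on x
        x_hat = IsotonicRegression(y_min=0.0, y_max=1.0).fit_transform(x, y)
        def brier(p):
            return float(np.mean((p - y) ** 2))
        # scores of the forecast, its recalibrated version, and the climatology
        s_fc, s_cal, s_ref = brier(x), brier(x_hat), brier(np.mean(y))
        return {"MCB": s_fc - s_cal, "DSC": s_ref - s_cal, "UNC": s_ref}

Plotting the DSC value against the MCB value for each classifier then gives the discrimination-versus-calibration display described above.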
Abstract: A probability forecast or probabilistic classifier is reliable or calibrated if the predicted probabilities are matched by ex post observed frequencies, as examined visually in reliability diagrams. The classical binning and counting approach to plotting reliability diagrams has been hampered by a lack of stability under unavoidable, ad hoc implementation decisions. Here we introduce the CORP approach, which generates provably statistically Consistent, Optimally binned, and Reproducible reliability diagrams in an automated way. CORP is based on non-parametric isotonic regression and implemented via the Pool-Adjacent-Violators (PAV) algorithm: essentially, the CORP reliability diagram shows the graph of the PAV-(re)calibrated forecast probabilities. The CORP approach allows for uncertainty quantification via either resampling techniques or asymptotic theory, furnishes a new numerical measure of miscalibration, and provides a CORP-based Brier score decomposition that generalizes to any proper scoring rule. We anticipate that judicious uses of the PAV algorithm will yield improved tools for diagnostics and inference for a very wide range of statistical and machine learning methods.
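As a minimal sketch (Python with numpy and scikit-learn; the function names and the resampling details are my assumptions), the CORP reliability curve is simply the graph of the PAV fit, and an uncertainty band can be obtained by resampling outcomes under the hypothesis of calibration:

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def corp_curve(x, y):
        # CORP reliability curve: PAV-(re)calibrated probabilities vs forecasts
        x, y = np.asarray(x, float), np.asarray(y, float)
        order = np.argsort(x)
        x_sorted = x[order]
        cal = IsotonicRegression(y_min=0.0, y_max=1.0).fit_transform(x_sorted, y[order])
        return x_sorted, cal

    def consistency_band(x, n_boot=999, level=0.9, seed=1):
        # resample y* ~ Bernoulli(x) under the hypothesis of calibration and
        # refit PAV; pointwise quantiles of the refits form the band
        rng = np.random.default_rng(seed)
        x_sorted = np.sort(np.asarray(x, float))
        fits = [IsotonicRegression(y_min=0.0, y_max=1.0)
                .fit_transform(x_sorted, rng.binomial(1, x_sorted))
                for _ in range(n_boot)]
        lo, hi = np.quantile(fits, [(1 - level) / 2, (1 + level) / 2], axis=0)
        return x_sorted, lo, hi

A calibrated forecast should produce a curve that stays close to the diagonal and within the band.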
Abstract: Throughout science and technology, receiver operating characteristic (ROC) curves and associated area under the curve (AUC) measures constitute powerful tools for assessing the predictive abilities of features, markers and tests in binary classification problems. Despite its immense popularity, ROC analysis has been subject to a fundamental restriction, in that it applies to dichotomous (yes or no) outcomes only. We introduce ROC movies and universal ROC (UROC) curves that apply to any ordinal or real-valued outcome, along with a new, asymmetric coefficient of predictive ability (CPA) measure. CPA equals the area under the UROC curve and admits appealing interpretations in terms of probabilities and rank-based covariances. ROC movies, UROC curves and CPA nest and generalize the classical ROC curve and AUC, and are bound to supersede them in a wealth of applications.
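A hedged sketch in Python (numpy only): classical AUC as the probability of concordance, and a CPA-style measure as a weighted average of the AUC values of the binary problems obtained by thresholding the outcome. The specific weighting used here (product of class sizes at each threshold) is an assumption for illustration, not necessarily the exact construction of the paper; for binary outcomes there is a single threshold, so the measure reduces to the classical AUC, consistent with the nesting property above.

    import numpy as np

    def auc(x, b):
        # probability of concordance, with ties counted one half (classical AUC)
        pos, neg = x[b == 1], x[b == 0]
        diff = pos[:, None] - neg[None, :]
        return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

    def cpa(x, y):
        # weighted average of AUCs over the induced binary problems 1{y >= z};
        # the class-size-product weights are an illustrative assumption
        x, y = np.asarray(x, float), np.asarray(y, float)
        aucs, weights = [], []
        for z in np.unique(y)[1:]:
            b = (y >= z).astype(int)
            aucs.append(auc(x, b))
            weights.append(b.sum() * (1 - b).sum())
        return float(np.average(aucs, weights=weights))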
Abstract: In the practice of point prediction, it is desirable that forecasters receive a directive in the form of a statistical functional, such as the mean or a quantile of the predictive distribution. When evaluating and comparing competing forecasts, it is then critical that the scoring function used for these purposes be consistent for the functional at hand, in the sense that the expected score is minimized when following the directive. We show that any scoring function that is consistent for a quantile or an expectile functional, respectively, can be represented as a mixture of extremal scoring functions that form a linearly parameterized family. Scoring functions for the mean value and probability forecasts of binary events constitute important examples. The quantile and expectile functionals, along with the respective extremal scoring functions, admit appealing economic interpretations in terms of thresholds in decision making. The Choquet-type mixture representations give rise to simple checks of whether a forecast dominates another, in the sense that it is preferable under any consistent scoring function. In empirical settings, it suffices to compare the average scores for only a finite number of extremal elements. Plots of the average scores with respect to the extremal scoring functions, which we call Murphy diagrams, permit detailed comparisons of the relative merits of competing forecasts.
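For probability forecasts of binary events, the extremal score at threshold theta is a cost-weighted misclassification loss, and the Murphy diagram traces its average over a grid of thresholds. A minimal sketch (Python with numpy and matplotlib; the scaling convention for the elementary scores varies across references, so take the constants as an assumption):

    import numpy as np
    import matplotlib.pyplot as plt

    def elementary_score(x, y, theta):
        # extremal score for probability forecasts of binary events:
        # cost-weighted misclassification at threshold theta
        x, y = np.asarray(x, float), np.asarray(y, float)
        return np.mean((1 - theta) * (y == 1) * (x <= theta)
                       + theta * (y == 0) * (x > theta))

    def murphy_diagram(forecasts, y, labels):
        thetas = np.linspace(0.01, 0.99, 99)
        for x, lab in zip(forecasts, labels):
            plt.plot(thetas, [elementary_score(x, y, t) for t in thetas], label=lab)
        plt.xlabel("threshold theta")
        plt.ylabel("mean elementary score")
        plt.legend()
        plt.show()

Under this convention, integrating twice the elementary score over theta in (0, 1) recovers the Brier score, and the curve at theta = 1/2 is proportional to the misclassification rate.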
Abstract: Bob predicts a future observation based on a sample of size one. Alice can draw a sample of any size before issuing her prediction. How much better can she do than Bob? Perhaps surprisingly, under a large class of loss functions, which we refer to as the Cover-Hart family, the best Alice can do is to halve Bob's risk. In this sense, half the information in an infinite sample is contained in a sample of size one. The Cover-Hart family is a convex cone that includes metrics and negative definite functions, subject to slight regularity conditions. These results may help explain the small relative differences in empirical performance measures in applied classification and forecasting problems, as well as the success of reasoning and learning by analogy in general, and of nearest neighbor techniques in particular.
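A small Monte Carlo illustration under zero-one loss, which, being the discrete metric on labels, belongs to the Cover-Hart family described above (Python with numpy; the setup is mine): Bob predicts a binary outcome by reporting the label of a single draw, while Alice uses the Bayes predictor available from an infinite sample. Bob's risk is 2*eta*(1-eta) against Alice's min(eta, 1-eta), so the ratio never exceeds 2 and approaches it for extreme class probabilities.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10**6
    for eta in (0.05, 0.2, 0.4):
        y_new = rng.random(n) < eta    # outcomes to be predicted
        bob = rng.random(n) < eta      # Bob's single observation doubles as his prediction
        alice = eta > 0.5              # Alice's infinite-sample (Bayes) prediction
        r_bob = np.mean(bob != y_new)      # ~ 2 * eta * (1 - eta)
        r_alice = np.mean(alice != y_new)  # ~ min(eta, 1 - eta)
        print(f"eta={eta}: Bob {r_bob:.3f}, Alice {r_alice:.3f}, "
              f"ratio {r_bob / r_alice:.2f}")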