Machine Learning (ML) is increasingly used across many disciplines, with impressive results reported in many domain areas. However, recent studies suggest that the published performance of ML models is often overoptimistic and does not reflect the accuracy these models would achieve if deployed. These validity concerns are underscored by findings of an inverse relationship between sample size and reported accuracy in published ML models across several domains. This contradicts the theory of learning curves in ML, which predicts that accuracy improves or stays the same as sample size increases. This paper investigates the factors contributing to overoptimistic accuracy reports in ML-based science, focusing on data leakage and publication bias. We introduce a novel stochastic model for observed accuracy that integrates parametric learning curves with these two biases, and we construct an estimator based on this model that corrects for the biases in observed data. Theoretical and empirical results demonstrate that this framework can recover the underlying learning curve that gives rise to the observed overoptimistic results, thereby providing more realistic assessments of ML performance from a collection of published results. We apply the model to various meta-analyses in the digital health literature, including neuroimaging-based and speech-based classification of several neurological conditions. Our results indicate prevalent overoptimism across these fields, and we estimate the inherent limits of ML-based prediction in each domain.
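For intuition, the following is a minimal sketch of the kind of generative process such a stochastic model might describe, assuming a power-law learning curve, additive leakage inflation, binomial test-set noise, and a simple accuracy threshold as the publication-bias mechanism. All functional forms, parameter values, and names here are illustrative assumptions, not the paper's actual specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def learning_curve(n, a=0.85, b=1.2, c=0.5):
    """Illustrative power-law learning curve: true accuracy as a function of
    training sample size n. The asymptote a, scale b, and decay c are assumed values."""
    return a - b * n ** (-c)

def simulate_published_accuracy(n, leakage_boost=0.05, publish_threshold=0.7):
    """Draw one observed accuracy under the assumed biases:
    leakage inflates the effective accuracy, binomial noise models test-set
    variability (larger at small n), and results below a threshold go unpublished."""
    p = min(learning_curve(n) + leakage_boost, 1.0)   # leakage inflation
    acc = rng.binomial(n, p) / n                      # sampling noise shrinks with n
    return acc if acc >= publish_threshold else None  # publication bias censors low results

# Simulate a literature of studies with varying sample sizes; censoring plus
# sampling noise inflates published accuracy most strongly at small n.
sample_sizes = rng.integers(20, 2000, size=500)
published = [(n, acc) for n in sample_sizes
             if (acc := simulate_published_accuracy(n)) is not None]
print(f"{len(published)} of {len(sample_sizes)} simulated studies would be published")
```

Recovering the learning-curve parameters from such censored, inflated observations is the estimation problem the paper's bias-correcting estimator is designed to address.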