Abstract: This work aims to understand the impact of class imbalance on the performance of chest x-ray classifiers, in light of the standard evaluation practices adopted by researchers in terms of discrimination and calibration performance. First, we conducted a literature study to analyze common scientific practices and confirmed that: (1) even when dealing with highly imbalanced datasets, the community tends to use metrics that are dominated by the majority class; and (2) calibration studies are still rarely included for chest x-ray classifiers, despite their importance in the context of healthcare. Second, we performed a systematic experiment on two major chest x-ray datasets to explore the behavior of several performance metrics under different class ratios, and showed that widely adopted metrics can conceal poor performance on the minority class. Finally, we propose the adoption of two alternative metrics, the precision-recall curve and the Balanced Brier score, which better reflect the performance of the system in such scenarios. Our results indicate that current evaluation practices adopted by the research community for chest x-ray classifiers may not reflect the performance of such systems for computer-aided diagnosis in real clinical scenarios, and suggest alternatives to improve this situation.
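
As a rough illustration of the effect summarized above, the following minimal sketch compares majority-dominated metrics with their imbalance-aware counterparts on a synthetic 5%-prevalence example. Everything in it is an assumption made for illustration: the toy scores are invented and are not results from this work, average precision is used as a single-number summary of the precision-recall curve, and `balanced_brier` is taken here to be the unweighted mean of the per-class Brier scores, which may differ from the exact definition adopted in the paper.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean precision at the rank of each positive (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def brier(labels, probs):
    """Standard Brier score: mean squared error of predicted probabilities."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def balanced_brier(labels, probs):
    """ASSUMPTION: unweighted mean of the Brier scores computed per class."""
    per_class = []
    for c in (0, 1):
        pairs = [(y, p) for y, p in zip(labels, probs) if y == c]
        per_class.append(brier([y for y, _ in pairs], [p for _, p in pairs]))
    return sum(per_class) / 2


# Synthetic data: 5 positives, 95 negatives (hypothetical values).
labels = [1] * 5 + [0] * 95

# Ranking example: 10 negatives outrank every positive, 85 fall below.
rank_scores = ([0.9, 0.8, 0.7, 0.6, 0.5]
               + [0.95 + 0.004 * k for k in range(10)]
               + [0.004 * k for k in range(85)])

# Calibration example: confident on negatives, far too low on positives.
cal_probs = [0.2] * 5 + [0.05] * 95

print(round(roc_auc(labels, rank_scores), 3))            # 0.895
print(round(average_precision(labels, rank_scores), 3))  # 0.221
print(round(brier(labels, cal_probs), 3))                # 0.034
print(round(balanced_brier(labels, cal_probs), 3))       # 0.321
```

In this toy setting the ROC AUC and the standard Brier score both look favorable because the 95 negatives dominate them, while the average precision and the per-class-averaged Brier score expose the weak ranking and calibration on the 5 positives.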