With model trustworthiness being crucial for sensitive real-world applications, practitioners are putting more and more focus on evaluating deep neural networks in terms of uncertainty calibration. Calibration errors are designed to quantify the reliability of probabilistic predictions but their estimators are usually biased and inconsistent. In this work, we introduce the framework of proper calibration errors, which relates every calibration error to a proper score and provides a respective upper bound with optimal estimation properties. This upper bound allows us to reliably estimate the calibration improvement of any injective recalibration method in an unbiased manner. We demonstrate that, in contrast to our approach, the most commonly used estimators are substantially biased with respect to the true improvement of recalibration methods.